issue2551328/issue2551264 unneeded next link and total_count incorrect

rouilj · rouilj · commit b6dc3fff86e5 · 2024-04-01T09:57:16.000-04:00
Fix: issue2551328 - REST results show next link if number of
     results is a multiple of page size. (Found by members of
     team 3 in the UMass-Boston CS682 Spring 2024 class.)

     issue2551264 - REST X-Total-Count header and @total_size
     count incorrect when paginated

These issues arose because we retrieved the exact number of rows
from the database as requested by the user using the @page_size
parameter. With this changeset, we retrieve up to 10 million + 1
rows from the database. If the total number of rows exceeds 10
million, we set the total_count indicators to -1 as an invalid
size.  (The max number of requested rows (default 10 million +1)
can be modified by the admin through interfaces.py.)

By retrieving more data than necessary, we can calculate the
total count by adding @page_index*@page_size to the number of
rows returned by the query.

Furthermore, since we return more than @page_size rows, we can
determine the existence of a row at @page_size+1 and use that
information to determine if a next link should be
provided. Previously, a next link was returned if @page_size rows
were retrieved.

This change does not guarantee that the user will get @page_size
rows returned. Access policy filtering occurs after the rows are
returned, and discards rows inaccessible by the user.

Using the current @page_index/@page_size it would be difficult to
have the roundup code refetch data and make sure that a full
@page_size set of rows is returned. E.G. @page_size=100 and 5 of
them are dropped due to access restrictions. We then fetch 10
items and add items 1-4 and 6 (5 is inaccessible). There is no
way to calculate the new database offset at:
@page_index*@page_size + 6 from the URL. We would need to add an
@page_offset=6 or something.

This could work since the client isn't adding 1 to @page_index to
get the next page. Thanks to HATEOAS, the client just uses the
'next' url. But I am not going to cross that bridge without a
concrete use case.

This can also be handled client side by merging a short response
with the next response and re-paginating client side.

Also added extra index markers to the docs to highlight use of
interfaces.py.
diff --git a/CHANGES.txt b/CHANGES.txt
@@ -105,6 +105,14 @@ Fixed:
   which clobbers the info about the template issue. As a stop-gap set
   the line number to -1 so the original traceback can be seen. This
   could be a bug in ZopeTAL. (John Rouillard)
+- issue2551328 - REST results show next link if number of results is a
+  multiple of page size. There should be no next link. (Found by Patel
+  Malav and ... of the UMass-Boston CS682 Spring 2024 class; fix John
+  Rouillard)
+- issue2551264 - REST X-Total-Count header and @total_size count
+  incorrect when paginated - correct values are now returned.
+  (John Rouillard)
+
 
 Features:
 
diff --git a/doc/admin_guide.txt b/doc/admin_guide.txt
@@ -259,6 +259,8 @@ to compress responses on the fly. The python standard library includes
 gzip support. For brotli or zstd you will need to install packages. See
 the `installation documentation`_ for details.
 
+.. index:: single: interfaces.py; configuring http compression
+
 Some assets will not be compressed on the fly. Assets with mime types
 of "image/png" or "image/jpeg" will not be compressed. You
 can add mime types to the list by using ``interfaces.py`` as discussed
@@ -320,6 +322,31 @@ get the gzip version and not a brotli compressed version. This
 mechanism allows the admin to allow use of brotli and zstd for
 dynamic content, but not for static content.
 
+.. index:: single: interfaces.py; setting REST maximum result limit
+
+Configuring REST Maximum Result Limit
+=====================================
+
+To prevent denial of service (DOS) and limit user wait time for an
+unbounded request, the REST endpoint has a maximum limit on the number
+of rows that can be returned. By default, this is set to 10 million.
+This setting applies to all users of the REST interface. If you want
+to change this limit, you can add the following code to the
+``interfaces.py`` file in your tracker::
+
+    # change max response rows
+    from roundup.rest import RestfulInstance
+    RestfulInstance.max_response_row_size = 26
+
+This code will set the maximum number of rows to 25 (one less than the
+value). Note that this setting is rarely used and is not available in
+the tracker's ``config.ini`` file. Setting it through this mechanism
+allows you to enter a string or number that may break Roundup, such as
+"asdf" or 0. In general, it is recommended to keep the limit at its
+default value. However, this option is available for cases when a
+request requires more than 10 million rows and pagination using
+``@page_index`` and ``@page_size=9999999`` is not possible.
+
 Adding a Web Content Security Policy (CSP)
 ==========================================
 
diff --git a/doc/customizing.txt b/doc/customizing.txt
@@ -2176,6 +2176,8 @@ string in the nosy reaction function.
 Changing How the Core Code Works
 --------------------------------
 
+.. index:: single: interfaces.py; cache-control headers
+
 Changing Cache-Control Headers
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -2202,6 +2204,7 @@ path in the tracker's `config.ini`. In the example above:
 
 Note that a file name match overrides the mime type settings.
 
+.. index:: single: interfaces.py; password complexity checking
 
 Implement Password Complexity Checking
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -2241,6 +2244,8 @@ it replaces the setPassword method in the Password class. The new
 version validates that the password is sufficiently complex. Then it
 passes off the setting of password to the original method.
 
+.. index:: single: interfaces.py; interpreting time interval values
+
 Enhance Time Intervals Forms
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/doc/reference.txt b/doc/reference.txt
@@ -1556,6 +1556,8 @@ flexible extension point mechanism.
 .. _interfaces.py:
 .. _modifying the core of Roundup:
 
+.. index:: single: interfaces.py; hooking into the roundup core
+
 interfaces.py - hooking into the core of Roundup
 ================================================
 
diff --git a/doc/rest.txt b/doc/rest.txt
@@ -491,11 +491,18 @@ Details are in the sections below.
 /data/\ *class* Collection
 --------------------------
 
-When performing the ``GET`` method on a class (e.g. ``/data/issue``),
-the ``data`` object includes the number of items available in
-``@total_size``. A a ``collection`` list follows which contains the id
-and link to the respective item.  For example a get on
-https://.../rest/data/issue returns::
+When you use the ``GET`` method on a class (like ``/data/issue``), the
+``data`` will include the number of available items in
+``@total_size``. If the size exceeds the administrative limit (which
+is 10 million by default), ``@total_size`` will be set to ``-1``. To
+navigate to the last page of results, you can use the ``next`` links
+or increment ``@page_index`` until the result does not include a
+``next`` ``@link`` or ``@total_size`` is not ``-1``. The value of the
+HTTP header ``X-Count-Total`` is the same as ``@total_size``.
+
+A ``collection`` list contains the id and link to the
+respective item.  For example a get on https://.../rest/data/issue
+returns::
 
     {
         "data": {
@@ -517,11 +524,17 @@ https://.../rest/data/issue returns::
 Collection endpoints support a number of features as seen in the next
 sections.
 
-A server may implement a default maximum number of items in the
-collection.  This can be used to prevent denial of service (DOS).  As
-a result all clients must be programmed to expect pagination
-decorations in the response. See the section on pagination below for
-details.
+Having an empty ``collection`` does not mean next next link will not
+return more data. The row limit is applied when the query is made to
+the database. The result set is then filtered, removing rows that the
+user does not have permission to access. So it is possible to have no
+data items on a page because the user does not have access to them. If
+you use ``@page_size`` near the administrative limit, you may receive
+fewer rows than requested. However, this does not mean you are out of
+data.
+
+All clients must be programmed to expect pagination decorations in the
+response. See the section on pagination below for details.
 
 Searching
 ~~~~~~~~~
@@ -591,7 +604,9 @@ searching keyword class not issue class) will return matches for
 ``Foo``, ``foobar``, ``foo taz`` etc.
 
 In all cases the field ``@total_size`` is reported which is the total
-number of items available if you were to retrieve all of them.
+number of items available if you were to retrieve all of them. See
+more details in the parent section about ``@total_size`` and when it
+can return ``-1``.
 
 Other data types: Date, Interval, Integer, Number need examples and may
 need work to allow range searches. Full text search (e.g. over the
@@ -1055,7 +1070,7 @@ Does not support PUT, DELETE or PATCH.
 /data/\ *class*/\ *id* item
 ---------------------------
 
-When performing the ``GET`` method on an item
+When you use the ``GET`` method on an item
 (e.g. ``/data/issue/42``), a ``link`` attribute contains the link to
 the item, ``id`` contains the id, ``type`` contains the class name
 (e.g. ``issue`` in the example) and an ``etag`` property can be used
diff --git a/roundup/rest.py b/roundup/rest.py
@@ -440,6 +440,10 @@ class RestfulInstance(object):
 
     api_version = None
 
+    # allow 10M row response - can change using interfaces.py
+    # limit is 1 less than this size.
+    max_response_row_size = 10000001
+
     def __init__(self, client, db):
         self.client = client
         self.db = db
@@ -795,7 +799,7 @@ def get_collection(self, class_name, input):
         exact_props = {}
         page = {
             'size': None,
-            'index': 1   # setting just size starts at page 1
+            'index': 1,   # setting just size starts at page 1
         }
         verbose = 1
         display_props = set()
@@ -910,12 +914,26 @@ def get_collection(self, class_name, input):
             l.append(sort)
         if exact_props:
             kw['exact_match_spec'] = exact_props
-        if page['size'] is not None and page['size'] > 0:
-            kw['limit'] = page['size']
+        if page['size'] is None:
+            kw['limit'] = self.max_response_row_size
+        elif page['size'] > 0:
+            if page['size'] >= self.max_response_row_size:
+                raise UsageError(_(
+                    "Page size %(page_size)s must be less than admin "
+                    "limit on query result size: %(max_size)s.") % {
+                        "page_size": page['size'],
+                        "max_size": self.max_response_row_size,
+                    })
+            kw['limit'] = self.max_response_row_size
             if page['index'] is not None and page['index'] > 1:
                 kw['offset'] = (page['index'] - 1) * page['size']
         obj_list = class_obj.filter(None, *l, **kw)
 
+        # Have we hit the max number of returned rows?
+        # If so there may be more data that the client
+        # has to explicitly page through using offset/@page_index.
+        overflow = len(obj_list) == self.max_response_row_size
+
         # Note: We don't sort explicitly in python. The filter implementation
         # of the DB already sorts by ID if no sort option was given.
 
@@ -930,7 +948,7 @@ def get_collection(self, class_name, input):
         for item_id in obj_list:
             r = {}
             if self.db.security.hasPermission(
-                'View', uid, class_name, itemid=item_id, property='id'
+                'View', uid, class_name, itemid=item_id, property='id',
             ):
                 r = {'id': item_id, 'link': class_path + item_id}
             if display_props:
@@ -942,14 +960,30 @@ def get_collection(self, class_name, input):
 
         result_len = len(result['collection'])
 
+        if not overflow:  # noqa: SIM108  - no nested ternary
+            # add back the number of items in the offset.
+            total_len = kw['offset'] + result_len if 'offset' in kw \
+                else result_len
+        else:
+            # we have hit the max number of rows configured to be
+            # returned.  We hae no idea how many rows can match. We
+            # could use 0 as the sentinel, but a filter could match 0
+            # rows.  So return -1 indicating we exceeded the result
+            # max size on this query.
+            total_len = -1
+
+        # truncate result['collection'] to page size
+        if page['size'] is not None and page['size'] > 0:
+            result['collection'] = result['collection'][:page['size']]
+
         # pagination - page_index from 1...N
         if page['size'] is not None and page['size'] > 0:
             result['@links'] = {}
             for rel in ('next', 'prev', 'self'):
                 if rel == 'next':
                     # if current index includes all data, continue
-                    if page['size'] > result_len: continue  # noqa: E701
-                    index = page['index']+1
+                    if page['size'] >= result_len: continue  # noqa: E701
+                    index = page['index'] + 1
                 if rel == 'prev':
                     if page['index'] <= 1: continue  # noqa: E701
                     index = page['index'] - 1
@@ -964,8 +998,8 @@ def get_collection(self, class_name, input):
                                      for field in input.value
                                      if field.name != "@page_index"])})
 
-        result['@total_size'] = result_len
-        self.client.setHeader("X-Count-Total", str(result_len))
+        result['@total_size'] = total_len
+        self.client.setHeader("X-Count-Total", str(total_len))
         self.client.setHeader("Allow", "OPTIONS, GET, POST")
         return 200, result
 
diff --git a/test/rest_common.py b/test/rest_common.py