
Commit e00d0dd

postgresql native-fts; more indexer tests
1) Make postgresql native-fts actually work.
2) Add simple stopword filtering to the sqlite native-fts indexer.
3) Add more tests for indexer_common get_indexer.

Details:

1) roundup/backends/indexer_postgresql_fts.py: ignore the ValueError
raised if we try to index a string with a null character in it. This
could happen due to an incorrect text/* mime type on a file that has
nulls in it. Replace the ValueError raised by postgresql with a
customized IndexerQueryError if a search string has a null in it.

roundup/backends/rdbms_common.py: make postgresql native-fts work.
When specified, it was using whatever was returned from get_indexer().
However, loading the native-fts indexer backend failed because there
was no connection to the postgresql database when this call was made.
The simple solution is to move the call after the open_connection()
call in Database.__init__(). However, open_connection() creates the
schema for the database if it is not there. The schema builds tables
for indexer=native type indexing, and as part of the build it looks at
the indexer to see the min/max size of the indexed tokens. With no
indexer defined, we get a crash. So it's a chicken/egg issue. I solved
it by initially setting the indexer to the Indexer class from
indexer_common, which has the min/max token size info. I also added a
no-op save_index() to this Indexer class. I claim save_index() isn't
needed as a commit() on the db does all the saving required. Then,
after open_connection() is called, I call get_indexer() to retrieve
the correct indexer, and indexer_postgresql_fts works since the conn
connection property is defined.

roundup/backends/indexer_common.py: add a save_index() method for the
indexer. It does nothing, but is needed in rdbms backends during
schema initialization.

2) roundup/backends/indexer_sqlite_fts.py: when this indexer is used,
the indexer test in DBTest on the word "the" fails. This is due to
missing stopword filtering. Implement basic stopword filtering for
bare stopwords (like 'the') to make the test pass.

Note: this indexer is not currently run automatically by the CI suite;
the failure was found during manual testing. However, there is a FIXME
to extract the indexer tests from DBTest and run them using this
backend.

roundup/configuration.py, doc/admin_guide.txt: update the docs on
stopword use for sqlite native-fts.

test/db_test_base.py: DBTest::testStringBinary creates a file with
nulls in it, which was breaking postgresql with the native-fts
indexer. Changed the test to assign the mime type
application/octet-stream, which prevents the file from being processed
by any text search indexer. Added a test to exclude indexer searching
in specific props; this code path was untested before.

test/test_indexer.py: add a test that calls find() with no words, a
previously untested code path. Add a test that indexes and searches a
string with a null (\x00) byte; it was tested inadvertently by
testStringBinary, but this makes it explicit and moves it into indexer
testing (one version each for generic, postgresql and mysql). Renamed
Get_IndexerAutoSelectTest to Get_IndexerTest and renamed the
autoselect tests to include "autoselect". Added tests for an invalid
indexer and for using native-fts with anydbm (an unsupported combo) to
make sure the code does something useful if the validation in
configuration.py is broken.

test/test_liveserver.py: add a test to load an issue, a test using
text search (fts) to find the issue, and tests to find the issue using
postgresql native-fts.

test/test_postgresql.py, test/test_sqlite.py: added an explanation of
how to set up the integration test using native-fts, and code to clean
up the test environment if the native-fts test is run.
1 parent 53490a2 commit e00d0dd

File tree

12 files changed: +304 additions, -14 deletions


CHANGES.txt

Lines changed: 9 additions & 1 deletion

@@ -40,6 +40,11 @@ Fixed:
   application/javascript. (John Rouillard)
 - Enable postgres-fts: fix indexer-common::get_indexer so it returns a
   postgresql-fts Test code paths in get_indexer. (John Rouillard)
+- Fix Postgres native-fts, implement a two phase initialization of the
+  indexer. The native-fts one gets assigned after the database
+  connection is open. (John Rouillard)
+- fix crash if postgresql native-fts backend is asked to index content
+  with null bytes. (John Rouillard)
 
 Features:
 
@@ -50,7 +55,10 @@ Features:
 - issue2550559 - Pretty printing / formatting for Number types.
   Added pretty(format='%0.3f') method to NumberHTMLProperty to
   print numeric values. If value is None, return empty string
-  otherwise str() of value.
+  otherwise str() of value. (John Rouillard)
+- sqlite native-fts backend now uses the stopwords list in config.ini
+  to filter words from queries. (Stopwords are still indexed so that
+  phrase/proximity searches still work.) (John Rouillard)
 
 2022-07-13 2.2.0

doc/admin_guide.txt

Lines changed: 8 additions & 4 deletions

@@ -326,10 +326,14 @@ All of the data that is indexed is in a single column, so when column
 specifiers are used they usually result in an error which is detected
 and an enhanced error message is produced.
 
-Unlike the native, xapian and whoosh indexers, there are no stopwords,
-and there is no limit to the length of terms that are indexed. Keeping
-these would break proximity and phrase searching. This may be helpful
-or problematic for your particular tracker.
+Unlike the native, xapian and whoosh indexers there is no
+limit to the length of terms that are indexed. Also
+stopwords are indexed but ignored when searching if they are
+the only word in the search. So a search for "the" will
+return no results but "the book" will return
+results. Pre-filtering the stopwords when indexing would
+break proximity and phrase searching. This may be helpful or
+problematic for your particular tracker.
 
 To support the most languages available, the unicode61 tokenizer is
 used without porter stemming. Using the ``indexer_language`` setting

roundup/backends/indexer_common.py

Lines changed: 3 additions & 0 deletions

@@ -35,6 +35,9 @@ def is_stopword(self, word):
     def getHits(self, search_terms, klass):
         return self.find(search_terms)
 
+    def save_index(self):
+        pass
+
     def search(self, search_terms, klass, ignore=None):
         """Display search results looking for [search, terms] associated
         with the hyperdb Class "klass". Ignore hits on {class: property}.

roundup/backends/indexer_postgresql_fts.py

Lines changed: 12 additions & 2 deletions

@@ -71,8 +71,13 @@ def add_text(self, identifier, text, mime_type='text/plain'):
             # not previously indexed
             sql = 'insert into __fts (_class, _itemid, _prop, _tsv)'\
                   ' values (%s, %s, %s, to_tsvector(%s, %s))' % (a, a, a, a, a)
-            self.db.cursor.execute(sql, identifier +
-                                   (self.db.config['INDEXER_LANGUAGE'], text))
+            try:
+                self.db.cursor.execute(sql, identifier +
+                    (self.db.config['INDEXER_LANGUAGE'], text))
+            except ValueError:
+                # if text is binary or otherwise un-indexable,
+                # we get a ValueError. For right now, ignore it.
+                pass
         else:
             id = r[0]
             sql = 'update __fts set _tsv=to_tsvector(%s, %s) where ctid=%s' % \

@@ -122,6 +127,11 @@ def find(self, wordlist):
             self.db.rollback()
 
             raise IndexerQueryError(e.args[0])
+        except ValueError as e:
+            # raised when the search string has null bytes in it or
+            # is otherwise unsuitable
+            raise IndexerQueryError(
+                "Invalid search string, do you have a null in there? " + e.args[0])
         except InFailedSqlTransaction:
             # reset the cursor as it's invalid currently
             self.db.rollback()

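The error handling this diff adds can be sketched outside Roundup as a pair of small free functions. This is a simplified stand-in, not the real indexer class: `IndexerQueryError`, `index_text`, `find`, and the `fake_execute` callback are illustrative names, and `fake_execute` only mimics psycopg2's rejection of NUL characters.

```python
# Sketch of the null-byte handling added to the postgres FTS indexer:
# indexing swallows the ValueError (mislabeled binary content), while
# searching remaps it to an IndexerQueryError for the caller.

class IndexerQueryError(Exception):
    """Raised when a search query cannot be executed."""

def index_text(execute, identifier, text):
    """Index text; silently skip content the backend rejects."""
    try:
        execute(identifier, text)
    except ValueError:
        # binary or otherwise un-indexable text (e.g. null bytes in a
        # file mislabeled with a text/* mime type): ignore it for now
        pass

def find(execute, query):
    """Search; remap the null-byte ValueError to IndexerQueryError."""
    try:
        return execute(query)
    except ValueError as e:
        raise IndexerQueryError(
            "Invalid search string, do you have a null in there? "
            + str(e))

def fake_execute(*args):
    # mimics psycopg2 rejecting strings that contain NUL characters
    for a in args:
        if isinstance(a, str) and "\x00" in a:
            raise ValueError("A string literal cannot contain NUL "
                             "(0x00) characters.")
    return []
```

The asymmetry matches the commit's reasoning: a null during indexing is plausibly an admin's mime-type mistake and should not crash, while a null in a search query is user input worth reporting loudly.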
roundup/backends/indexer_sqlite_fts.py

Lines changed: 16 additions & 0 deletions

@@ -97,6 +97,22 @@ def find(self, wordlist):
 
         https://www.sqlite.org/fts5.html#full_text_query_syntax
         """
+
+        # Filter out stopwords. Other searches tokenize the user query
+        # into a list of simple word tokens. For FTS, query
+        # tokenization doesn't occur.
+
+        # A user's FTS query is a wordlist with one element. The
+        # element is a string to parse and will probably not match a
+        # stop word.
+        #
+        # However the generic indexer search tests pass in a list of
+        # word tokens. We filter the word tokens so it behaves like
+        # other backends. This means that a search for a simple word
+        # like 'the' (without quotes) will return no hits, as the test
+        # expects.
+        wordlist = [w for w in wordlist if not self.is_stopword(w.upper())]
+
         if not wordlist:
             return []

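The bare-stopword filtering above can be illustrated with the logic pulled out as free functions. The `STOPWORDS` set here is a tiny stand-in for Roundup's configurable list (config.ini plus the indexer's defaults), not the actual list.

```python
# Sketch of the bare-stopword filtering added to the sqlite FTS
# indexer. Stopwords are dropped from the query token list only;
# they remain in the index itself, so quoted phrase and proximity
# searches that contain them still work.

STOPWORDS = {"A", "AND", "ARE", "AS", "AT", "BE", "BUT", "BY", "THE"}

def is_stopword(word):
    return word in STOPWORDS

def filter_stopwords(wordlist):
    # comparison is case-insensitive: tokens are upper-cased to
    # match the all-caps stopword list
    return [w for w in wordlist if not is_stopword(w.upper())]
```

So a query of `['the']` yields an empty token list (no hits, as the DBTest expects), while `['the', 'book']` still searches on `book`.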
roundup/backends/rdbms_common.py

Lines changed: 19 additions & 1 deletion

@@ -63,6 +63,7 @@
 # support
 from roundup.backends.blobfiles import FileStorage
 from roundup.backends.indexer_common import get_indexer
+from roundup.backends.indexer_common import Indexer as CommonIndexer
 from roundup.backends.sessions_rdbms import Sessions, OneTimeKeys
 from roundup.date import Range

@@ -174,7 +175,20 @@ def __init__(self, config, journaltag=None):
         self.config, self.journaltag = config, journaltag
         self.dir = config.DATABASE
         self.classes = {}
-        self.indexer = get_indexer(config, self)
+        # Assign the indexer base class here. During schema
+        # generation in open_connection, the min/max size for FTS
+        # tokens is used when creating the database tables for
+        # indexer=native full text search. These tables are always
+        # created as part of the schema so that the admin can choose
+        # indexer=native at some later date and "things will just
+        # work" (TM).
+        #
+        # We would like to use get_indexer() to return the real
+        # indexer class. However indexer=native-fts for postgres
+        # requires a database connection (conn) to be defined when
+        # calling get_indexer. The call to open_connection creates the
+        # conn but also creates the schema if it is missing.
+        self.indexer = CommonIndexer(self)
         self.security = security.Security(self)

         # additional transaction support for external files and the like

@@ -201,6 +215,10 @@ def __init__(self, config, journaltag=None):
         # open a connection to the database, creating the "conn" attribute
         self.open_connection()

+        # If the indexer is native-fts, the conn to the db must be
+        # available, so we set the real self.indexer value here,
+        # after the db is open.
+        self.indexer = get_indexer(config, self)
+
     def clearCache(self):
         self.cache = {}
         self.cache_lru = []

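The two-phase initialization in this diff can be sketched in isolation. `Database`, `CommonIndexer`, and `get_indexer` below are simplified stand-ins for the Roundup classes (the bodies, token-size values, and `NativeFTSIndexer` name are invented for illustration), but the ordering constraint is the one the comments describe.

```python
# Minimal sketch of two-phase indexer initialization: a placeholder
# indexer carries the token-size limits that schema creation needs,
# and the real (connection-dependent) indexer is swapped in only
# after the database connection exists.

class CommonIndexer:
    """Base indexer: holds token-size info usable before any
    database connection is open."""
    minlength = 2   # illustrative values, not Roundup's
    maxlength = 50

    def save_index(self):
        # no-op: a commit() on the db does all the saving required
        pass

def get_indexer(config, db):
    # stand-in for indexer_common.get_indexer(): the native-fts
    # backend needs db.conn to already be open when instantiated
    class NativeFTSIndexer(CommonIndexer):
        def __init__(self, db):
            assert db.conn is not None, "needs an open db connection"
            self.db = db
    return NativeFTSIndexer(db)

class Database:
    def __init__(self, config):
        self.config = config
        self.conn = None
        # phase 1: placeholder indexer so schema creation can read
        # min/max token sizes while no connection exists yet
        self.indexer = CommonIndexer()
        self.open_connection()   # creates self.conn (and the schema)
        # phase 2: conn exists now, so fetch the real indexer
        self.indexer = get_indexer(config, self)

    def open_connection(self):
        self.conn = object()     # stand-in for a real connection
```

Reversing the two phases reproduces the chicken/egg failure the commit describes: `get_indexer()` would assert before `open_connection()` has created `conn`.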
roundup/configuration.py

Lines changed: 2 additions & 1 deletion

@@ -1057,7 +1057,8 @@ def str2value(self, value):
        "Additional stop-words for the full-text indexer specific to\n"
        "your tracker. See the indexer source for the default list of\n"
        "stop-words (eg. A,AND,ARE,AS,AT,BE,BUT,BY, ...). This is\n"
-       "not used by the native-fts indexer."),
+       "not used by the postgres native-fts indexer. But is used to\n"
+       "filter search terms with the sqlite native-fts indexer."),
     (OctalNumberOption, "umask", "0o002",
        "Defines the file creation mode mask."),
     (IntegerNumberGeqZeroOption, 'csv_field_size', '131072',

test/db_test_base.py

Lines changed: 30 additions & 1 deletion

@@ -415,7 +415,10 @@ def testStringBinary(self):
         bstr = b'\x00\xF0\x34\x33' # random binary data
 
         # test set & retrieve (this time for file contents)
-        nid = self.db.file.create(content=bstr)
+        # Since it has a null in it, set it to a binary mime type
+        # so indexers don't try to index it.
+        nid = self.db.file.create(content=bstr,
+                                  type="application/octet-stream")
         print(nid)
         print(repr(self.db.file.get(nid, 'content')))
         print(repr(self.db.file.get(nid, 'binary_content')))

@@ -1523,6 +1526,32 @@ def testIndexerSearching(self):
         # unindexed stopword
         self.assertEqual(self.db.indexer.search(['the'], self.db.issue), {})
 
+    def testIndexerSearchingIgnoreProps(self):
+        f1 = self.db.file.create(content='hello', type="text/plain")
+        # content='world' has the wrong content-type and won't be indexed
+        f2 = self.db.file.create(content='world', type="text/frozz",
+                                 comment='blah blah')
+        i1 = self.db.issue.create(files=[f1, f2], title="flebble plop")
+        i2 = self.db.issue.create(title="flebble the frooz")
+        self.db.commit()
+
+        # filter out hits that are in the title prop of issues
+        self.assertEqual(self.db.indexer.search(['frooz'], self.db.issue,
+                                                ignore={('issue', 'title'): True}),
+                         {})
+
+        # filter out hits in the title prop of content. Note the returned
+        # match is in a file not an issue, so ignore has no effect.
+        # also there is no content property for issue.
+        self.assertEqual(self.db.indexer.search(['hello'], self.db.issue,
+                                                ignore={('issue', 'content'): True}),
+                         {f1: {'files': ['1']}})
+
+        # filter out the file content property hit, leaving no results
+        self.assertEqual(self.db.indexer.search(['hello'], self.db.issue,
+                                                ignore={('file', 'content'): True}),
+                         {})
+
     def testIndexerSearchingLink(self):
         m1 = self.db.msg.create(content="one two")
         i1 = self.db.issue.create(messages=[m1])

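The testStringBinary fix relies on mime-type gating: content only reaches a text indexer when its type is textual, so tagging binary data `application/octet-stream` keeps it out of the index entirely. A minimal sketch of that gate, with `should_index` as a hypothetical helper rather than Roundup's actual API:

```python
# Sketch of mime-type gating for full-text indexing: only text/*
# content is handed to the indexer, so binary data labeled
# application/octet-stream is never indexed (and can never trigger
# the postgres null-byte ValueError during indexing).

def should_index(mime_type):
    """Return True only for textual content types."""
    return mime_type is not None and mime_type.startswith("text/")
```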
test/test_indexer.py

Lines changed: 85 additions & 4 deletions

@@ -97,6 +97,7 @@ def test_basics(self):
                                 ('test', '2', 'foo')])
         self.assertSeqEqual(self.dex.find(['blah']), [('test', '2', 'foo')])
         self.assertSeqEqual(self.dex.find(['blah', 'hello']), [])
+        self.assertSeqEqual(self.dex.find([]), [])
 
     def test_change(self):
         self.dex.add_text(('test', '1', 'foo'), 'a the hello world')

@@ -207,6 +208,14 @@ def test_unicode(self):
                             [('test', '1', 'a'), ('test', '2', 'a')])
         self.assertSeqEqual(self.dex.find([u'\u0440\u0443\u0441\u0441\u043a\u0438\u0439']),
                             [('test', '2', 'a')])
+
+    def testNullChar(self):
+        """Test with a null char in the string. Postgres FTS will not
+           index it; it will just ignore the string for now.
+        """
+        string = "\x00\x01fred\x255"
+        self.dex.add_text(('test', '1', 'a'), string)
+        self.assertSeqEqual(self.dex.find(string), [])
 
     def tearDown(self):
         shutil.rmtree('test-index')

@@ -247,7 +256,7 @@ def setUp(self):
     def tearDown(self):
         IndexerTest.tearDown(self)
 
-class Get_IndexerAutoSelectTest(anydbmOpener, unittest.TestCase):
+class Get_IndexerTest(anydbmOpener, unittest.TestCase):
 
     def setUp(self):
         # remove previous test, ignore errors

@@ -265,26 +274,67 @@ def tearDown(self):
         shutil.rmtree(config.DATABASE)
 
     @skip_xapian
-    def test_xapian_select(self):
+    def test_xapian_autoselect(self):
         indexer = get_indexer(self.db.config, self.db)
         self.assertIn('roundup.backends.indexer_xapian.Indexer', str(indexer))
 
     @skip_whoosh
-    def test_whoosh_select(self):
+    def test_whoosh_autoselect(self):
         import mock, sys
         with mock.patch.dict('sys.modules',
                              {'roundup.backends.indexer_xapian': None}):
             indexer = get_indexer(self.db.config, self.db)
         self.assertIn('roundup.backends.indexer_whoosh.Indexer', str(indexer))
 
-    def test_native_select(self):
+    def test_native_autoselect(self):
         import mock, sys
         with mock.patch.dict('sys.modules',
                              {'roundup.backends.indexer_xapian': None,
                               'roundup.backends.indexer_whoosh': None}):
             indexer = get_indexer(self.db.config, self.db)
         self.assertIn('roundup.backends.indexer_dbm.Indexer', str(indexer))
 
+    def test_invalid_indexer(self):
+        """There is code at the end of indexer_common::get_indexer() to
+           raise an AssertionError if the indexer name is invalid.
+           This should never be triggered. If it is, it means that
+           the code in configuration.py that validates indexer names
+           allows a name through that get_indexer can't handle.
+
+           Simulate that failure and make sure that the
+           AssertionError is raised.
+        """
+
+        with self.assertRaises(ValueError) as cm:
+            self.db.config['INDEXER'] = 'no_such_indexer'
+
+        # mangle things so we can test the AssertionError at the
+        # end of get_indexer()
+        from roundup.configuration import IndexerOption
+        IndexerOption.allowed.append("unrecognized_indexer")
+        self.db.config['INDEXER'] = "unrecognized_indexer"
+
+        with self.assertRaises(AssertionError) as cm:
+            indexer = get_indexer(self.db.config, self.db)
+
+        # unmangle state
+        IndexerOption.allowed.pop()
+        self.assertNotIn("unrecognized_indexer", IndexerOption.allowed)
+        self.db.config['INDEXER'] = ""
+
+    def test_unsupported_by_db(self):
+        """This requires that the db associated with the test
+           is not sqlite or postgres. anydbm works fine to trigger
+           the error.
+        """
+        self.db.config['INDEXER'] = 'native-fts'
+        with self.assertRaises(AssertionError) as cm:
+            get_indexer(self.db.config, self.db)
+
+        self.assertIn("native-fts", cm.exception.args[0])
+        self.db.config['INDEXER'] = ''
+
 class RDBMSIndexerTest(object):
     def setUp(self):
         # remove previous test, ignore errors

@@ -520,6 +570,26 @@ def test_invalid_language(self):
 
         self.db.config["INDEXER_LANGUAGE"] = "english"
 
+    def testNullChar(self):
+        """Test with a null char in the string. Postgres FTS throws
+           a ValueError on indexing, which we ignore. This could
+           happen when indexing a binary file with a bad mime type.
+           On find, it throws a ProgrammingError that we remap to
+           IndexerQueryError and pass up. If a null gets to that
+           level on search, somebody entered it (not sure how you
+           could actually do that) but we want a crash in that case
+           as the person is probably up to "no good" (R) (TM).
+        """
+        import psycopg2
+
+        string = "\x00\x01fred\x255"
+        self.dex.add_text(('test', '1', 'a'), string)
+        with self.assertRaises(IndexerQueryError) as ctx:
+            self.assertSeqEqual(self.dex.find(string), [])
+
+        self.assertIn("null", ctx.exception.args[0])
+
 @skip_mysql
 class mysqlIndexerTest(mysqlOpener, RDBMSIndexerTest, IndexerTest):
     def setUp(self):

@@ -661,4 +731,15 @@ def test_query_errors(self):
         error = 'Query error: syntax error near "^"'
         self.assertEqual(str(ctx.exception), error)
 
+    def testNullChar(self):
+        """Test with a null char in the string. FTS will throw
+           an error on the null.
+        """
+        import psycopg2
+
+        string = "\x00\x01fred\x255"
+        self.dex.add_text(('test', '1', 'a'), string)
+        with self.assertRaises(IndexerQueryError) as cm:
+            self.assertSeqEqual(self.dex.find(string), [])
+
 # vim: set filetype=python ts=4 sw=4 et si
