
Commit e00d0dd

postgresql native-fts; more indexer tests
1) Make postgresql native-fts actually work.
2) Add simple stopword filtering to the sqlite native-fts indexer.
3) Add more tests for indexer_common get_indexer.

Details:

1) roundup/backends/indexer_postgresql_fts.py: ignore the ValueError
raised if we try to index a string with a null character in it. This
could happen due to an incorrect text/* mime type on a file that has
nulls in it. Replace the ValueError raised by postgresql with a
customized IndexerQueryError if a search string has a null in it.

roundup/backends/rdbms_common.py: make postgresql native-fts work.
When specified, it was using whatever was returned from get_indexer().
However, loading the native-fts indexer backend failed because there
was no connection to the postgresql database when this call was made.
The simple solution is to move the call after the open_connection()
call in Database.__init__(). However, open_connection() creates the
schema for the database if it is not there. The schema builds tables
for indexer=native type indexing, and as part of the build it looks at
the indexer to see the min/max size of the indexed tokens. With no
indexer defined, we get a crash. So it's a chicken/egg issue. I solved
it by initially setting the indexer to the Indexer class from
indexer_common, which has the min/max token size info. I also added a
no-op save_index() to this Indexer class. I claim save_index() isn't
needed as a commit() on the db does all the saving required. Then,
after open_connection() is called, I call get_indexer() to retrieve
the correct indexer, and indexer_postgresql_fts works since the conn
connection property is defined.

roundup/backends/indexer_common.py: add a save_index() method for the
indexer. It does nothing, but is needed in rdbms backends during
schema initialization.

2) roundup/backends/indexer_sqlite_fts.py: when this indexer is used,
the indexer test in DBTest on the word "the" fails. This is due to
missing stopword filtering. Implement basic stopword filtering for
bare stopwords (like 'the') to make the test pass.

Note: this indexer is not currently run automatically by the CI suite;
the failure was found during manual testing. However, there is a FIXME
to extract the indexer tests from DBTest and run them using this
backend.

roundup/configuration.py, doc/admin_guide.txt: update the docs on
stopword use for sqlite native-fts.

test/db_test_base.py: DBTest::testStringBinary creates a file with
nulls in it, which was breaking postgresql with the native-fts
indexer. Changed the test to assign the mime type
application/octet-stream, which prevents the file from being processed
by any text search indexer. Added a test to exclude indexer searching
in specific props; this code path was untested before.

test/test_indexer.py: add a test that calls find() with no words, a
previously untested code path. Add a test that indexes and searches a
string with a null (\x00) byte; it was tested inadvertently by
testStringBinary, but this makes it explicit and moves it into indexer
testing (one version each for generic, postgresql and mysql). Renamed
Get_IndexerAutoSelectTest to Get_IndexerTest and renamed the
autoselect tests to include "autoselect". Added tests for an invalid
indexer and for using native-fts with anydbm (an unsupported combo) to
make sure the code does something useful if the validation in
configuration.py is broken.

test/test_liveserver.py: add a test to load an issue, a test using
text search (fts) to find the issue, and tests to find the issue using
postgresql native-fts.

test/test_postgresql.py, test/test_sqlite.py: added an explanation of
how to set up the integration test using native-fts, and code to clean
up the test environment if the native-fts test is run.
1 parent 53490a2 commit e00d0dd

File tree

12 files changed: +304 additions, -14 deletions


CHANGES.txt

Lines changed: 9 additions & 1 deletion

@@ -40,6 +40,11 @@ Fixed:
   application/javascript. (John Rouillard)
 - Enable postgres-fts: fix indexer-common::get_indexer so it returns a
   postgresql-fts Test code paths in get_indexer. (John Rouillard)
+- Fix Postgres native-fts, implement a two phase initialization of the
+  indexer. The native-fts one gets assigned after the database
+  connection is open. (John Rouillard)
+- fix crash if postgresql native-fts backend is asked to index content
+  with null bytes. (John Rouillard)
 
 Features:
 
@@ -50,7 +55,10 @@ Features:
 - issue2550559 - Pretty printing / formatting for Number types.
   Added pretty(format='%0.3f') method to NumberHTMLProperty to
   print numeric values. If value is None, return empty string
-  otherwise str() of value.
+  otherwise str() of value. (John Rouillard)
+- sqlite native-fts backend now uses the stopwords list in config.ini
+  to filter words from queries. (Stopwords are still indexed so that
+  phrase/proximity searches still work.) (John Rouillard)
 
 2022-07-13 2.2.0

doc/admin_guide.txt

Lines changed: 8 additions & 4 deletions

@@ -326,10 +326,14 @@ All of the data that is indexed is in a single column, so when column
 specifiers are used they usually result in an error which is detected
 and an enhanced error message is produced.
 
-Unlike the native, xapian and whoosh indexers, there are no stopwords,
-and there is no limit to the length of terms that are indexed. Keeping
-these would break proximity and phrase searching. This may be helpful
-or problematic for your particular tracker.
+Unlike the native, xapian and whoosh indexers there is no
+limit to the length of terms that are indexed. Also
+stopwords are indexed but ignored when searching if they are
+the only word in the search. So a search for "the" will
+return no results but "the book" will return
+results. Pre-filtering the stopwords when indexing would
+break proximity and phrase searching. This may be helpful or
+problematic for your particular tracker.
 
 To support the most languages available, the unicode61 tokenizer is
 used without porter stemming. Using the ``indexer_language`` setting

roundup/backends/indexer_common.py

Lines changed: 3 additions & 0 deletions

@@ -35,6 +35,9 @@ def is_stopword(self, word):
     def getHits(self, search_terms, klass):
         return self.find(search_terms)
 
+    def save_index(self):
+        pass
+
     def search(self, search_terms, klass, ignore=None):
         """Display search results looking for [search, terms] associated
         with the hyperdb Class "klass". Ignore hits on {class: property}.

roundup/backends/indexer_postgresql_fts.py

Lines changed: 12 additions & 2 deletions

@@ -71,8 +71,13 @@ def add_text(self, identifier, text, mime_type='text/plain'):
             # not previously indexed
             sql = 'insert into __fts (_class, _itemid, _prop, _tsv)'\
                   ' values (%s, %s, %s, to_tsvector(%s, %s))' % (a, a, a, a, a)
-            self.db.cursor.execute(sql, identifier +
-                                   (self.db.config['INDEXER_LANGUAGE'], text))
+            try:
+                self.db.cursor.execute(sql, identifier +
+                    (self.db.config['INDEXER_LANGUAGE'], text))
+            except ValueError:
+                # if text is binary or otherwise un-indexable,
+                # we get a ValueError. For right now, ignore it.
+                pass
         else:
             id = r[0]
             sql = 'update __fts set _tsv=to_tsvector(%s, %s) where ctid=%s' % \

@@ -122,6 +127,11 @@ def find(self, wordlist):
             self.db.rollback()
 
             raise IndexerQueryError(e.args[0])
+        except ValueError as e:
+            # raised when the search string has null bytes in it or
+            # is otherwise unsuitable
+            raise IndexerQueryError(
+                "Invalid search string, do you have a null in there? " + e.args[0])
         except InFailedSqlTransaction:
             # reset the cursor as it's invalid currently
             self.db.rollback()

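The error handling this diff adds can be sketched outside Roundup as a pair of small free functions. This is a simplified stand-in, not the real indexer class: `IndexerQueryError`, `index_text`, `find`, and the `fake_execute` callback are illustrative names, and `fake_execute` only mimics psycopg2's rejection of NUL characters.

```python
# Sketch of the null-byte handling added to the postgres FTS indexer:
# indexing swallows the ValueError (mislabeled binary content), while
# searching remaps it to an IndexerQueryError for the caller.

class IndexerQueryError(Exception):
    """Raised when a search query cannot be executed."""

def index_text(execute, identifier, text):
    """Index text; silently skip content the backend rejects."""
    try:
        execute(identifier, text)
    except ValueError:
        # binary or otherwise un-indexable text (e.g. null bytes in a
        # file mislabeled with a text/* mime type): ignore it for now
        pass

def find(execute, query):
    """Search; remap the null-byte ValueError to IndexerQueryError."""
    try:
        return execute(query)
    except ValueError as e:
        raise IndexerQueryError(
            "Invalid search string, do you have a null in there? "
            + str(e))

def fake_execute(*args):
    # mimics psycopg2 rejecting strings that contain NUL characters
    for a in args:
        if isinstance(a, str) and "\x00" in a:
            raise ValueError("A string literal cannot contain NUL "
                             "(0x00) characters.")
    return []
```

The asymmetry matches the commit's reasoning: a null during indexing is plausibly an admin's mime-type mistake and should not crash, while a null in a search query is user input worth reporting loudly.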
roundup/backends/indexer_sqlite_fts.py

Lines changed: 16 additions & 0 deletions

@@ -97,6 +97,22 @@ def find(self, wordlist):
 
         https://www.sqlite.org/fts5.html#full_text_query_syntax
         """
+
+        # Filter out stopwords. Other searches tokenize the user query
+        # into a list of simple word tokens. For FTS, query
+        # tokenization doesn't occur.
+
+        # A user's FTS query is a wordlist with one element. The
+        # element is a string to parse and will probably not match a
+        # stop word.
+        #
+        # However the generic indexer search tests pass in a list of
+        # word tokens. We filter the word tokens so it behaves like
+        # other backends. This means that a search for a simple word
+        # like 'the' (without quotes) will return no hits, as the test
+        # expects.
+        wordlist = [w for w in wordlist if not self.is_stopword(w.upper())]
+
         if not wordlist:
             return []

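The bare-stopword filtering above can be illustrated with the logic pulled out as free functions. The `STOPWORDS` set here is a tiny stand-in for Roundup's configurable list (config.ini plus the indexer's defaults), not the actual list.

```python
# Sketch of the bare-stopword filtering added to the sqlite FTS
# indexer. Stopwords are dropped from the query token list only;
# they remain in the index itself, so quoted phrase and proximity
# searches that contain them still work.

STOPWORDS = {"A", "AND", "ARE", "AS", "AT", "BE", "BUT", "BY", "THE"}

def is_stopword(word):
    return word in STOPWORDS

def filter_stopwords(wordlist):
    # comparison is case-insensitive: tokens are upper-cased to
    # match the all-caps stopword list
    return [w for w in wordlist if not is_stopword(w.upper())]
```

So a query of `['the']` yields an empty token list (no hits, as the DBTest expects), while `['the', 'book']` still searches on `book`.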
roundup/backends/rdbms_common.py

Lines changed: 19 additions & 1 deletion

@@ -63,6 +63,7 @@
 # support
 from roundup.backends.blobfiles import FileStorage
 from roundup.backends.indexer_common import get_indexer
+from roundup.backends.indexer_common import Indexer as CommonIndexer
 from roundup.backends.sessions_rdbms import Sessions, OneTimeKeys
 from roundup.date import Range

@@ -174,7 +175,20 @@ def __init__(self, config, journaltag=None):
         self.config, self.journaltag = config, journaltag
         self.dir = config.DATABASE
         self.classes = {}
-        self.indexer = get_indexer(config, self)
+        # Assign the indexer base class here. During schema
+        # generation in open_connection, the min/max size for FTS
+        # tokens is used when creating the database tables for
+        # indexer=native full text search. These tables are always
+        # created as part of the schema so that the admin can choose
+        # indexer=native at some later date and "things will just
+        # work" (TM).
+        #
+        # We would like to use get_indexer() to return the real
+        # indexer class. However indexer=native-fts for postgres
+        # requires a database connection (conn) to be defined when
+        # calling get_indexer. The call to open_connection creates the
+        # conn but also creates the schema if it is missing.
+        self.indexer = CommonIndexer(self)
         self.security = security.Security(self)

         # additional transaction support for external files and the like

@@ -201,6 +215,10 @@ def __init__(self, config, journaltag=None):
         # open a connection to the database, creating the "conn" attribute
         self.open_connection()

+        # If the indexer is native-fts, the conn to the db must be
+        # available, so we set the real self.indexer value here,
+        # after the db is open.
+        self.indexer = get_indexer(config, self)
+
     def clearCache(self):
         self.cache = {}
         self.cache_lru = []

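The two-phase initialization in this diff can be sketched in isolation. `Database`, `CommonIndexer`, and `get_indexer` below are simplified stand-ins for the Roundup classes (the bodies, token-size values, and `NativeFTSIndexer` name are invented for illustration), but the ordering constraint is the one the comments describe.

```python
# Minimal sketch of two-phase indexer initialization: a placeholder
# indexer carries the token-size limits that schema creation needs,
# and the real (connection-dependent) indexer is swapped in only
# after the database connection exists.

class CommonIndexer:
    """Base indexer: holds token-size info usable before any
    database connection is open."""
    minlength = 2   # illustrative values, not Roundup's
    maxlength = 50

    def save_index(self):
        # no-op: a commit() on the db does all the saving required
        pass

def get_indexer(config, db):
    # stand-in for indexer_common.get_indexer(): the native-fts
    # backend needs db.conn to already be open when instantiated
    class NativeFTSIndexer(CommonIndexer):
        def __init__(self, db):
            assert db.conn is not None, "needs an open db connection"
            self.db = db
    return NativeFTSIndexer(db)

class Database:
    def __init__(self, config):
        self.config = config
        self.conn = None
        # phase 1: placeholder indexer so schema creation can read
        # min/max token sizes while no connection exists yet
        self.indexer = CommonIndexer()
        self.open_connection()   # creates self.conn (and the schema)
        # phase 2: conn exists now, so fetch the real indexer
        self.indexer = get_indexer(config, self)

    def open_connection(self):
        self.conn = object()     # stand-in for a real connection
```

Reversing the two phases reproduces the chicken/egg failure the commit describes: `get_indexer()` would assert before `open_connection()` has created `conn`.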
roundup/configuration.py

Lines changed: 2 additions & 1 deletion

@@ -1057,7 +1057,8 @@ def str2value(self, value):
        "Additional stop-words for the full-text indexer specific to\n"
        "your tracker. See the indexer source for the default list of\n"
        "stop-words (eg. A,AND,ARE,AS,AT,BE,BUT,BY, ...). This is\n"
-       "not used by the native-fts indexer."),
+       "not used by the postgres native-fts indexer. But is used to\n"
+       "filter search terms with the sqlite native-fts indexer."),
     (OctalNumberOption, "umask", "0o002",
        "Defines the file creation mode mask."),
     (IntegerNumberGeqZeroOption, 'csv_field_size', '131072',

test/db_test_base.py

Lines changed: 30 additions & 1 deletion

@@ -415,7 +415,10 @@ def testStringBinary(self):
         bstr = b'\x00\xF0\x34\x33' # random binary data
 
         # test set & retrieve (this time for file contents)
-        nid = self.db.file.create(content=bstr)
+        # Since it has a null in it, set it to a binary mime type
+        # so indexers don't try to index it.
+        nid = self.db.file.create(content=bstr,
+                                  type="application/octet-stream")
         print(nid)
         print(repr(self.db.file.get(nid, 'content')))
         print(repr(self.db.file.get(nid, 'binary_content')))

@@ -1523,6 +1526,32 @@ def testIndexerSearching(self):
         # unindexed stopword
         self.assertEqual(self.db.indexer.search(['the'], self.db.issue), {})
 
+    def testIndexerSearchingIgnoreProps(self):
+        f1 = self.db.file.create(content='hello', type="text/plain")
+        # content='world' has the wrong content-type and won't be indexed
+        f2 = self.db.file.create(content='world', type="text/frozz",
+                                 comment='blah blah')
+        i1 = self.db.issue.create(files=[f1, f2], title="flebble plop")
+        i2 = self.db.issue.create(title="flebble the frooz")
+        self.db.commit()
+
+        # filter out hits that are in the title prop of issues
+        self.assertEqual(self.db.indexer.search(['frooz'], self.db.issue,
+                                                ignore={('issue', 'title'): True}),
+                         {})
+
+        # filter out hits in the title prop of content. Note the returned
+        # match is in a file not an issue, so ignore has no effect.
+        # also there is no content property for issue.
+        self.assertEqual(self.db.indexer.search(['hello'], self.db.issue,
+                                                ignore={('issue', 'content'): True}),
+                         {f1: {'files': ['1']}})
+
+        # filter out the file content property hit, leaving no results
+        self.assertEqual(self.db.indexer.search(['hello'], self.db.issue,
+                                                ignore={('file', 'content'): True}),
+                         {})
+
     def testIndexerSearchingLink(self):
         m1 = self.db.msg.create(content="one two")
         i1 = self.db.issue.create(messages=[m1])

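The testStringBinary fix relies on mime-type gating: content only reaches a text indexer when its type is textual, so tagging binary data `application/octet-stream` keeps it out of the index entirely. A minimal sketch of that gate, with `should_index` as a hypothetical helper rather than Roundup's actual API:

```python
# Sketch of mime-type gating for full-text indexing: only text/*
# content is handed to the indexer, so binary data labeled
# application/octet-stream is never indexed (and can never trigger
# the postgres null-byte ValueError during indexing).

def should_index(mime_type):
    """Return True only for textual content types."""
    return mime_type is not None and mime_type.startswith("text/")
```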
test/test_indexer.py

Lines changed: 85 additions & 4 deletions

@@ -97,6 +97,7 @@ def test_basics(self):
                                 ('test', '2', 'foo')])
         self.assertSeqEqual(self.dex.find(['blah']), [('test', '2', 'foo')])
         self.assertSeqEqual(self.dex.find(['blah', 'hello']), [])
+        self.assertSeqEqual(self.dex.find([]), [])
 
     def test_change(self):
         self.dex.add_text(('test', '1', 'foo'), 'a the hello world')

@@ -207,6 +208,14 @@ def test_unicode(self):
                             [('test', '1', 'a'), ('test', '2', 'a')])
         self.assertSeqEqual(self.dex.find([u'\u0440\u0443\u0441\u0441\u043a\u0438\u0439']),
                             [('test', '2', 'a')])
+
+    def testNullChar(self):
+        """Test with a null char in the string. Postgres FTS will not
+           index it; it will just ignore the string for now.
+        """
+        string = "\x00\x01fred\x255"
+        self.dex.add_text(('test', '1', 'a'), string)
+        self.assertSeqEqual(self.dex.find(string), [])
 
     def tearDown(self):
         shutil.rmtree('test-index')

@@ -247,7 +256,7 @@ def setUp(self):
     def tearDown(self):
         IndexerTest.tearDown(self)
 
-class Get_IndexerAutoSelectTest(anydbmOpener, unittest.TestCase):
+class Get_IndexerTest(anydbmOpener, unittest.TestCase):
 
     def setUp(self):
         # remove previous test, ignore errors

@@ -265,26 +274,67 @@ def tearDown(self):
         shutil.rmtree(config.DATABASE)
 
     @skip_xapian
-    def test_xapian_select(self):
+    def test_xapian_autoselect(self):
         indexer = get_indexer(self.db.config, self.db)
         self.assertIn('roundup.backends.indexer_xapian.Indexer', str(indexer))
 
     @skip_whoosh
-    def test_whoosh_select(self):
+    def test_whoosh_autoselect(self):
         import mock, sys
         with mock.patch.dict('sys.modules',
                              {'roundup.backends.indexer_xapian': None}):
             indexer = get_indexer(self.db.config, self.db)
         self.assertIn('roundup.backends.indexer_whoosh.Indexer', str(indexer))
 
-    def test_native_select(self):
+    def test_native_autoselect(self):
         import mock, sys
         with mock.patch.dict('sys.modules',
                              {'roundup.backends.indexer_xapian': None,
                               'roundup.backends.indexer_whoosh': None}):
             indexer = get_indexer(self.db.config, self.db)
         self.assertIn('roundup.backends.indexer_dbm.Indexer', str(indexer))
 
+    def test_invalid_indexer(self):
+        """There is code at the end of indexer_common::get_indexer() to
+           raise an AssertionError if the indexer name is invalid.
+           This should never be triggered. If it is, it means that
+           the code in configuration.py that validates indexer names
+           allows a name through that get_indexer can't handle.
+
+           Simulate that failure and make sure that the
+           AssertionError is raised.
+        """
+
+        with self.assertRaises(ValueError) as cm:
+            self.db.config['INDEXER'] = 'no_such_indexer'
+
+        # mangle things so we can test the AssertionError at the
+        # end of get_indexer()
+        from roundup.configuration import IndexerOption
+        IndexerOption.allowed.append("unrecognized_indexer")
+        self.db.config['INDEXER'] = "unrecognized_indexer"
+
+        with self.assertRaises(AssertionError) as cm:
+            indexer = get_indexer(self.db.config, self.db)
+
+        # unmangle state
+        IndexerOption.allowed.pop()
+        self.assertNotIn("unrecognized_indexer", IndexerOption.allowed)
+        self.db.config['INDEXER'] = ""
+
+    def test_unsupported_by_db(self):
+        """This requires that the db associated with the test
+           is not sqlite or postgres. anydbm works fine to trigger
+           the error.
+        """
+        self.db.config['INDEXER'] = 'native-fts'
+        with self.assertRaises(AssertionError) as cm:
+            get_indexer(self.db.config, self.db)
+
+        self.assertIn("native-fts", cm.exception.args[0])
+        self.db.config['INDEXER'] = ''
+
 class RDBMSIndexerTest(object):
     def setUp(self):
         # remove previous test, ignore errors

@@ -520,6 +570,26 @@ def test_invalid_language(self):
 
         self.db.config["INDEXER_LANGUAGE"] = "english"
 
+    def testNullChar(self):
+        """Test with a null char in the string. Postgres FTS throws
+           a ValueError on indexing, which we ignore. This could
+           happen when indexing a binary file with a bad mime type.
+           On find, it throws a ProgrammingError that we remap to
+           IndexerQueryError and pass up. If a null gets to that
+           level on search, somebody entered it (not sure how you
+           could actually do that) but we want a crash in that case
+           as the person is probably up to "no good" (R) (TM).
+        """
+        import psycopg2
+
+        string = "\x00\x01fred\x255"
+        self.dex.add_text(('test', '1', 'a'), string)
+        with self.assertRaises(IndexerQueryError) as ctx:
+            self.assertSeqEqual(self.dex.find(string), [])
+
+        self.assertIn("null", ctx.exception.args[0])
+
 @skip_mysql
 class mysqlIndexerTest(mysqlOpener, RDBMSIndexerTest, IndexerTest):
     def setUp(self):

@@ -661,4 +731,15 @@ def test_query_errors(self):
         error = 'Query error: syntax error near "^"'
         self.assertEqual(str(ctx.exception), error)
 
+    def testNullChar(self):
+        """Test with a null char in the string. FTS will throw
+           an error on the null.
+        """
+        import psycopg2
+
+        string = "\x00\x01fred\x255"
+        self.dex.add_text(('test', '1', 'a'), string)
+        with self.assertRaises(IndexerQueryError) as cm:
+            self.assertSeqEqual(self.dex.find(string), [])
+
 # vim: set filetype=python ts=4 sw=4 et si
