Commit a2bf952

- issue2550636, issue2550909: Added support for Whoosh indexer.
Also adds a new config.ini setting called indexer to select the indexer. See ``doc/upgrading.txt`` for details. Initial patch done by David Wolever. Patch modified (see ticket or below for changes), docs updated and committed.

I have an outstanding issue with test/test_indexer.py: I have to comment out all imports and tests for indexers I don't have (i.e. mysql, postgres), otherwise no tests run. With that change made, the dbm, sqlite (rdbms), xapian and whoosh indexers all pass the indexer tests.

Changes summary:

1) Support the native back ends dbm and rdbms (the original patch only fell through to dbm).
2) Developed a whoosh stopfilter to not index stopwords or words outside the maxlength and minlength limits defined in indexer_common.py. Required to pass the extremewords test_indexer test. Also removed a call to .lower on the input text, as the tokenizer I chose lowercases automatically.
3) Added support for max/min length to find. This was needed to pass the extremewords test.
4) Added back a call to save_index in add_text. This allowed all but two tests to pass.
5) Fixed a call to: results = searcher.search(query.Term("identifier", identifier)) which had an extra parameter that is an error under current whoosh.
6) Set limit=None in the search call for find(); otherwise it only returns 10 items. This allowed it to pass the manyresults test.

Also, due to changes in the roundup code, removed the import in indexer_whoosh of: from roundup.anypy.sets_ import set, since we use the python builtin set.
1 parent 9507679 commit a2bf952
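
The test problem described in the commit message (tests for indexers whose modules are not installed stop the whole file from running) could probably be handled by guarding each optional import and skipping the dependent tests instead of commenting them out. A minimal sketch, assuming the standard unittest module; the helper name `have_module` and the test class are hypothetical and are not taken from test/test_indexer.py:

```python
import importlib
import unittest


def have_module(name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


# Hypothetical test case: reported as skipped, not failed, when the
# optional Whoosh module is missing.
@unittest.skipUnless(have_module("whoosh"), "Whoosh is not installed")
class WhooshIndexerTest(unittest.TestCase):
    def test_import(self):
        import whoosh  # only reached when the module is available
        self.assertTrue(hasattr(whoosh, "__name__"))
```

The same guard could wrap the mysql and postgres test classes, so that the remaining indexer tests still run on a machine that lacks those backends.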

13 files changed

+275
-12
lines changed

CHANGES.txt

Lines changed: 5 additions & 0 deletions
@@ -73,6 +73,11 @@ Features:
   for description. Merge request at:
   https://sourceforge.net/p/roundup/code/merge-requests/1/
   Patch supplied by kinggreedy. Applied/tested by John Rouillard
+- issue2550636, issue2550909: Added support for Whoosh indexer.
+  Also adds new config.ini setting called indexer to select
+  indexer. See ``doc/upgrading.txt`` for details. Initial patch
+  done by David Wolever. Patch modified, docs added and committed
+  by John Rouillard.

 Fixed:

doc/features.txt

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@ from Ka-Ping Yee in the Software Carpentry "Track" design competition.
   support them (sqlite, mysql and postgresql)
 - indexed text searching giving fast responses to searches across all
   messages and indexed string properties
-- support for the Xapian full-text indexing engine for large trackers
+- support for the Xapian or Whoosh full-text indexing engine for large trackers

 *documented*
 - documentation exists for installation, upgrading, maintenance, users and

doc/installation.txt

Lines changed: 15 additions & 0 deletions
@@ -67,6 +67,20 @@ Xapian full-text indexer

   Roundup requires Xapian 1.0.0 or newer.

+Whoosh full-text indexer
+  The Whoosh_ full-text indexer is also supported and will be used by
+  default if it is available (and Xapian is not installed). This is
+  recommended if you are anticipating a large number of issues (> 5000).
+
+  You may install Whoosh at any time, even after a tracker has been
+  installed and used. You will need to run the "roundup-admin reindex"
+  command if the tracker has existing data.
+
+  Roundup was tested with Whoosh 2.5.7, but earlier versions in the
+  2.0 series may work. Whoosh is a pure python indexer so it is slower
+  than Xapian, but should be useful for moderately sized trackers.
+  It uses the StandardAnalyzer which is suited for Western languages.
+
 pyopenssl
   If pyopenssl_ is installed the roundup-server can be configured
   to serve trackers over SSL. If you are going to serve roundup via
@@ -88,6 +102,7 @@ Windows Service
   You can run Roundup as a Windows service if pywin32_ is installed.

 .. _Xapian: http://xapian.org/
+.. _Whoosh: https://bitbucket.org/mchaput/whoosh/wiki/Home
 .. _pytz: http://www.python.org/pypi/pytz
 .. _Olson tz database: http://www.twinsun.com/tz/tz-link.htm
 .. _pyopenssl: http://pyopenssl.sourceforge.net
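
The installation notes say Whoosh is used by default when it is available and Xapian is not installed. One way to check which of the two optional modules is importable on a given machine is sketched below. It assumes Python 3 (`importlib.util.find_spec`) and that the modules are imported under the names `xapian` and `whoosh`; the helper `first_available` is hypothetical and not part of Roundup:

```python
import importlib.util


def first_available(candidates):
    """Return the first importable module name in candidates, or None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


# Mirrors the documented preference: Xapian, then Whoosh, then native.
print(first_available(["xapian", "whoosh"]) or "native")
```

If neither module is found, the sketch prints "native", which corresponds to the built-in dbm or rdbms indexer.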

doc/upgrading.txt

Lines changed: 19 additions & 1 deletion
@@ -30,7 +30,7 @@ The ``db/backend_name`` file is no longer used to configure the database
 backend being used for a tracker. The backend is now configured in the
 ``config.ini`` file using the ``backend`` option located in the ``[rdbms]``
 section. For example if ``db/backend_name`` file contains ``sqlite``, a new
-entry in the ``config.ini`` will need to be created::
+entry in the tracker's ``config.ini`` will need to be created::

   [rdbms]

@@ -47,6 +47,24 @@ Note: the ``backend_name`` file may be located in a directory other than
 ``db/`` if you have configured the ``database`` option in the ``[main]``
 section of the ``config.ini`` file to be something other than ``db``.

+New config file option 'indexer' added
+--------------------------------------
+
+With support for the Whoosh indexer, a new config file option has been
+added. You can force Roundup to use a particular text indexer by
+setting this value in the [main] section of the tracker's
+``config.ini`` file (usually placed right before indexer_stopwords)::
+
+  [main]
+
+  ...
+
+  # Force Roundup to use a particular text indexer.
+  # If no indexer is supplied, the first available indexer
+  # will be used in the following order:
+  # Possible values: xapian, whoosh, native (internal).
+  indexer =
+
 html/_generic.404.html in trackers use page template
 ----------------------------------------------------
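
To illustrate how the new option reads, the sketch below parses a tracker-style ``config.ini`` fragment and returns the ``indexer`` value from ``[main]``, with an empty value meaning "auto-select". It uses the standard library's configparser (Python 3) purely for illustration; Roundup has its own configuration module, and the function name `read_indexer` and the sample text are assumptions:

```python
import configparser

# Illustrative fragment only; a real tracker config.ini has many more options.
SAMPLE = """
[main]
indexer = whoosh
indexer_stopwords =
"""

# Values documented for the option; "" means "use the first available".
VALID = ("", "xapian", "whoosh", "native")


def read_indexer(text):
    """Return the [main] indexer value, or "" when it is unset."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    value = parser.get("main", "indexer", fallback="").strip()
    if value not in VALID:
        raise ValueError("Invalid indexer: %r" % value)
    return value


print(read_indexer(SAMPLE))
```

Leaving the option empty or omitting it yields "", which matches the documented behaviour of trying xapian, then whoosh, then the native indexer.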

roundup/backends/back_anydbm.py

Lines changed: 12 additions & 5 deletions
@@ -33,10 +33,7 @@
 from roundup.backends.blobfiles import FileStorage
 from roundup.backends.sessions_dbm import Sessions, OneTimeKeys

-try:
-    from roundup.backends.indexer_xapian import Indexer
-except ImportError:
-    from roundup.backends.indexer_dbm import Indexer
+from roundup.backends.indexer_common import get_indexer

 def db_exists(config):
     # check for the user db
@@ -140,7 +137,17 @@ class Database(FileStorage, hyperdb.Database, roundupdb.Database):
     - check the timestamp of the class file and nuke the cache if it's
       modified. Do some sort of conflict checking on the dirty stuff.
     - perhaps detect write collisions (related to above)?
+
+    attributes:
+      dbtype:
+        holds the value for the type of db. It is used by indexer to
+        identify the database type so it can import the correct indexer
+        module when using native text search mode.
     """
+
+    dbtype = "anydbm"
+
+
     def __init__(self, config, journaltag=None):
         """Open a hyperdatabase given a specifier to some storage.

@@ -167,7 +174,7 @@ def __init__(self, config, journaltag=None):
         self.newnodes = {} # keep track of the new nodes by class
         self.destroyednodes = {}# keep track of the destroyed nodes by class
         self.transactions = []
-        self.indexer = Indexer(self)
+        self.indexer = get_indexer(config, self)
         self.security = security.Security(self)
         os.umask(config.UMASK)

roundup/backends/back_mysql.py

Lines changed: 11 additions & 0 deletions
@@ -110,8 +110,19 @@ def db_exists(config):


 class Database(rdbms_common.Database):
+    """ Mysql DB backend implementation
+
+    attributes:
+      dbtype:
+        holds the value for the type of db. It is used by indexer to
+        identify the database type so it can import the correct indexer
+        module when using native text search mode.
+    """
+
     arg = '%s'

+    dbtype = "mysql"
+
     # used by some code to switch styles of query
     implements_intersect = 0

roundup/backends/back_postgresql.py

Lines changed: 11 additions & 0 deletions
@@ -151,8 +151,19 @@ def set(self, *args, **kwargs):
                 self.db.rollback()

 class Database(rdbms_common.Database):
+    """Postgres DB backend implementation
+
+    attributes:
+      dbtype:
+        holds the value for the type of db. It is used by indexer to
+        identify the database type so it can import the correct indexer
+        module when using native text search mode.
+    """
+
     arg = '%s'

+    dbtype = "postgres"
+
     # used by some code to switch styles of query
     implements_intersect = 1

roundup/backends/back_sqlite.py

Lines changed: 11 additions & 0 deletions
@@ -34,12 +34,23 @@ def db_nuke(config):
     shutil.rmtree(config.DATABASE)

 class Database(rdbms_common.Database):
+    """Sqlite DB backend implementation
+
+    attributes:
+      dbtype:
+        holds the value for the type of db. It is used by indexer to
+        identify the database type so it can import the correct indexer
+        module when using native text search mode.
+    """
+
     # char to use for positional arguments
     if sqlite_version in (2,3):
         arg = '?'
     else:
         arg = '%s'

+    dbtype = "sqlite"
+
     # used by some code to switch styles of query
     implements_intersect = 1

roundup/backends/indexer_common.py

Lines changed: 38 additions & 0 deletions
@@ -107,3 +107,41 @@ def search(self, search_terms, klass, ignore={}):
                 node_dict[linkprop].append(nodeid)
         return nodeids

+def get_indexer(config, db):
+    indexer_name = getattr(config, "INDEXER", "")
+    if not indexer_name:
+        # Try everything
+        try:
+            from indexer_xapian import Indexer
+            return Indexer(db)
+        except ImportError:
+            pass
+
+        try:
+            from indexer_whoosh import Indexer
+            return Indexer(db)
+        except ImportError:
+            pass
+
+        indexer_name = "native" # fallback to native full text search
+
+    if indexer_name == "xapian":
+        from indexer_xapian import Indexer
+        return Indexer(db)
+
+    if indexer_name == "whoosh":
+        from indexer_whoosh import Indexer
+        return Indexer(db)
+
+    if indexer_name == "native":
+        # load proper native indexing based on database type
+        if db.dbtype == "anydbm":
+            from roundup.backends.indexer_dbm import Indexer
+            return Indexer(db)
+
+        if db.dbtype in ("sqlite", "postgres", "mysql"):
+            from roundup.backends.indexer_rdbms import Indexer
+            return Indexer(db)
+
+    raise AssertionError("Invalid indexer: %r" %(indexer_name))
+
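
The selection order in get_indexer can be easier to follow as a pure function with the imports factored out. The sketch below returns the name of the module that would be chosen rather than an Indexer instance. The function name `resolve_indexer` and the `available` parameter (the set of optional indexers that can be imported) are assumptions made for illustration; they do not exist in Roundup:

```python
def resolve_indexer(indexer_name, available, dbtype):
    """Return the indexer module name that would be selected.

    indexer_name: value of the 'indexer' option ("" means auto-select)
    available:    set of importable optional indexers, e.g. {"whoosh"}
    dbtype:       the Database.dbtype attribute, e.g. "sqlite"
    """
    if not indexer_name:
        # Auto-select: first available optional indexer, else native.
        for candidate in ("xapian", "whoosh"):
            if candidate in available:
                return "indexer_" + candidate
        indexer_name = "native"

    if indexer_name in ("xapian", "whoosh"):
        # An explicitly requested indexer that is missing is an error;
        # there is no silent fallback in this branch.
        if indexer_name not in available:
            raise ImportError("No module for indexer %r" % indexer_name)
        return "indexer_" + indexer_name

    if indexer_name == "native":
        if dbtype == "anydbm":
            return "indexer_dbm"
        if dbtype in ("sqlite", "postgres", "mysql"):
            return "indexer_rdbms"

    raise AssertionError("Invalid indexer: %r" % (indexer_name,))


print(resolve_indexer("", {"whoosh"}, "sqlite"))
print(resolve_indexer("native", {"xapian"}, "anydbm"))
```

As in the diff, an unknown indexer name, or the native indexer combined with an unrecognised dbtype, falls through to the AssertionError.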

roundup/backends/indexer_whoosh.py

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
+''' This implements the full-text indexer using Whoosh.
+'''
+import re, os
+
+from whoosh import fields, qparser, index, query, analysis
+
+from roundup.backends.indexer_common import Indexer as IndexerBase
+
+class Indexer(IndexerBase):
+    def __init__(self, db):
+        IndexerBase.__init__(self, db)
+        self.db_path = db.config.DATABASE
+        self.reindex = 0
+        self.writer = None
+        self.index = None
+        self.deleted = set()
+
+    def _get_index(self):
+        if self.index is None:
+            path = os.path.join(self.db_path, 'whoosh-index')
+            if not os.path.exists(path):
+                # StandardAnalyzer lowercases all words and configure it to
+                # block stopwords and words with lengths not between
+                # self.minlength and self.maxlength from indexer_common
+                stopfilter = analysis.StandardAnalyzer( #stoplist=self.stopwords,
+                                                        minsize=self.minlength,
+                                                        maxsize=self.maxlength)
+                os.mkdir(path)
+                schema = fields.Schema(identifier=fields.ID(stored=True,
+                                                            unique=True),
+                                       content=fields.TEXT(analyzer=stopfilter))
+                index.create_in(path, schema)
+            self.index = index.open_dir(path)
+        return self.index
+
+    def save_index(self):
+        '''Save the changes to the index.'''
+        if not self.writer:
+            return
+        self.writer.commit()
+        self.deleted = set()
+        self.writer = None
+
+    def close(self):
+        '''close the indexing database'''
+        pass
+
+    def rollback(self):
+        if not self.writer:
+            return
+        self.writer.cancel()
+        self.deleted = set()
+        self.writer = None
+
+    def force_reindex(self):
+        '''Force a reindexing of the database. This essentially
+        empties the tables ids and index and sets a flag so
+        that the databases are reindexed'''
+        self.reindex = 1
+
+    def should_reindex(self):
+        '''returns True if the indexes need to be rebuilt'''
+        return self.reindex
+
+    def _get_writer(self):
+        if self.writer is None:
+            self.writer = self._get_index().writer()
+        return self.writer
+
+    def _get_searcher(self):
+        return self._get_index().searcher()
+
+    def add_text(self, identifier, text, mime_type='text/plain'):
+        ''' "identifier" is (classname, itemid, property) '''
+        if mime_type != 'text/plain':
+            return
+
+        if not text:
+            text = u''
+
+        if not isinstance(text, unicode):
+            text = unicode(text, "utf-8", "replace")
+
+        # We use the identifier twice: once in the actual "text" being
+        # indexed so we can search on it, and again as the "data" being
+        # indexed so we know what we're matching when we get results
+        identifier = u"%s:%s:%s"%identifier
+
+        # FIXME need to enhance this to handle the whoosh.store.LockError
+        # that maybe raised if there is already another process with a lock.
+        writer = self._get_writer()
+
+        # Whoosh gets upset if a document is deleted twice in one transaction,
+        # so we keep a list of the documents we have so far deleted to make
+        # sure that we only delete them once.
+        if identifier not in self.deleted:
+            searcher = self._get_searcher()
+            results = searcher.search(query.Term("identifier", identifier))
+            if len(results) > 0:
+                writer.delete_by_term("identifier", identifier)
+                self.deleted.add(identifier)
+
+        # Note: use '.lower()' because it seems like Whoosh gets
+        # better results that way.
+        writer.add_document(identifier=identifier, content=text)
+        self.save_index()
+
+    def find(self, wordlist):
+        '''look up all the words in the wordlist.
+        If none are found return an empty dictionary
+        * more rules here
+        '''
+
+        wordlist = [ word for word in wordlist
+                     if (self.minlength <= len(word) <= self.maxlength) and
+                     not self.is_stopword(word.upper()) ]
+
+        if not wordlist:
+            return {}
+
+        searcher = self._get_searcher()
+        q = query.And([ query.FuzzyTerm("content", word.lower())
+                        for word in wordlist ])
+
+        results = searcher.search(q, limit=None)
+
+        return [tuple(result["identifier"].split(':'))
+                for result in results]
+
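
Two pieces of the indexer above do not depend on Whoosh and can be tried in isolation: the word filter applied at the start of find(), and the way an identifier tuple is flattened into the stored ``identifier`` string and split back into a tuple. The sketch below reproduces both under stated assumptions: the function names are hypothetical, and the length limits and stopword set are illustrative values, not Roundup's actual settings.

```python
def filter_words(wordlist, minlength, maxlength, stopwords):
    """Keep words within the length limits that are not stopwords.

    Stopwords are compared in upper case, as find() does with
    is_stopword(word.upper()).
    """
    return [word for word in wordlist
            if minlength <= len(word) <= maxlength
            and word.upper() not in stopwords]


def encode_identifier(identifier):
    """Join (classname, itemid, property) into the stored ID string."""
    return "%s:%s:%s" % identifier


def decode_identifier(stored):
    """Split the stored ID string back into a tuple of strings."""
    return tuple(stored.split(":"))


stopwords = {"THE", "AND"}  # illustrative only
words = filter_words(["the", "is", "tracker", "x" * 40], 3, 25, stopwords)
print(words)
print(decode_identifier(encode_identifier(("issue", "12", "title"))))
```

Note that in the diff find() returns an empty dictionary when every word is filtered out but a list of tuples otherwise, so callers that only iterate over the result are unaffected while callers that check the type are not.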
