Ferenczi Jim created LUCENE-7423:
------------------------------------
Summary: AutoPrefixPostingsFormat: a PostingsFormat optimized for
prefix queries on text fields.
Key: LUCENE-7423
URL: https://issues.apache.org/jira/browse/LUCENE-7423
Project: Lucene - Core
Issue Type: New Feature
Components: modules/sandbox
Reporter: Ferenczi Jim
Priority: Minor
The autoprefix terms dict added in
https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with
https://issues.apache.org/jira/browse/LUCENE-7317.
The new points API is now used to do efficient range queries but the
replacement for prefix string queries is unclear. The edge ngrams could be used
instead but they have a lot of drawbacks and are hard to configure correctly.
The completion postings format is also a good replacement but it requires to
have a big FST in RAM and it cannot be intersected with other fields.
This patch is a proposal for a new PostingsFormat optimized for prefix query on
string fields. It detects prefixes that match "enough" terms and writes
auto-prefix terms into their own virtual field.
At search time the virtual field is used to speed up prefix queries that match
"enough" terms.
The auto-prefix terms are built in two pass:
* The first pass builds a compact prefix tree. Since the terms enum is sorted
the prefixes are flushed on the fly depending on the input. For each prefix we
build its corresponding inverted lists using a DocIdSetBuilder. The first pass
visits each term of the field TermsEnum only once. When a prefix is flushed
from the prefix tree its inverted lists is dumped into a temporary file for
further use. This is necessary since the prefixes are not sorted when they are
removed from the tree. The selected auto prefixes are sorted at the end of the
first pass.
* The second pass is a sorted scan of the prefixes and the temporary file is
used to read the corresponding inverted lists.
The patch is just a POC and there are rooms for optimizations but the first
results are promising:
I tested the patch with the geonames dataset. I indexed all the titles with the
KeywordAnalyzer and compared the index/merge time and the size of the indices.
The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes
572M on disk and it took 130s to index and optimize the 11M titles.
The auto prefix index takes 287M on disk and took 70s to index and optimize the
same 11M titles. Among the 287M, only 170M are used for the auto prefix fields
and the rest is for the regular keyword field. All the auto prefixes were
generated for this test (at least 2 terms per auto-prefix).
The queries have similar performance since we are sure on both sides that one
inverted list can answer any prefix query.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]