[
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015902#comment-13015902
]
Dawid Weiss commented on SOLR-2378:
-----------------------------------
This is a git patch (trim the first level to apply without git, should work out
of the box) against the trunk that implements FST-based suggester. I've also
added a more realistic benchmark that shows performance limitations of the
current suggester implementations, especially in the "onlyMorePopular" mode.
Some benchmarks from my machine:
Time to create internal data structures out of 50.000 suggestion terms (692kB
of UTF8):
{noformat}
JaspellLookup input: 50,000, time[ms]: 30 [+- 4.93]
TSTLookup input: 50,000, time[ms]: 39 [+- 2.56]
FSTLookup input: 50,000, time[ms]: 62 [+- 2.52]
{noformat}
Memory used:
{noformat}
JaspellLookup size[B]: 7,868,863
TSTLookup size[B]: 7,914,144
FSTLookup size[B]: 300,148
{noformat}
Traversal speed (randomized prefixes of input sequences). We start with full
matches:
{noformat}
-- prefixes: 100-200, num: 7, onlyMorePopular: true
JaspellLookup queries: 50000, time[ms]: 108 [+- 6.23], ~qps: 462
TSTLookup queries: 50000, time[ms]: 54 [+- 1.54], ~qps: 922
FSTLookup queries: 50000, time[ms]: 66 [+- 1.30], ~qps: 761
{noformat}
These results are overly optimistic because users rarely type in something in
full; let's cut the prefix length to 6-9 chars:
{noformat}
-- prefixes: 6-9, num: 7, onlyMorePopular: true
JaspellLookup queries: 50000, time[ms]: 155 [+- 4.20], ~qps: 322
TSTLookup queries: 50000, time[ms]: 208 [+- 3.99], ~qps: 241
FSTLookup queries: 50000, time[ms]: 71 [+- 1.36], ~qps: 700
{noformat}
Clearly, the FST is more resilient to the length of the input prefix... let's
cut it to 2-4 characters:
{noformat}
-- prefixes: 2-4, num: 7, onlyMorePopular: true
JaspellLookup queries: 50000, time[ms]: 420 [+- 5.37], ~qps: 119
TSTLookup queries: 50000, time[ms]: 1687 [+- 10.96], ~qps: 30
FSTLookup queries: 50000, time[ms]: 90 [+- 2.27], ~qps: 554
{noformat}
I didn't have the time to look into it, but TST's result is surprisingly bad
with such short prefixes. FST keeps nearly the same perf. level.
In the current implementation I throw an exception if somebody tries to get
suggestions from FST without sorting by weights (this is equivalent to building
the suggester with a single weight for all terms). This could be implemented,
but at a small performance penalty -- to be considered if you think it is
useful. The above performance problems for short prefixes are interestingly
present even with onlyMorePopular set to false:
{noformat}
-- prefixes: 100-200, num: 7, onlyMorePopular: false
JaspellLookup queries: 50000, time[ms]: 94 [+- 6.45], ~qps: 532
TSTLookup queries: 50000, time[ms]: 46 [+- 0.98], ~qps: 1092
-- prefixes: 6-9, num: 7, onlyMorePopular: false
JaspellLookup queries: 50000, time[ms]: 123 [+- 3.12], ~qps: 405
TSTLookup queries: 50000, time[ms]: 188 [+- 3.03], ~qps: 266
-- prefixes: 2-4, num: 7, onlyMorePopular: false
JaspellLookup queries: 50000, time[ms]: 225 [+- 5.69], ~qps: 222
TSTLookup queries: 50000, time[ms]: 1523 [+- 16.69], ~qps: 33
{noformat}
Peek at the code and let me know if you can think of any improvements/
modifications, especially wrt Solr infrastructure (I specifically didn't
implement any real solr instance test case).
> FST-based Lookup (suggestions) for prefix matches.
> --------------------------------------------------
>
> Key: SOLR-2378
> URL: https://issues.apache.org/jira/browse/SOLR-2378
> Project: Solr
> Issue Type: New Feature
> Components: spellchecker
> Reporter: Dawid Weiss
> Assignee: Dawid Weiss
> Labels: lookup, prefix
> Fix For: 4.0
>
> Attachments: SOLR-2378.patch
>
>
> Implement a subclass of Lookup based on finite state automata/ transducers
> (Lucene FST package). This issue is for implementing a relatively basic
> prefix matcher, we will handle infixes and other types of input matches
> gradually. Impl. phases:
> - -write a DFA based suggester effectively identical to ternary tree based
> solution right now,-
> - -baseline benchmark against tern. tree (memory consumption, rebuilding
> speed, indexing speed; reuse Andrzej's benchmark code)-
> - -modify DFA to encode term weights directly in the automaton (optimize for
> onlyMostPopular case)-
> - -benchmark again-
> - add infix suggestion support with prefix matches boosted higher (?)
> - benchmark again
> - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]