[ 
https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015902#comment-13015902
 ] 

Dawid Weiss commented on SOLR-2378:
-----------------------------------

This is a git patch (trim the first level to apply without git, should work out 
of the box) against the trunk that implements FST-based suggester. I've also 
added a more realistic benchmark that shows performance limitations of the 
current suggester implementations, especially in the "onlyMorePopular" mode. 
Some benchmarks from my machine:

Time to create internal data structures out of 50.000 suggestion terms (692kB 
of UTF8):
{noformat}
JaspellLookup   input: 50,000, time[ms]: 30 [+- 4.93]
TSTLookup       input: 50,000, time[ms]: 39 [+- 2.56]
FSTLookup       input: 50,000, time[ms]: 62 [+- 2.52]
{noformat}

Memory used:
{noformat}
JaspellLookup   size[B]:    7,868,863
TSTLookup       size[B]:    7,914,144
FSTLookup       size[B]:      300,148
{noformat}

Traversal speed (randomized prefixes of input sequences). We start with full 
matches:
{noformat}
-- prefixes: 100-200, num: 7, onlyMorePopular: true
JaspellLookup   queries: 50000, time[ms]: 108 [+- 6.23], ~qps: 462
TSTLookup       queries: 50000, time[ms]: 54 [+- 1.54], ~qps: 922
FSTLookup       queries: 50000, time[ms]: 66 [+- 1.30], ~qps: 761
{noformat}

These results are overly optimistic because users rarely type in something in 
full; let's cut the prefix length to 6-9 chars:
{noformat}
-- prefixes: 6-9, num: 7, onlyMorePopular: true
JaspellLookup   queries: 50000, time[ms]: 155 [+- 4.20], ~qps: 322
TSTLookup       queries: 50000, time[ms]: 208 [+- 3.99], ~qps: 241
FSTLookup       queries: 50000, time[ms]: 71 [+- 1.36], ~qps: 700
{noformat}

Clearly, the FST is more resilient to the length of the input prefix... let's 
cut it to 2-4 characters:
{noformat}
-- prefixes: 2-4, num: 7, onlyMorePopular: true
JaspellLookup   queries: 50000, time[ms]: 420 [+- 5.37], ~qps: 119
TSTLookup       queries: 50000, time[ms]: 1687 [+- 10.96], ~qps: 30
FSTLookup       queries: 50000, time[ms]: 90 [+- 2.27], ~qps: 554
{noformat}

I didn't have the time to look into it, but TST's result is surprisingly bad 
with such short prefixes. FST keeps nearly the same perf. level.

In the current implementation I throw an exception if somebody tries to get 
suggestions from FST without sorting by weights (this is equivalent to building 
the suggester with a single weight for all terms). This could be implemented, 
but at a small performance penalty -- to be considered if you think it is 
useful. The above performance problems for short prefixes are interestingly 
present even with onlyMorePopular set to false:
{noformat}
-- prefixes: 100-200, num: 7, onlyMorePopular: false
JaspellLookup   queries: 50000, time[ms]: 94 [+- 6.45], ~qps: 532
TSTLookup       queries: 50000, time[ms]: 46 [+- 0.98], ~qps: 1092
-- prefixes: 6-9, num: 7, onlyMorePopular: false
JaspellLookup   queries: 50000, time[ms]: 123 [+- 3.12], ~qps: 405
TSTLookup       queries: 50000, time[ms]: 188 [+- 3.03], ~qps: 266
-- prefixes: 2-4, num: 7, onlyMorePopular: false
JaspellLookup   queries: 50000, time[ms]: 225 [+- 5.69], ~qps: 222
TSTLookup       queries: 50000, time[ms]: 1523 [+- 16.69], ~qps: 33
{noformat}

Peek at the code and let me know if you can think of any improvements/ 
modifications, especially wrt Solr infrastructure (I specifically didn't 
implement any real solr instance test case). 


> FST-based Lookup (suggestions) for prefix matches.
> --------------------------------------------------
>
>                 Key: SOLR-2378
>                 URL: https://issues.apache.org/jira/browse/SOLR-2378
>             Project: Solr
>          Issue Type: New Feature
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>              Labels: lookup, prefix
>             Fix For: 4.0
>
>         Attachments: SOLR-2378.patch
>
>
> Implement a subclass of Lookup based on finite state automata/ transducers 
> (Lucene FST package). This issue is for implementing a relatively basic 
> prefix matcher, we will handle infixes and other types of input matches 
> gradually. Impl. phases:
> - -write a DFA based suggester effectively identical to ternary tree based 
> solution right now,-
> - -baseline benchmark against tern. tree (memory consumption, rebuilding 
> speed, indexing speed; reuse Andrzej's benchmark code)-
> - -modify DFA to encode term weights directly in the automaton (optimize for 
> onlyMostPopular case)-
> - -benchmark again-
> - add infix suggestion support with prefix matches boosted higher (?)
> - benchmark again
> - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to