[jira] [Updated] (LUCENE-6421) Add two-phase support to MultiPhraseQuery

Robert Muir (JIRA) Mon, 13 Apr 2015 22:06:36 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-6421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Robert Muir updated LUCENE-6421:
--------------------------------
    Attachment: LUCENE-6421_luceneutil.patch
                LUCENE-6421.patch

See attached patch and benchmarks modifications / tasks file.

* no longer keeps subs "one document ahead"; it's like a normal disjunction
* position reading/merging is deferred until freq() is called.
* general cleanups
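
To illustrate the first two points, here is a minimal, self-contained sketch of a union over sub-postings that stays positioned like an ordinary disjunction and only merges positions when freq() is called. It is a toy model, not the code in the attached patch: the class names LazyUnionSketch and Sub, the array-backed postings, and the positionReads counter are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of a lazy union of postings; NOT Lucene's UnionPostingsEnum. */
final class LazyUnionSketch {
    /** Hypothetical sub-postings: sorted doc ids with the positions for each doc. */
    static final class Sub {
        final int[] docs;
        final int[][] positions;
        int idx = -1;
        int positionReads = 0; // how many times positions were actually read

        Sub(int[] docs, int[][] positions) {
            this.docs = docs;
            this.positions = positions;
        }

        int docID() {
            if (idx < 0) return -1;
            return idx < docs.length ? docs[idx] : Integer.MAX_VALUE;
        }

        /** Moves to the first doc >= target; only called when docID() < target. */
        int advance(int target) {
            idx++;
            while (idx < docs.length && docs[idx] < target) idx++;
            return docID();
        }

        int[] readPositions() {
            positionReads++;
            return positions[idx];
        }
    }

    private final List<Sub> subs;
    private int doc = -1;
    private int[] merged; // stays null until freq()/positions() is called

    LazyUnionSketch(List<Sub> subs) {
        this.subs = subs;
    }

    int nextDoc() {
        return advance(doc + 1);
    }

    /** Doc-level iteration only: no positions are touched here. */
    int advance(int target) {
        int min = Integer.MAX_VALUE;
        for (Sub s : subs) {
            if (s.docID() < target) s.advance(target);
            min = Math.min(min, s.docID());
        }
        merged = null; // invalidate positions merged for the previous doc
        doc = min;
        return doc;
    }

    /** Merges positions from the subs on the current doc, once per doc. */
    int[] positions() {
        if (merged == null) {
            int[] buf = new int[0];
            if (doc != Integer.MAX_VALUE) {
                for (Sub s : subs) {
                    if (s.docID() != doc) continue;
                    int[] p = s.readPositions();
                    int old = buf.length;
                    buf = Arrays.copyOf(buf, old + p.length);
                    System.arraycopy(p, 0, buf, old, p.length);
                }
            }
            Arrays.sort(buf);
            merged = buf;
        }
        return merged;
    }

    int freq() {
        return positions().length;
    }
}
```

In this sketch a caller that only iterates documents never pays for position decoding; the cost appears only for documents on which freq() is actually requested.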

The problem with the current code is more than just two-phase iteration: 
because it always reads all positions from all subs on nextDoc()/advance(), it 
slows down even the simplest multiphrase queries, like these added to the tasks 
file:
{noformat}
MultiPhraseHHH: multiPhrase//(body:in|of the)
MultiPhraseHHM: multiPhrase//(body:in|of your)
MultiPhraseHHL: multiPhrase//(body:in|of harvard)
MultiPhraseMMH: multiPhrase//(body:northern|southern states)
MultiPhraseMMM: multiPhrase//(body:northern|southern usa)
MultiPhraseMML: multiPhrase//(body:northern|southern iraq)
{noformat}

So in the example of northern|southern states, today all positions are read 
from either or both 'northern' and 'southern', regardless of whether 'states' 
is present in the doc at all. Filters will only aggravate the situation.
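
The northern|southern states case can be sketched as a two-phase match: a cheap doc-level approximation first, then position checks only on the surviving candidates. This is a simplified illustration under stated assumptions, not Lucene's API: the class TwoPhaseSketch, the in-memory map index, and the positionReads counter are hypothetical, and it handles only a two-slot exact phrase.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy two-phase matcher for "(alt1|alt2|...) next"; NOT Lucene code. */
final class TwoPhaseSketch {
    // Hypothetical index: term -> (doc id -> positions of the term in that doc)
    private final Map<String, Map<Integer, int[]>> index;
    int positionReads = 0; // counts position lookups, to make the savings visible

    TwoPhaseSketch(Map<String, Map<Integer, int[]>> index) {
        this.index = index;
    }

    private Map<Integer, int[]> postings(String term) {
        return index.getOrDefault(term, Collections.emptyMap());
    }

    private int[] readPositions(String term, int doc) {
        positionReads++;
        return postings(term).get(doc);
    }

    /** Phase 1: docs containing at least one alternative AND the next term. */
    Set<Integer> approximation(List<String> alternatives, String next) {
        Set<Integer> candidates = new TreeSet<>();
        for (String term : alternatives) {
            candidates.addAll(postings(term).keySet());
        }
        candidates.retainAll(postings(next).keySet());
        return candidates;
    }

    /** Phase 2: read positions and verify that some alternative directly precedes next. */
    boolean matches(int doc, List<String> alternatives, String next) {
        Set<Integer> nextPositions = new HashSet<>();
        for (int p : readPositions(next, doc)) nextPositions.add(p);
        for (String term : alternatives) {
            if (!postings(term).containsKey(doc)) continue;
            for (int p : readPositions(term, doc)) {
                if (nextPositions.contains(p + 1)) return true;
            }
        }
        return false;
    }

    List<Integer> search(List<String> alternatives, String next) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : approximation(alternatives, next)) {
            if (matches(doc, alternatives, next)) hits.add(doc);
        }
        return hits;
    }
}
```

A document that contains 'northern' but not 'states' is rejected in phase 1, so its positions are never read. A filter applied to the approximation would shrink the candidate set in the same way.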

Benchmarking these is super-slow, but after a few iterations it looks like this:
{noformat}
                    Task   QPS trunk      StdDev   QPS patch      StdDev                Pct diff
          MultiPhraseHHH        0.34      (2.1%)        0.33      (1.4%)   -2.1% (  -5% -    1%)
          MultiPhraseHHL       17.26      (0.7%)       17.67      (0.5%)    2.3% (   1% -    3%)
          MultiPhraseHHM        5.13      (1.6%)        5.34      (0.3%)    4.1% (   2% -    6%)
          MultiPhraseMMH       33.99      (1.3%)       39.19      (0.7%)   15.3% (  13% -   17%)
          MultiPhraseMML      160.11      (0.2%)      202.29      (0.6%)   26.3% (  25% -   27%)
          MultiPhraseMMM       72.20      (1.7%)       95.66      (2.0%)   32.5% (  28% -   36%)
{noformat}

> Add two-phase support to MultiPhraseQuery
> -----------------------------------------
>
>                 Key: LUCENE-6421
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6421
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>         Attachments: LUCENE-6421.patch, LUCENE-6421_luceneutil.patch
>
>
> Two-phase support currently works for both sloppy and exact Scorers but it 
> does not work if you have multiple terms at the same position 
> (MultiPhraseQuery).
> This is because UnionPostingsEnum.nextDoc() aggressively reads and merges all 
> the positions. Just making this initialization lazy might be enough?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

