[
https://issues.apache.org/jira/browse/LUCENE-6421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Muir updated LUCENE-6421:
--------------------------------
Attachment: LUCENE-6421_luceneutil.patch
LUCENE-6421.patch
See the attached patch and the benchmark modifications / tasks file.
* no longer keeps subs "one document ahead"; it's like a normal disjunction
* positions reading/merging are deferred until freq() is called.
* general cleanups
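As a rough illustration of the first two bullets, here is a minimal sketch of a union that moves through doc IDs like a plain disjunction and merges positions only when freq() is asked for. It is not the code in the patch and does not use the Lucene API: SubPostings and LazyUnion are hypothetical in-memory stand-ins, and the positionReads counter exists only to make the deferred work visible.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for one term's postings; not a Lucene class. */
final class SubPostings {
    final int[] docs;          // sorted doc IDs
    final int[][] positions;   // positions[i] belongs to docs[i]
    int idx = -1;              // cursor into docs
    int positionReads = 0;     // how many times positions were actually read

    SubPostings(int[] docs, int[][] positions) {
        this.docs = docs;
        this.positions = positions;
    }

    int docID() {
        if (idx < 0) return -1;
        return idx < docs.length ? docs[idx] : Integer.MAX_VALUE;
    }

    int nextDoc() { idx++; return docID(); }

    int[] readPositions() { positionReads++; return positions[idx]; }
}

/** Union over several subs that merges positions only when freq() is called. */
final class LazyUnion {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final SubPostings[] subs;
    private int doc = -1;
    private int[] merged;      // null until freq() is called for the current doc

    LazyUnion(SubPostings... subs) { this.subs = subs; }

    /** Advances doc IDs only, like a plain disjunction; no positions are touched. */
    int nextDoc() {
        int min = NO_MORE_DOCS;
        for (SubPostings sub : subs) {
            if (sub.docID() <= doc) sub.nextDoc();   // only move subs sitting on the current doc
            min = Math.min(min, sub.docID());
        }
        doc = min;
        merged = null;                               // invalidate the previous doc's merge
        return doc;
    }

    /** Reads and merges positions on first call for the current doc, then caches them. */
    int freq() {
        if (doc == -1 || doc == NO_MORE_DOCS) {
            throw new IllegalStateException("not positioned on a document");
        }
        if (merged == null) {
            List<Integer> all = new ArrayList<>();
            for (SubPostings sub : subs) {
                if (sub.docID() == doc) {
                    for (int p : sub.readPositions()) all.add(p);
                }
            }
            merged = all.stream().mapToInt(Integer::intValue).sorted().toArray();
        }
        return merged.length;
    }

    /** Merged, sorted positions for the current doc. */
    int[] positions() { freq(); return merged; }
}
```

With this shape, a caller that iterates every document but never calls freq() leaves positionReads at zero on every sub, which is the behavior the bullets describe.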
The problem with the current code is more than just two-phase iteration:
because it always reads all positions from all subs on nextDoc()/advance(), it
slows down even the simplest multiphrase queries, like these added to the tasks
file:
{noformat}
MultiPhraseHHH: multiPhrase//(body:in|of the)
MultiPhraseHHM: multiPhrase//(body:in|of your)
MultiPhraseHHL: multiPhrase//(body:in|of harvard)
MultiPhraseMMH: multiPhrase//(body:northern|southern states)
MultiPhraseMMM: multiPhrase//(body:northern|southern usa)
MultiPhraseMML: multiPhrase//(body:northern|southern iraq)
{noformat}
So in the example of northern|southern states, today all positions are read
from 'northern', 'southern', or both, regardless of whether 'states' is
present in the doc at all. Filters will only aggravate the situation.
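To make the northern|southern states case concrete, here is a small sketch of the two-phase idea: first intersect doc IDs only, then read positions just for the surviving candidates. It assumes an exact phrase (no slop) and uses plain maps from doc ID to positions; the class and method names are hypothetical and nothing here is Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class TwoPhaseMultiPhraseSketch {
    /** Counts how many per-document position lists were read. */
    static int positionReads = 0;

    static int[] read(Map<Integer, int[]> postings, int doc) {
        positionReads++;
        return postings.get(doc);
    }

    /** Returns docs where (first|alt) occurs immediately before next. */
    static List<Integer> search(Map<Integer, int[]> first,
                                Map<Integer, int[]> alt,
                                Map<Integer, int[]> next) {
        // Phase 1 (approximation): doc IDs only, union(first, alt) AND next.
        Set<Integer> candidates = new TreeSet<>(first.keySet());
        candidates.addAll(alt.keySet());
        candidates.retainAll(next.keySet());

        List<Integer> hits = new ArrayList<>();
        for (int doc : candidates) {
            // Phase 2 (confirmation): positions are read only for candidates.
            Set<Integer> starts = new TreeSet<>();
            if (first.containsKey(doc)) for (int p : read(first, doc)) starts.add(p);
            if (alt.containsKey(doc)) for (int p : read(alt, doc)) starts.add(p);
            for (int p : read(next, doc)) {
                if (starts.contains(p - 1)) {   // union term directly precedes 'next'
                    hits.add(doc);
                    break;
                }
            }
        }
        return hits;
    }
}
```

In this toy setup, documents that contain 'northern' or 'southern' but not 'states' never have their positions read, whereas an eager union would read them on every nextDoc()/advance().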
Benchmarking these is super-slow, but after a few iterations it looks like this:
{noformat}
Task            QPS trunk  StdDev  QPS patch  StdDev  Pct diff
MultiPhraseHHH       0.34  (2.1%)       0.33  (1.4%)   -2.1% ( -5% -   1%)
MultiPhraseHHL      17.26  (0.7%)      17.67  (0.5%)    2.3% (  1% -   3%)
MultiPhraseHHM       5.13  (1.6%)       5.34  (0.3%)    4.1% (  2% -   6%)
MultiPhraseMMH      33.99  (1.3%)      39.19  (0.7%)   15.3% ( 13% -  17%)
MultiPhraseMML     160.11  (0.2%)     202.29  (0.6%)   26.3% ( 25% -  27%)
MultiPhraseMMM      72.20  (1.7%)      95.66  (2.0%)   32.5% ( 28% -  36%)
{noformat}
> Add two-phase support to MultiPhraseQuery
> -----------------------------------------
>
> Key: LUCENE-6421
> URL: https://issues.apache.org/jira/browse/LUCENE-6421
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Robert Muir
> Attachments: LUCENE-6421.patch, LUCENE-6421_luceneutil.patch
>
>
> Two-phase support currently works for both sloppy and exact Scorers but it
> does not work if you have multiple terms at the same position
> (MultiPhraseQuery).
> This is because UnionPostingsEnum.nextDoc() aggressively reads and merges all
> the positions. Even making this initialization lazy might just be enough?