[ https://issues.apache.org/jira/browse/LUCENE-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-6184:
---------------------------------
Attachment: LUCENE-6184.patch
Here is a patch:
- BulkScorer now returns a hint on the next matching doc after {{max}}
- BooleanScorer uses this information in order to only score windows of
documents where at least one clause matches (by putting the bulk scorers into a
priority queue)
This helps boolean queries with dense clauses, since it made it possible to
remove the {{hasMatches}} optimization. That optimization avoided iterating
over the bit set when there were no matches, but had the drawback of making
OrCollector.collect heavier.
It also helps boolean queries with very sparse clauses, since they now only
collect windows where they have matches.
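To illustrate the idea, here is a minimal sketch of window scoring driven by
next-match hints. It is not the code from the patch: SketchBulkScorer,
WindowedDisjunctionSketch and their methods are hypothetical names, clauses are
modelled as sorted arrays of doc IDs, and "scoring" is reduced to collecting
doc IDs.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

/** Hypothetical stand-in for a bulk scorer; a clause is a sorted array of doc IDs. */
final class SketchBulkScorer {
    private final int[] docs;
    private int pos = 0;
    /** Hint: next doc not yet collected, or Integer.MAX_VALUE when exhausted. */
    int next;

    SketchBulkScorer(int[] sortedDocs) {
        this.docs = sortedDocs;
        this.next = sortedDocs.length == 0 ? Integer.MAX_VALUE : sortedDocs[0];
    }

    /** Collects matches in [min, max) and returns the first remaining doc (the hint). */
    int score(Collection<Integer> collector, int min, int max) {
        while (pos < docs.length && docs[pos] < min) pos++;
        while (pos < docs.length && docs[pos] < max) collector.add(docs[pos++]);
        next = pos < docs.length ? docs[pos] : Integer.MAX_VALUE;
        return next;
    }
}

final class WindowedDisjunctionSketch {
    static final int WINDOW_SIZE = 2048; // must be a power of two for the mask below

    static final class Result {
        final List<Integer> docs;
        final int windowsScored;

        Result(List<Integer> docs, int windowsScored) {
            this.docs = docs;
            this.windowsScored = windowsScored;
        }
    }

    /** Scores only the windows in which at least one clause has a match. */
    static Result scoreAll(List<SketchBulkScorer> scorers, int maxDoc) {
        // Scorers are ordered by their hint, so the head tells us which window to visit next.
        PriorityQueue<SketchBulkScorer> queue =
            new PriorityQueue<>(Comparator.comparingInt((SketchBulkScorer s) -> s.next));
        queue.addAll(scorers);
        TreeSet<Integer> matches = new TreeSet<>(); // a doc matched by several clauses is kept once
        int windowsScored = 0;
        while (!queue.isEmpty() && queue.peek().next < maxDoc) {
            // Jump straight to the window that contains the smallest hint.
            int windowMin = queue.peek().next & ~(WINDOW_SIZE - 1);
            int windowMax = Math.min(windowMin + WINDOW_SIZE, maxDoc);
            windowsScored++;
            // Only scorers whose hint falls inside this window do any work.
            while (queue.peek().next < windowMax) {
                SketchBulkScorer scorer = queue.poll();
                scorer.score(matches, windowMin, windowMax);
                queue.add(scorer); // re-insert with the updated hint
            }
        }
        return new Result(new ArrayList<>(matches), windowsScored);
    }
}
```

With two hypothetical clauses matching {5, 100000} and {7, 100001, 300000} and
a maxDoc of 1,000,000, this sketch visits 3 windows instead of all 489.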
Here are the results of the luceneutil benchmark on the 10M Wikipedia corpus. I
added some tasks to test sparse clauses: VeryLow is for term queries that have
a doc freq between 400 and 500, and VeryLowVeryLow is a disjunction of 2 such
terms:
{code}
Task             QPS baseline   StdDev  QPS patch   StdDev Pct diff
HighSloppyPhrase        32.70   (4.3%)      32.39   (4.0%)    -1.0%  ( -8% -  7%)
Prefix3                162.73   (5.8%)     161.32   (6.6%)    -0.9%  (-12% - 12%)
LowTerm                803.22   (6.2%)     797.47   (6.2%)    -0.7%  (-12% - 12%)
IntNRQ                  13.84   (6.9%)      13.75   (7.3%)    -0.7%  (-13% - 14%)
OrHighNotLow            60.36   (2.7%)      59.96   (3.9%)    -0.7%  ( -7% -  6%)
LowSloppyPhrase         17.94   (3.0%)      17.82   (2.8%)    -0.7%  ( -6% -  5%)
VeryLow               6095.14   (5.8%)    6057.73   (5.0%)    -0.6%  (-10% - 10%)
LowPhrase              276.59   (2.2%)     274.97   (1.6%)    -0.6%  ( -4% -  3%)
OrHighNotMed            43.56   (2.6%)      43.32   (3.3%)    -0.6%  ( -6% -  5%)
OrNotHighLow           924.37   (2.5%)     919.21   (2.4%)    -0.6%  ( -5% -  4%)
AndHighLow             703.38   (2.9%)     699.62   (3.6%)    -0.5%  ( -6% -  6%)
Wildcard                93.74   (3.1%)      93.29   (3.0%)    -0.5%  ( -6% -  5%)
MedSloppyPhrase         79.24   (2.8%)      78.91   (2.3%)    -0.4%  ( -5% -  4%)
OrNotHighMed           207.14   (2.0%)     206.31   (2.2%)    -0.4%  ( -4% -  3%)
HighSpanNear            12.56   (0.9%)      12.53   (1.1%)    -0.2%  ( -2% -  1%)
HighPhrase              13.58   (2.3%)      13.55   (2.1%)    -0.2%  ( -4% -  4%)
OrHighNotHigh           33.29   (1.6%)      33.24   (2.0%)    -0.2%  ( -3% -  3%)
OrNotHighHigh           56.10   (1.6%)      56.00   (1.8%)    -0.2%  ( -3% -  3%)
HighTerm                91.52   (2.6%)      91.37   (2.7%)    -0.2%  ( -5% -  5%)
Respell                 71.63   (5.5%)      71.52   (5.3%)    -0.1%  (-10% - 11%)
LowSpanNear             18.17   (1.0%)      18.16   (0.8%)    -0.1%  ( -1% -  1%)
MedTerm                146.69   (2.5%)     146.56   (3.0%)    -0.1%  ( -5% -  5%)
AndHighMed             274.22   (2.6%)     274.00   (2.3%)    -0.1%  ( -4% -  4%)
MedSpanNear             31.01   (0.9%)      31.00   (1.1%)    -0.0%  ( -1% -  1%)
AndHighHigh             77.34   (1.8%)      77.32   (1.7%)    -0.0%  ( -3% -  3%)
MedPhrase               19.10   (6.2%)      19.10   (6.2%)     0.0%  (-11% - 13%)
Fuzzy2                  26.84   (6.8%)      26.88   (7.6%)     0.1%  (-13% - 15%)
PKLookup               272.91   (3.1%)     274.16   (2.7%)     0.5%  ( -5% -  6%)
OrHighMed               59.25  (11.8%)      62.90   (6.5%)     6.2%  (-10% - 27%)
OrHighLow               64.54  (11.9%)      68.73   (6.5%)     6.5%  (-10% - 28%)
OrHighHigh              42.89  (12.2%)      45.77   (6.9%)     6.7%  (-11% - 29%)
Fuzzy1                  95.20   (4.2%)     101.65   (5.9%)     6.8%  ( -3% - 17%)
VeryLowVeryLow        1936.31   (3.2%)    2263.44   (3.3%)    16.9%  ( 10% - 24%)
{code}
> BooleanScorer should better deal with sparse clauses
> ----------------------------------------------------
>
> Key: LUCENE-6184
> URL: https://issues.apache.org/jira/browse/LUCENE-6184
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Fix For: Trunk, 5.1
>
> Attachments: LUCENE-6184.patch
>
>
> The way that BooleanScorer works looks like this:
> {code}
> for each (window of 2048 docs) {
>   for each (optional scorer) {
>     scorer.score(window)
>   }
> }
> {code}
> This is not efficient for very sparse clauses (doc freq much lower than
> maxDoc/2048) since we keep on scoring windows of documents that do not match
> anything. BooleanScorer2 currently performs better in those cases.
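As a rough illustration of the cost described above, the following sketch
counts how many 2048-doc windows cannot contain a match for a sparse clause.
The class and method names are hypothetical, and the figures used below
(maxDoc of exactly 10,000,000 and a doc freq of 500) are assumptions loosely
based on the benchmark setup, not measured values. The bound relies only on
the fact that each matching doc can make at most one window non-empty.

```java
/** Hypothetical helper; not part of Lucene. */
final class SparseWindowEstimate {
    static final int WINDOW_SIZE = 2048;

    /** Number of fixed-size windows needed to cover doc IDs [0, maxDoc). */
    static int windowCount(int maxDoc) {
        return (maxDoc + WINDOW_SIZE - 1) / WINDOW_SIZE;
    }

    /** Lower bound on the number of windows in which a clause has no match. */
    static int minEmptyWindows(int maxDoc, int docFreq) {
        return Math.max(0, windowCount(maxDoc) - docFreq);
    }
}
```

Under those assumptions there are 4883 windows, of which at least 4383 are
empty for that clause, so a scorer that visits every window does most of its
work on windows with nothing to collect.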