[
https://issues.apache.org/jira/browse/LUCENE-6894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006052#comment-15006052
]
Paul Elschot commented on LUCENE-6894:
--------------------------------------
Here is the benchmark output; it may be useful for future reference:
{code}
Task              QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff
HighTerm                178.20  (1.8%)                   174.50  (5.6%)  -2.1% ( -9% -  5%)
MedTerm                 641.36  (1.6%)                   630.32  (4.9%)  -1.7% ( -8% -  4%)
OrHighHigh               57.02  (5.5%)                    56.32  (6.6%)  -1.2% (-12% - 11%)
OrHighMed               107.80  (5.2%)                   106.89  (6.2%)  -0.8% (-11% - 11%)
AndHighHigh             100.02  (2.3%)                    99.34  (0.7%)  -0.7% ( -3% -  2%)
LowTerm                2477.28  (3.0%)                  2463.27  (5.5%)  -0.6% ( -8% -  8%)
AndHighMed              627.58  (1.5%)                   625.22  (1.2%)  -0.4% ( -3% -  2%)
HighPhrase               81.21  (4.2%)                    80.98  (4.3%)  -0.3% ( -8% -  8%)
OrHighLow               136.70  (3.1%)                   136.35  (2.1%)  -0.3% ( -5% -  5%)
LowPhrase               181.55  (2.2%)                   181.09  (2.0%)  -0.3% ( -4% -  4%)
MedSloppyPhrase          56.03  (2.9%)                    55.93  (3.1%)  -0.2% ( -5% -  5%)
MedSpanNear              52.77  (1.7%)                    52.68  (2.6%)  -0.2% ( -4% -  4%)
LowSloppyPhrase         106.15  (2.9%)                   106.01  (3.1%)  -0.1% ( -6% -  6%)
MedPhrase                39.38  (3.8%)                    39.36  (3.3%)  -0.1% ( -6% -  7%)
Fuzzy1                  137.14  (2.1%)                   137.06  (1.5%)  -0.1% ( -3% -  3%)
Fuzzy2                   79.28  (1.9%)                    79.25  (1.5%)  -0.0% ( -3% -  3%)
LowSpanNear              94.38  (1.7%)                    94.35  (2.8%)  -0.0% ( -4% -  4%)
OrNotHighMed            444.12  (1.7%)                   444.36  (1.2%)   0.1% ( -2% -  2%)
AndHighLow             1878.59  (2.0%)                  1880.20  (1.9%)   0.1% ( -3% -  4%)
Respell                 106.47  (1.9%)                   106.62  (1.7%)   0.1% ( -3% -  3%)
OrNotHighLow           1831.85  (1.7%)                  1834.68  (1.3%)   0.2% ( -2% -  3%)
OrNotHighHigh            69.75  (1.6%)                    69.91  (1.4%)   0.2% ( -2% -  3%)
HighSpanNear             36.38  (2.8%)                    36.47  (3.8%)   0.3% ( -6% -  7%)
HighSloppyPhrase         45.58  (3.6%)                    45.70  (3.5%)   0.3% ( -6% -  7%)
OrHighNotLow             65.78  (7.0%)                    66.03  (8.4%)   0.4% (-14% - 16%)
Prefix3                 448.85  (3.5%)                   450.67  (3.8%)   0.4% ( -6% -  8%)
Wildcard                114.35  (4.8%)                   115.02  (4.6%)   0.6% ( -8% - 10%)
IntNRQ                   23.48  (7.4%)                    23.71  (7.7%)   1.0% (-13% - 17%)
PKLookup                360.70  (1.7%)                   364.91  (3.1%)   1.2% ( -3% -  6%)
OrHighNotMed            178.99  (7.2%)                   181.91  (8.2%)   1.6% (-12% - 18%)
OrHighNotHigh            39.78  (7.1%)                    40.63  (7.5%)   2.1% (-11% - 18%)
{code}
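For readers reusing these numbers, here is a small sketch of how the "Pct diff" column and its bracketed range can be reproduced from the QPS and StdDev columns. The range formula is an inference from the figures above (it matches the HighTerm and AndHighHigh rows when the result is truncated to a whole percent), not something the benchmark output itself states. The class and method names are hypothetical.

```java
/** Hypothetical helper for reading the benchmark table; not part of any benchmark tool. */
final class PctDiffSketch {
    private PctDiffSketch() {}

    /** Relative change of the modified version against the baseline, in percent. */
    static double pctDiff(double baseQps, double cmpQps) {
        return 100.0 * (cmpQps - baseQps) / baseQps;
    }

    /** Assumed lower bound: modified version one StdDev low, baseline one StdDev high. */
    static double worstCase(double baseQps, double baseSdPct, double cmpQps, double cmpSdPct) {
        double baseHigh = baseQps * (1.0 + baseSdPct / 100.0);
        double cmpLow = cmpQps * (1.0 - cmpSdPct / 100.0);
        return 100.0 * (cmpLow - baseHigh) / baseHigh;
    }

    /** Assumed upper bound: modified version one StdDev high, baseline one StdDev low. */
    static double bestCase(double baseQps, double baseSdPct, double cmpQps, double cmpSdPct) {
        double baseLow = baseQps * (1.0 - baseSdPct / 100.0);
        double cmpHigh = cmpQps * (1.0 + cmpSdPct / 100.0);
        return 100.0 * (cmpHigh - baseLow) / baseLow;
    }

    public static void main(String[] args) {
        // HighTerm row: baseline 178.20 (1.8%), modified 174.50 (5.6%)
        System.out.printf("diff=%.1f%% worst=%.1f%% best=%.1f%%%n",
                pctDiff(178.20, 174.50),
                worstCase(178.20, 1.8, 174.50, 5.6),
                bestCase(178.20, 1.8, 174.50, 5.6));
    }
}
```

Under this reading, every range in the table spans zero, so none of the differences stands out from the run-to-run noise.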
> Improve DISI.cost() by assuming independence for match probabilities
> --------------------------------------------------------------------
>
> Key: LUCENE-6894
> URL: https://issues.apache.org/jira/browse/LUCENE-6894
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Reporter: Paul Elschot
> Priority: Minor
> Attachments: LUCENE-6894.patch, LUCENE-6894.patch
>
>
> The DocIdSetIterator.cost() method returns an estimate of the number of
> matching docs. Currently conjunctions use the minimum cost and disjunctions
> use the sum of the costs; both estimates are too high.
> The probability of a match is estimated by dividing the available cost() by
> the number of docs in a segment.
> The conjunction probability is then the product of the inputs, and the
> disjunction probability follows from De Morgan's rule:
> "not (A and B)" is the same as "(not A) or (not B)"
> with the probability for "not" computed as 1 minus the input probability.
> The assumed independence normally does not hold. However, the cost()
> results are only used to order the input DISIs/Scorers for optimization, and
> for that I expect this assumption to work well.
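The estimate described in the issue can be sketched as follows. This is a minimal illustration of the arithmetic only, not the attached patch: the class `IndependentCostSketch`, its method names, and the use of a `maxDoc` parameter for the number of docs in a segment are all assumptions made for the example.

```java
/** Hypothetical helper, not part of Lucene: cost estimates under assumed independence. */
final class IndependentCostSketch {
    private IndependentCostSketch() {}

    /** Match probability of one clause: cost / maxDoc, clamped to [0, 1]. */
    static double probability(long cost, int maxDoc) {
        if (maxDoc <= 0) {
            return 0.0;
        }
        return Math.min(1.0, Math.max(0.0, (double) cost / maxDoc));
    }

    /** Conjunction: product of the clause probabilities, scaled back to a doc count. */
    static long conjunctionCost(int maxDoc, long... costs) {
        double all = 1.0;
        for (long c : costs) {
            all *= probability(c, maxDoc);
        }
        return Math.round(all * maxDoc);
    }

    /** Disjunction via De Morgan: 1 minus the probability that no clause matches. */
    static long disjunctionCost(int maxDoc, long... costs) {
        double none = 1.0;
        for (long c : costs) {
            none *= 1.0 - probability(c, maxDoc);
        }
        return Math.round((1.0 - none) * maxDoc);
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        long a = 100_000L;
        long b = 200_000L;
        System.out.println(conjunctionCost(maxDoc, a, b)); // 20000 (minimum would give 100000)
        System.out.println(disjunctionCost(maxDoc, a, b)); // 280000 (sum would give 300000)
    }
}
```

Both estimates come out lower than the minimum and the sum respectively, which is the direction the issue argues for. Since the results are only used to order the inputs, the absolute values matter less than their relative ranking.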