Murali Krishna P created LUCENE-7897:
----------------------------------------
Summary: RangeQuery optimization in IndexOrDocValuesQuery
Key: LUCENE-7897
URL: https://issues.apache.org/jira/browse/LUCENE-7897
Project: Lucene - Core
Issue Type: Improvement
Components: core/search
Affects Versions: trunk, 7.0
Reporter: Murali Krishna P
For range queries, Lucene uses either points or doc values, based on cost
estimation
(https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/search/IndexOrDocValuesQuery.html).
The scorer is chosen based on the minCost here:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/Boolean2ScorerSupplier.java#L16
However, the cost calculations for TermQuery and IndexOrDocValuesQuery seem to
carry the same weight. Essentially, the cost depends on the docFreq in the term
dictionary, the number of points visited and the number of doc values. When the
docFreq is not very restrictive, this means a lot of doc values lookups, and
using points would have been better.
The following query, with 1M matches, takes 60ms with doc values but only 27ms
with points. If I change the query to "message:*", which matches all docs, it
chooses points (since the cost is the same), but with message:xyz it chooses
doc values even though the doc frequency is 1 million, which results in many
doc values fetches.
Would it make sense to make the cost of the doc values query higher, or to use
points if the docFreq is too high for the term query (that is, find an optimum
threshold where points cost < doc values cost)?
{noformat}
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "message:xyz"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": 1498652400000,
              "lte": 1498905000000,
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  }
}
{noformat}
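
To make the proposal above concrete, here is a minimal sketch of a weighted cost comparison. It is a toy model and not Lucene code: the class RangeCostSketch, the method choose and the docValuesWeight parameter are all hypothetical names, and the range cost of 1.5M is an assumed figure for illustration (only the 1M docFreq comes from the report above). It assumes that doc values must be consulted once per document produced by the lead clause, while points must visit every document in the range.

```java
/** Toy model of picking points vs. doc values for a range clause. Not Lucene code. */
final class RangeCostSketch {

    enum Strategy { POINTS, DOC_VALUES }

    /**
     * Picks a strategy for the range clause of a conjunction.
     *
     * @param leadCost        estimated docs produced by the other clause (e.g. the term's docFreq)
     * @param pointsCost      estimated docs the range matches when evaluated through points
     * @param docValuesWeight hypothetical multiplier for the price of one doc values lookup
     *                        relative to visiting one point (1.0 = treated as equal)
     */
    static Strategy choose(long leadCost, long pointsCost, double docValuesWeight) {
        // Assumption: doc values are checked once per candidate from the lead clause.
        double docValuesCost = leadCost * docValuesWeight;
        return pointsCost <= docValuesCost ? Strategy.POINTS : Strategy.DOC_VALUES;
    }

    public static void main(String[] args) {
        long docFreq = 1_000_000L;    // from the report: message:xyz matches 1M docs
        long rangeCost = 1_500_000L;  // assumed for illustration only

        // Equal weights: the term looks cheaper, so doc values are picked.
        System.out.println(choose(docFreq, rangeCost, 1.0));
        // Doc values lookups weighted 2x: points are picked instead.
        System.out.println(choose(docFreq, rangeCost, 2.0));
    }
}
```

With a weight of 1.0 the sketch reproduces the behaviour described above, where a term with a high docFreq still leads to doc values. Raising the weight moves the crossover point toward points. The right value for such a weight would have to come from benchmarks.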