Re: Lucene performance benchmark | search throughput

Michael McCandless Fri, 06 Jan 2017 04:40:28 -0800

The cost() method on DocIdSetIterator is responsible for telling
BooleanQuery how costly each clause is, and how cost() is implemented
varies by query.
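
To make the role of cost() concrete, here is a minimal sketch in Python
(not Lucene's actual Java code) of a conjunction that lets the cheapest
clause lead and asks the other clauses to skip forward to each candidate.
The names PostingsIterator and conjunction are hypothetical, and cost is
assumed to be simply the length of a sorted doc-id list.

```python
import bisect

NO_MORE_DOCS = float("inf")  # sentinel for an exhausted iterator (illustrative)


class PostingsIterator:
    """Hypothetical stand-in for a DocIdSetIterator over sorted doc ids."""

    def __init__(self, doc_ids):
        self.doc_ids = sorted(doc_ids)
        self.pos = -1
        self.advance_calls = 0  # counts how often this clause was asked to skip

    def cost(self):
        # Assumption: cost is just the number of postings for this clause.
        return len(self.doc_ids)

    def doc_id(self):
        if 0 <= self.pos < len(self.doc_ids):
            return self.doc_ids[self.pos]
        return NO_MORE_DOCS

    def next_doc(self):
        self.pos += 1
        return self.doc_id()

    def advance(self, target):
        # Skip to the first doc id >= target without moving backwards.
        self.advance_calls += 1
        self.pos = bisect.bisect_left(self.doc_ids, target, max(self.pos, 0))
        return self.doc_id()


def conjunction(iterators):
    """Return doc ids matched by every iterator (an AND of all clauses)."""
    # The clause with the lowest cost leads the iteration.
    lead, *others = sorted(iterators, key=lambda it: it.cost())
    matches = []
    doc = lead.next_doc()
    while doc != NO_MORE_DOCS:
        for other in others:
            other_doc = other.advance(doc)
            if other_doc != doc:
                # Mismatch: move the lead up to the other clause's position.
                doc = lead.advance(other_doc)
                break
        else:
            # Every clause landed on the same doc id.
            matches.append(doc)
            doc = lead.next_doc()
    return matches
```

In this sketch a clause with three postings leads, so a clause with
hundreds of postings is only asked to advance a handful of times
instead of being iterated in full.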


For the multi-term queries, like WildcardQuery, Lucene will first
visit all matched terms (during the Query.rewrite phase), and rewrite
the query either into a disjunction (SHOULD of the N terms), or it
will, per segment, visit all docs for all matching terms, setting them
in a sparse or dense bitset, recording the cost as the number of
documents.

But there is work underway now to try to improve the multi-term query
cases so that we don't go and do all that up-front work (visiting all
terms, and all docs matching each term) when another clause in the
boolean query is more restrictive:
https://issues.apache.org/jira/browse/LUCENE-7055
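
As a rough illustration of the up-front multi-term work described above,
here is a minimal Python sketch (not Lucene's implementation). The
function rewrite_wildcard, the dict-based term dictionary, the
max_disjunction_terms threshold, and the use of summed per-term counts
as the disjunction's cost are all assumptions made for the example; only
"cost is the number of documents in the bitset" comes from the
description above.

```python
from fnmatch import fnmatchcase


def rewrite_wildcard(term_postings, pattern, max_disjunction_terms=16):
    """Sketch of rewriting a wildcard pattern against one segment.

    term_postings maps each indexed term to a list of doc ids.
    max_disjunction_terms is a hypothetical threshold for this example.
    """
    # Step 1: visit all terms and keep the ones matching the pattern.
    matched = sorted(t for t in term_postings if fnmatchcase(t, pattern))

    if len(matched) <= max_disjunction_terms:
        # Few terms: treat the query as a SHOULD of the matched terms.
        # Assumption: cost is the sum of per-term doc counts, which is an
        # upper bound because one doc may contain several matched terms.
        cost = sum(len(term_postings[t]) for t in matched)
        return {"kind": "disjunction", "terms": matched, "cost": cost}

    # Many terms: visit every doc of every matched term and set its bit.
    bits = 0
    for term in matched:
        for doc in term_postings[term]:
            bits |= 1 << doc
    # The cost is the number of distinct documents in the bitset.
    return {"kind": "bitset", "bits": bits, "cost": bin(bits).count("1")}
```

Either way, every matching term (and in the bitset case every matching
doc) is visited before the first hit can be returned, which is the work
that is wasted when another clause is far more restrictive.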

Mike McCandless

http://blog.mikemccandless.com


On Fri, Jan 6, 2017 at 2:28 AM, Rajnish kamboj <rajnishk7.i...@gmail.com> wrote:
> OK, got it
>
> One thing I still need to know (it is not clear to me):
> How does Lucene calculate the most restrictive clause?
>
> Correct me if my understanding is wrong (in the abstract):
> 1. During indexing, Lucene keeps a document count for every indexed
> item.
> 2. During search, it first checks which condition has the lowest
> document count before actually iterating.
> 3. Then it iterates that restricted set against the other conditions.
>
> If the above is correct, how does Lucene calculate the most restrictive
> clause for wildcard conditions?
> Also, if Lucene first checks for the most restrictive clause and then
> iterates to match documents against the other clauses,
>         when does the merging of documents happen?
>
> Coming to my main question, which I asked the Lucene community:
> What is the search performance benchmark for each Lucene version, so that I
> can benchmark my application's throughput?
>
>
> Regards
> Rajnish
>
>> On Tue, Jan 3, 2017 at 5:12 PM, Michael McCandless
>> <luc...@mikemccandless.com> wrote:
>>>
>>> When you add MUST sub-clauses to a BooleanQuery (AND in the query
>>> parsers) it can make the search run faster because Lucene will take
>>> the most restrictive clause and use that to "drive" the iteration of
>>> matching documents to the other clauses, allowing those other clauses
>>> to iterate much faster than they would if they were not AND'd.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Tue, Jan 3, 2017 at 6:33 AM, Rajnish kamboj <rajnishk7.i...@gmail.com>
>>> wrote:
>>> > The answer is not clear.
>>> >
>>> > Suppose I have the following query and I want 10 records:
>>> > Condition1 AND Condition2 AND Condition3
>>> >
>>> > As I understand it, Lucene will first evaluate all conditions
>>> > separately and then merge the documents according to the AND/OR clauses.
>>> > Finally, it will return 10 records to me.
>>> >
>>> > So if I add one more condition, it will add to search time and merge
>>> > time and hence increase latency, which decreases throughput.
>>> >
>>> >
>>> > Also, what is the search performance benchmark for each Lucene version?
>>> >
>>> >
>>> > Regards
>>> > Rajnish
>>> >
>>> > On Tuesday 3 January 2017, Michael Wilkowski <m...@silenteight.com>
>>> > wrote:
>>> >
>>> >> My guess: more conditions = fewer documents to score and sort to
>>> >> return.
>>> >>
>>> >> On Mon, Jan 2, 2017 at 7:23 PM, Rajnish kamboj <rajnishk7.i...@gmail.com>
>>> >> wrote:
>>> >>
>>> >> > Hi
>>> >> >
>>> >> > Is there any Lucene performance benchmark against a certain set of
>>> >> > data?
>>> >> > [i.e., are there any search throughput stats that Lucene can
>>> >> > provide for a given data set?]
>>> >> >
>>> >> > Search throughput example:
>>> >> > Max. 200 TPS for 50K data on Lucene 5.3.1 on RHEL version x (with SSD)
>>> >> > Max. 150 TPS for 100K data on Lucene 5.3.1 on RHEL version x (with SSD)
>>> >> > Max. 300 TPS for 50K data on Lucene 6.0.0 on RHEL version x (with SSD)
>>> >> > etc.
>>> >> >
>>> >> > Also, does the index size matter for search throughput?
>>> >> >
>>> >> > Our observation:
>>> >> > When we increase the data size (and hence the index size), search
>>> >> > throughput decreases.
>>> >> > When we add more AND conditions, search throughput increases. Why?
>>> >> > Ideally, if we add more conditions, Lucene should have more work to
>>> >> > do (including merging) and throughput should decrease, yet
>>> >> > throughput increases.
>>> >> >
>>> >> >
>>> >> > Regards
>>> >> > Rajnish
>>> >> >
>>> >>
>>
>>
>
