Hi

We have modified our search query around the most restrictive dataset, and as
expected the search performance increases.
BUT, if we increase the total data volume, the search performance decreases,
despite the same query and the same restrictive dataset.

Example:
Total Dataset: 3,025,000 documents (3 million + 25K)
-----------------------------------------
General Dataset: 3 Million
    Fields:
    Field: state, Value: Available
    Field: type, Value: Gold

Restrictive Dataset: 25K
    Fields:
    Field: state, Value: Available
    Field: type, Value: Silver
--------------------------------------------

So, out of the 3,025,000 documents, only 25K have the field "type" with the
value "Silver".

Search Query:  "state:Available AND type:Silver"
As expected, the performance increases from 7 TPS to 130 TPS.
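
For reference, here is a minimal sketch of how that query can be built with
Lucene's Java API, assuming the "state" and "type" fields are indexed as
exact-match StringField values and a hypothetical index path:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class RestrictiveClauseSearch {
        public static void main(String[] args) throws Exception {
            // hypothetical index location
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);

                // Two MUST (AND) clauses: type:Silver matches ~25K docs,
                // state:Available matches ~3 million docs.
                BooleanQuery query = new BooleanQuery.Builder()
                        .add(new TermQuery(new Term("state", "Available")), Occur.MUST)
                        .add(new TermQuery(new Term("type", "Silver")), Occur.MUST)
                        .build();

                TopDocs top = searcher.search(query, 10);
                System.out.println("total hits: " + top.totalHits);
            }
        }
    }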

BUT, as we add 1 million more documents to the general dataset, the performance
decreases to 58 TPS.
This is strange and contradicts the following comment:

"search run faster because Lucene will take the most restrictive clause and
use that to 'drive' the iteration of matching documents to the other
clauses"

The complete comment is in this thread:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201701.mbox/raw/%3CCAL8Pwka4RC3c%2B%3D9wapH2%2B_JKSimnHmgRs3Eq6O_e7HrF6iBXPw%40mail.gmail.com%3E/


Why does the performance decrease as the data volume increases, despite the
same query and the same restrictive dataset?


Regards
Rajnish



On Fri, Jan 6, 2017 at 6:09 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> The cost() method on DocIdSetIterator is responsible for telling
> BooleanQuery how costly that clause is, and how cost() is implemented
> varies by query.
>
> For the multi-term queries, like WildcardQuery, Lucene will first
> visit all matched terms (during the Query.rewrite phase), and rewrite
> the query either into a disjunction (SHOULD of the N terms), or it
> will, per segment, visit all docs for all matching terms, setting them
> in a sparse or dense bitset, recording the cost as the number of
> documents.
>
> But there is work underway now to try to improve the multi-term query
> cases so that we don't go and do all that up-front work (visiting all
> terms, and all docs matching each term) when another clause in the
> boolean query is more restrictive:
> https://issues.apache.org/jira/browse/LUCENE-7055
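
For illustration, a minimal sketch (assuming a recent Lucene version, e.g.
8.x, where createWeight takes a ScoreMode; older versions use a slightly
different signature) of how each clause exposes its cost() estimate, which
BooleanQuery uses to pick the cheapest clause to lead with:

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreMode;
    import org.apache.lucene.search.Scorer;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.Weight;

    public class ClauseCosts {
        // Prints the summed per-segment cost() of each clause; for a TermQuery
        // this is roughly the number of documents containing the term.
        static void printClauseCosts(IndexSearcher searcher) throws Exception {
            Query[] clauses = {
                    new TermQuery(new Term("state", "Available")),
                    new TermQuery(new Term("type", "Silver"))
            };
            for (Query clause : clauses) {
                Weight weight = searcher.createWeight(
                        searcher.rewrite(clause), ScoreMode.COMPLETE_NO_SCORES, 1f);
                long cost = 0;
                for (LeafReaderContext leaf : searcher.getIndexReader().leaves()) {
                    Scorer scorer = weight.scorer(leaf); // null if no matches in segment
                    if (scorer != null) {
                        DocIdSetIterator it = scorer.iterator();
                        cost += it.cost();
                    }
                }
                System.out.println(clause + " -> estimated cost " + cost);
            }
        }
    }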
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Jan 6, 2017 at 2:28 AM, Rajnish kamboj <rajnishk7.i...@gmail.com>
> wrote:
> > OK, got it
> >
> > One thing I still need to know (which is not clear to me):
> > How does Lucene determine the most restrictive clause?
> >
> > Correct me if I am wrong in my understanding (in the abstract):
> > 1. During indexing, Lucene keeps a document count for every indexed term.
> > 2. During search, it first checks which condition matches fewer documents
> > before actually iterating.
> > 3. Then, it iterates that restricted set against the other conditions.
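
For illustration, the per-term document count described in step 1 is available
at search time as IndexReader.docFreq(Term); a minimal sketch, assuming the
field names from the example above:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    final class DocFreqExample {
        // Number of documents containing each exact term; a rough proxy for
        // how restrictive the corresponding clause is.
        static void printDocFreqs(IndexReader reader) throws Exception {
            System.out.println("state:Available -> "
                    + reader.docFreq(new Term("state", "Available"))); // ~3 million
            System.out.println("type:Silver     -> "
                    + reader.docFreq(new Term("type", "Silver")));     // ~25K
        }
    }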
> >
> > If the above is correct, then how does Lucene determine the most
> > restrictive clause in the case of wildcard conditions?
> > Also, if Lucene first checks for the most restrictive clause and then
> > iterates to match documents against the other clauses, when does the
> > merging of documents happen?
> >
> > Coming to my main question, which I originally asked the Lucene community:
> > Is there a search performance benchmark per Lucene version, so that I can
> > benchmark my application throughput?
> >
> >
> > Regards
> > Rajnish
> >
> >> On Tue, Jan 3, 2017 at 5:12 PM, Michael McCandless
> >> <luc...@mikemccandless.com> wrote:
> >>>
> >>> When you add MUST sub-clauses to a BooleanQuery  (AND to the query
> >>> parsers) it can make the search run faster because Lucene will take
> >>> the most restrictive clause and use that to "drive" the iteration of
> >>> matching documents to the other clauses, allowing those other clauses
> >>> to iterate much faster than they would otherwise require if they were
> >>> not AND'd.
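
As an illustration of that "driving" behaviour, a simplified sketch (not
Lucene's actual ConjunctionDISI code) of intersecting two sorted doc-ID
streams, where the most restrictive clause leads and the other clause only
has to skip forward to the candidates the lead proposes:

    final class ConjunctionSketch {
        // Simplified model of an AND over two sorted doc-ID lists.
        // 'lead' is the clause with the smallest cost(); 'other' only needs to
        // advance (skip forward) to each candidate the lead proposes.
        static void intersect(int[] lead, int[] other) {
            int j = 0;
            for (int doc : lead) {                       // iterate the small set only
                while (j < other.length && other[j] < doc) {
                    j++;                                 // advance() / skip forward
                }
                if (j == other.length) {
                    return;                              // other clause exhausted
                }
                if (other[j] == doc) {
                    System.out.println("match: doc " + doc);
                }
            }
        }
    }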
> >>>
> >>> Mike McCandless
> >>>
> >>> http://blog.mikemccandless.com
> >>>
> >>>
> >>> On Tue, Jan 3, 2017 at 6:33 AM, Rajnish kamboj <
> rajnishk7.i...@gmail.com>
> >>> wrote:
> >>> > The answer is not clear.
> >>> >
> >>> > Suppose I have the following query and I want 10 records:
> >>> > Condition1 AND Condition2 AND Condition3
> >>> >
> >>> > As per my understanding, Lucene will first evaluate all conditions
> >>> > separately and then merge the documents as per the AND/OR clauses.
> >>> > Finally, it will return me 10 records.
> >>> >
> >>> > So, if I add one more condition, then it will add to the search time and
> >>> > merge time and hence increase latency, which results in decreased
> >>> > throughput.
> >>> >
> >>> >
> >>> > Also, is there a search performance benchmark per Lucene version?
> >>> >
> >>> >
> >>> > Regards
> >>> > Rajnish
> >>> >
> >>> > On Tuesday 3 January 2017, Michael Wilkowski <m...@silenteight.com>
> >>> > wrote:
> >>> >
> >>> >> My guess: more conditions = fewer documents to score and sort to
> >>> >> return.
> >>> >>
> >>> >> On Mon, Jan 2, 2017 at 7:23 PM, Rajnish kamboj
> >>> >> <rajnishk7.i...@gmail.com>
> >>> >> wrote:
> >>> >>
> >>> >> > Hi
> >>> >> >
> >>> >> > Is there any Lucene performance benchmark against a certain set of
> >>> >> > data?
> >>> >> > [i.e. Are there any stats for the search throughput which Lucene can
> >>> >> > provide for a certain amount of data?]
> >>> >> >
> >>> >> > Search throughput Example:
> >>> >> > Max. 200 TPS for 50K data on Lucene 5.3.1 on RHEL version x (with
> >>> >> > SSD)
> >>> >> > Max. 150 TPS for 100K data on Lucene 5.3.1 on RHEL version x (with
> >>> >> > SSD)
> >>> >> > Max. 300 TPS for 50K data on Lucene 6.0.0 on RHEL version x (with
> >>> >> > SSD)
> >>> >> > etc.
> >>> >> >
> >>> >> > Also, does the index size matter for search throughput?
> >>> >> >
> >>> >> > Our observation:
> >>> >> > When we increase the data size (hence the index size), the search
> >>> >> > throughput decreases.
> >>> >> > When we add more AND conditions, the search throughput increases.
> >>> >> > Why?
> >>> >> > Ideally, if we add more conditions, then Lucene should have more work
> >>> >> > to do (including merging) and the throughput should decrease, yet the
> >>> >> > throughput increases.
> >>> >> >
> >>> >> >
> >>> >> > Regards
> >>> >> > Rajnish
> >>> >> >
> >>> >>
> >>
> >>
> >
>
