Re: Lucene performance benchmark | search throughput

Michael McCandless Tue, 17 Jan 2017 04:42:53 -0800

In your 2nd test, was the number of hits still 25K, even though you
added another 1M docs to the "general" data set?  If not, then the
query needed to do more work and will run slower.


If so, the query still does need to do more work in order to skip over
the "gold" documents: that skipping (the advance method in the
scorers/postings enums) definitely has some cost, and having to skip
over more documents will mean more cost.

That said, a drop from 130 TPS to 58 TPS after increasing the index by
only 33% is odd.
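
To make the skipping cost concrete, here is a minimal sketch in plain Java
(no Lucene dependency) of a two-clause conjunction driven by the cheaper
clause. Everything in it is an assumption for illustration: the class
LeapfrogSketch, the toy Postings type, the even spacing of "silver" docs
among the "gold" ones, and above all the linear scan inside advance().
Real Lucene postings use skip data, so the true cost does not grow this
directly with index size; the sketch only shows that the hit count stays
fixed while the amount skipped grows.

```java
public class LeapfrogSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Toy stand-in for one clause's postings: a sorted array of doc IDs. */
    static final class Postings {
        private final int[] docs;
        private int pos = -1;
        long skipped = 0; // entries passed over inside advance()

        Postings(int[] sortedDocs) { this.docs = sortedDocs; }

        /** Estimated clause cost: here simply the number of matching docs. */
        long cost() { return docs.length; }

        int docID() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        int nextDoc() { pos++; return docID(); }

        /** Move to the first doc >= target (linear scan: a toy assumption). */
        int advance(int target) {
            if (pos < 0) pos = 0;
            while (pos < docs.length && docs[pos] < target) { pos++; skipped++; }
            return docID();
        }
    }

    /** Returns {hits, entries skipped} for "all docs" AND "every stride-th doc". */
    static long[] run(int totalDocs, int silverDocs) {
        int stride = totalDocs / silverDocs;
        int[] all = new int[totalDocs];          // state:Available matches every doc
        for (int i = 0; i < totalDocs; i++) all[i] = i;
        int[] silver = new int[silverDocs];      // type:Silver, spread evenly
        for (int i = 0; i < silverDocs; i++) silver[i] = i * stride;

        Postings a = new Postings(all);
        Postings b = new Postings(silver);
        // The clause with the lowest cost drives the iteration.
        Postings lead = a.cost() <= b.cost() ? a : b;
        Postings other = (lead == a) ? b : a;

        long hits = 0;
        int doc = lead.nextDoc();
        while (doc != NO_MORE_DOCS) {
            int otherDoc = other.docID() < doc ? other.advance(doc) : other.docID();
            if (otherDoc == doc) {
                hits++;
                doc = lead.nextDoc();
            } else if (otherDoc == NO_MORE_DOCS) {
                break;
            } else {
                doc = lead.advance(otherDoc);
            }
        }
        return new long[] { hits, lead.skipped + other.skipped };
    }

    public static void main(String[] args) {
        for (int total : new int[] { 3_025_000, 4_025_000 }) {
            long[] r = run(total, 25_000);
            System.out.println(total + " docs: hits=" + r[0]
                    + ", entries skipped=" + r[1]);
        }
    }
}
```

Running main with the two index sizes from this thread should report the
same number of hits for both, with more entries skipped for the larger
index.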

Mike McCandless

http://blog.mikemccandless.com


On Tue, Jan 17, 2017 at 5:33 AM, Rajnish kamboj
<rajnishk7.i...@gmail.com> wrote:
> Hi
>
> We have modified our search query around the most restrictive dataset and,
> as expected, the search performance increased.
> BUT if we increase the total data volume, our search performance decreases,
> despite the same query and restrictive dataset.
>
> Example:
> Total Dataset: 3 Million 25K
> -----------------------------------------
> General Dataset: 3 Million
>     Fields:
>     Field: state, Value: Available
>     Field: type, Value: Gold
>
> Restrictive Dataset: 25K
>     Fields:
>     Field: state, Value: Available
>     Field: type, Value: Silver
> --------------------------------------------
>
> So, out of 3 Million 25K documents, only 25K have the field "type" set to
> "Silver".
>
> Search Query:  "state:Available AND type:Silver"
> As expected the performance increases from 7 TPS to 130 TPS.
>
> BUT, as we add 1 Million more to General Dataset, the performance decreases
> to 58 TPS.
> This is strange and contradicts the following comment:
>
> "search run faster because Lucene will take the most restrictive clause and
> use that to 'drive' the iteration of matching documents to the other
> clauses"
>
> Complete comments at this link:
> "http://mail-archives.apache.org/mod_mbox/lucene-java-user/201701.mbox/raw/%3CCAL8Pwka4RC3c%2B%3D9wapH2%2B_JKSimnHmgRs3Eq6O_e7HrF6iBXPw%40mail.gmail.com%3E/"
>
>
> Why does the performance decrease as the data volume increases, despite the
> same query and restrictive dataset?
>
>
> Regards
> Rajnish
>
>
>
> On Fri, Jan 6, 2017 at 6:09 PM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>>
>> The cost() method on DocIdSetIterator is responsible for telling
>> BooleanQuery how costly that clause is, and how cost() is implemented
>> varies by query.
>>
>> For the multi-term queries, like WildcardQuery, Lucene will first
>> visit all matched terms (during the Query.rewrite phase), and rewrite
>> the query either into a disjunction (SHOULD of the N terms), or it
>> will, per segment, visit all docs for all matching terms, setting them
>> in a sparse or dense bitset, recording the cost as the number of
>> documents.
>>
>> But there is work underway now to try to improve the multi-term query
>> cases so that we don't go and do all that up-front work (visiting all
>> terms, and all docs matching each term) when another clause in the
>> boolean query is more restrictive:
>> https://issues.apache.org/jira/browse/LUCENE-7055
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, Jan 6, 2017 at 2:28 AM, Rajnish kamboj <rajnishk7.i...@gmail.com>
>> wrote:
>> > OK, got it
>> >
>> > One thing I still need to know (which is not clear to me):
>> > How does Lucene calculate the most restrictive clause?
>> >
>> > Correct me if I am wrong in my understanding (in the abstract):
>> > 1. During indexing, Lucene keeps a document count for every indexed
>> > item.
>> > 2. During search, it first checks which condition has the lowest
>> > document count before actually iterating.
>> > 3. Then it iterates that restricted set against the other conditions.
>> >
>> > If the above is correct, then how does Lucene calculate the most
>> > restrictive clause in the case of wildcard conditions?
>> > Also, if Lucene first checks for the most restrictive clause and then
>> > iterates to match documents against the other clauses, when does the
>> > merging of documents happen?
>> >
>> > Coming to my main question, the one I originally asked the Lucene
>> > community: what is the search performance benchmark for each Lucene
>> > version, so that I can benchmark my application's throughput?
>> >
>> >
>> > Regards
>> > Rajnish
>> >
>> > On Tue, Jan 3, 2017 at 6:09 PM, Rajnish kamboj
>> > <rajnishk7.i...@gmail.com>
>> > wrote:
>> >>
>> >> OK, got it
>> >>
>> >> One thing I still need to know (which is not clear to me):
>> >> How does Lucene calculate the most restrictive clause?
>> >>
>> >> Correct me if I am wrong in my understanding (in the abstract):
>> >> 1. During indexing, Lucene keeps a document count for every indexed
>> >> item.
>> >> 2. During search, it first checks which condition has the lowest
>> >> document count before actually iterating.
>> >> 3. Then it iterates that restricted set against the other conditions.
>> >>
>> >> If the above is correct, then how does Lucene calculate the most
>> >> restrictive clause in the case of wildcard conditions?
>> >> Also, if Lucene first checks for the most restrictive clause and then
>> >> iterates to match documents against the other clauses, when does the
>> >> merging of documents happen?
>> >>
>> >> Coming to my main question, the one I originally asked the Lucene
>> >> community: what is the search performance benchmark for each Lucene
>> >> version, so that I can benchmark my application's throughput?
>> >>
>> >>
>> >>
>> >> On Tue, Jan 3, 2017 at 5:12 PM, Michael McCandless
>> >> <luc...@mikemccandless.com> wrote:
>> >>>
>> >>> When you add MUST sub-clauses to a BooleanQuery  (AND to the query
>> >>> parsers) it can make the search run faster because Lucene will take
>> >>> the most restrictive clause and use that to "drive" the iteration of
>> >>> matching documents to the other clauses, allowing those other clauses
>> >>> to iterate much faster than they would otherwise require if they were
>> >>> not AND'd.
>> >>>
>> >>> Mike McCandless
>> >>>
>> >>> http://blog.mikemccandless.com
>> >>>
>> >>>
>> >>> On Tue, Jan 3, 2017 at 6:33 AM, Rajnish kamboj
>> >>> <rajnishk7.i...@gmail.com>
>> >>> wrote:
>> >>> > The answer is not clear.
>> >>> >
>> >>> > Suppose I have following query and I want 10 records.
>> >>> > Condition1 AND Condition2 AND Condition3
>> >>> >
>> >>> > As per my understanding, Lucene will first evaluate all conditions
>> >>> > separately and then merge the documents as per the AND/OR clauses.
>> >>> > Finally, it will return me 10 records.
>> >>> >
>> >>> > So, if I add one more condition, it will add to the search time and
>> >>> > merge time and hence increase latency, which results in decreased
>> >>> > throughput.
>> >>> >
>> >>> >
>> >>> > Also, what is the search performance benchmark for each Lucene
>> >>> > version?
>> >>> >
>> >>> >
>> >>> > Regards
>> >>> > Rajnish
>> >>> >
>> >>> > On Tuesday 3 January 2017, Michael Wilkowski <m...@silenteight.com>
>> >>> > wrote:
>> >>> >
>> >>> >> My guess: more conditions = fewer documents to score and sort to
>> >>> >> return.
>> >>> >>
>> >>> >> On Mon, Jan 2, 2017 at 7:23 PM, Rajnish kamboj
>> >>> >> <rajnishk7.i...@gmail.com>
>> >>> >> wrote:
>> >>> >>
>> >>> >> > Hi
>> >>> >> >
>> >>> >> > Is there any Lucene performance benchmark against a certain set of
>> >>> >> > data? [i.e., are there any search throughput stats that Lucene can
>> >>> >> > provide for a given data set?]
>> >>> >> >
>> >>> >> > Search throughput Example:
>> >>> >> > Max. 200 TPS for 50K data on Lucene 5.3.1 on RHEL version x (with SSD)
>> >>> >> > Max. 150 TPS for 100K data on Lucene 5.3.1 on RHEL version x (with SSD)
>> >>> >> > Max. 300 TPS for 50K data on Lucene 6.0.0 on RHEL version x (with SSD)
>> >>> >> > etc.
>> >>> >> >
>> >>> >> > Also, does the index size matter for search throughput?
>> >>> >> >
>> >>> >> > Our observation:
>> >>> >> > When we increase the data size (hence index size), the search
>> >>> >> > throughput decreases.
>> >>> >> > When we add more AND conditions, the search throughput increases.
>> >>> >> > Why?
>> >>> >> > Ideally, if we add more conditions, Lucene should have more work
>> >>> >> > to do (including merging) and the throughput should decrease, but
>> >>> >> > the throughput increases?
>> >>> >> >
>> >>> >> >
>> >>> >> > Regards
>> >>> >> > Rajnish
>> >>> >> >
>> >>> >>
>> >>
>> >>
>> >
>
>

