Re: unexpected performance TermsQuery Occur.SHOULD vs TermsInSetQuery?

Rob Audenaerde Tue, 13 Oct 2020 03:51:20 -0700

I reduced the benchmark as far as I could and now get the results below:
TermInSetQuery is a lot slower than the TermQuery/SHOULD variant (roughly
42% lower throughput).



Benchmark                             Mode   Cnt       Score       Error  Units
BenchmarkOrQuery.benchmarkTerms       thrpt    5  190820.510 ± 16667.411  ops/s
BenchmarkOrQuery.benchmarkTermsInSet  thrpt    5  110548.345 ±  7490.169  ops/s


@Fork(1)
@Measurement(iterations = 5, time = 10)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 3, time = 1)
@Benchmark
public void benchmarkTerms(final MyState myState) {
    try {
        final IndexSearcher searcher = myState.matchedReaders.getIndexSearcher();
        final BooleanQuery.Builder b = new BooleanQuery.Builder();

        for (final String role : myState.user.getAdditionalRoles()) {
            b.add(new TermQuery(new Term(roles, new BytesRef(role))),
                    BooleanClause.Occur.SHOULD);
        }
        searcher.count(b.build());

    } catch (final IOException e) {
        e.printStackTrace();
    }
}

@Fork(1)
@Measurement(iterations = 5, time = 10)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 3, time = 1)
@Benchmark
public void benchmarkTermsInSet(final MyState myState) {
    try {
        final IndexSearcher searcher = myState.matchedReaders.getIndexSearcher();
        final Set<BytesRef> roles = myState.user.getAdditionalRoles().stream()
                .map(BytesRef::new)
                .collect(Collectors.toSet());
        searcher.count(new TermInSetQuery(BenchmarkOrQuery.roles, roles));

    } catch (final IOException e) {
        e.printStackTrace();
    }
}
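
For anyone who wants to reason about the data without an index, here is a
minimal plain-Java sketch of the role layout described in the original
message quoted below. It is not the benchmark's real setup code. It assumes
that "(2,2,1000,100)" means each document draws exactly one value uniformly
from each of four groups of those sizes, and that "(1,1,2,1)" is the number
of values the user holds per group. The class name RolesDataSketch and the
"g<group>_<value>" role names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class RolesDataSketch {
    // Assumption: one role per group, groups have these many unique values.
    static final int[] GROUP_SIZES = {2, 2, 1000, 100};
    // Assumption: how many values of each group the user holds.
    static final int[] USER_PICKS = {1, 1, 2, 1};

    // Hypothetical naming scheme for role values.
    static String role(final int group, final int value) {
        return "g" + group + "_" + value;
    }

    // One document: one uniformly drawn value per group (4 roles in total).
    static List<String> randomDocRoles(final Random rnd) {
        final List<String> roles = new ArrayList<>();
        for (int g = 0; g < GROUP_SIZES.length; g++) {
            roles.add(role(g, rnd.nextInt(GROUP_SIZES[g])));
        }
        return roles;
    }

    // The user's roles: the first USER_PICKS[g] values of each group.
    static Set<String> userRoles() {
        final Set<String> roles = new HashSet<>();
        for (int g = 0; g < GROUP_SIZES.length; g++) {
            for (int v = 0; v < USER_PICKS[g]; v++) {
                roles.add(role(g, v));
            }
        }
        return roles;
    }

    // Probability that a document shares at least one role with the user:
    // 1 minus the product of the per-group miss probabilities.
    static double expectedMatchFraction() {
        double miss = 1.0;
        for (int g = 0; g < GROUP_SIZES.length; g++) {
            miss *= 1.0 - (double) USER_PICKS[g] / GROUP_SIZES[g];
        }
        return 1.0 - miss;
    }

    // Monte Carlo check of the same fraction over randomly generated documents.
    static double simulatedMatchFraction(final int docs, final long seed) {
        final Random rnd = new Random(seed);
        final Set<String> user = userRoles();
        int hits = 0;
        for (int i = 0; i < docs; i++) {
            for (final String r : randomDocRoles(rnd)) {
                if (user.contains(r)) {
                    hits++;
                    break;
                }
            }
        }
        return (double) hits / docs;
    }

    public static void main(final String[] args) {
        System.out.printf("expected match fraction:  %.4f%n", expectedMatchFraction());
        System.out.printf("simulated match fraction: %.4f%n",
                simulatedMatchFraction(200_000, 42L));
    }
}
```

Under these assumptions the user holds 5 role terms, and the restriction
matches roughly three quarters of all documents, almost entirely because of
the two low-cardinality groups. That may be worth keeping in mind when
reading the numbers above.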


On Tue, Oct 13, 2020 at 11:56 AM Rob Audenaerde <rob.audenae...@gmail.com> wrote:

> Hello Adrien,
>
> Thanks for the swift reply. I'll add the details:
>
> Lucene version: 8.6.2
>
> The restrictionQuery is indeed a conjunction; it also allows a document to
> be a hit if the 'roles' field is empty. It's used within a
> bigger query builder, so maybe I did something else wrong. I'll rewrite the
> benchmark to just compare the TermInSetQuery and TermQuery variants.
>
> It never occurred (hah) to me to use Occur.FILTER, that is a good point to
> check as well.
>
> As you put it, I would expect the results to be very similar, as I do not
> reach the 16 terms in the TermInSetQuery. I'll let you know what I find.
>
> On Tue, Oct 13, 2020 at 11:48 AM Adrien Grand <jpou...@gmail.com> wrote:
>
>> Can you give us a few more details:
>>  - What version of Lucene are you testing?
>>  - Are you benchmarking "restrictionQuery" on its own, or its conjunction
>> with another query?
>>
>> You mentioned that you combine your "restrictionQuery" and the user query
>> with Occur.MUST. Occur.FILTER feels more appropriate for "restrictionQuery"
>> since it should not contribute to scoring.
>>
>> TermInSetQuery automatically executes like a BooleanQuery when the number
>> of clauses is less than 16, so I would not expect major performance
>> differences between a TermInSetQuery over fewer than 16 terms and a
>> BooleanQuery wrapped in a ConstantScoreQuery.
>>
>> On Tue, Oct 13, 2020 at 11:35 AM Rob Audenaerde <rob.audenae...@gmail.com> wrote:
>>
>> > Hello,
>> >
>> > I'm benchmarking an application which implements security on Lucene by
>> > adding a multi-value field "roles". If the user has one of these roles,
>> > he can find the document.
>> >
>> > I implemented this as a Boolean AND query, adding the original query and
>> > the restriction with Occur.MUST.
>> >
>> > I'm having some performance issues when counting the index (>60M docs),
>> > so I thought about tweaking this restriction implementation.
>> >
>> > I set up a benchmark like this:
>> >
>> > I generate 2M documents. Each document has a multi-value "roles" field.
>> > The "roles" field in each document has 4 values, taken from
>> > (2, 2, 1000, 100) unique values.
>> > The user has (1, 1, 2, 1) values for roles (so 1 out of 2 for the first
>> > role, 1 out of 2 for the second, 2 out of 1000 for the third value,
>> > and 1 out of 100 for the fourth).
>> >
>> > I got a somewhat unexpected performance difference. At first, I
>> > implemented the restriction query like this:
>> >
>> > for (final String role : roles) {
>> >     restrictionQuery.add(new TermQuery(new Term("roles", new BytesRef(role))),
>> >             Occur.SHOULD);
>> > }
>> >
>> > I then switched to a TermInSetQuery, which I thought would be faster
>> > as it uses constant scores.
>> >
>> > final Set<BytesRef> rolesSet =
>> >         roles.stream().map(BytesRef::new).collect(Collectors.toSet());
>> > restrictionQuery.add(new TermInSetQuery("roles", rolesSet), Occur.SHOULD);
>> >
>> >
>> > However, the TermInSetQuery is about 25% slower (in ops/s). Is that to
>> > be expected? I did not expect it, as I thought the constant scoring would
>> > be faster.
>> >
>>
>>
>> --
>> Adrien
>>
>
