The TermInSetQuery is definitely faster. Unfortunately it doesn't seem to
return the number of terms that matched in a given document; it just
returns the boost value. I'll look into copying/modifying the internals to
return the number of matched terms.
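For reference, the per-doc count I'm after is essentially the following (a minimal stdlib sketch, not Lucene internals: class and method names are hypothetical, and plain sorted int arrays stand in for the per-term postings iterators that Lucene's DisjunctionDISIApproximation merges):

```java
import java.util.HashMap;
import java.util.Map;

public class MatchCounter {
    // Given one doc-id posting list per query term, count how many
    // terms matched each doc. This is the per-doc score I want,
    // instead of a constant boost.
    public static Map<Integer, Integer> countMatches(int[][] postings) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int[] termDocs : postings) {
            for (int doc : termDocs) {
                counts.merge(doc, 1, Integer::sum); // one more term matched this doc
            }
        }
        return counts;
    }
}
```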

Thanks
- AK

On Tue, Jun 23, 2020 at 3:17 PM Alex K <aklib...@gmail.com> wrote:

> Hi Michael,
> Thanks for the quick response!
>
> I will look into the TermInSetQuery.
>
> My usage of "heap" might've been confusing.
> I'm using a FunctionScoreQuery from Elasticsearch.
> This gets instantiated with a Lucene query, in this case the boolean query
> as I described it, as well as a custom ScoreFunction object.
> The ScoreFunction exposes a single method that takes a doc id and the
> BooleanQuery score for that doc id, and returns another score.
> In that method I use a MinMaxPriorityQueue from the Guava library to
> maintain a fixed-capacity subset of the highest-scoring docs and evaluate
> exact similarity on them.
> Once the queue is at capacity, I just return 0 for any docs that had a
> boolean query score smaller than the min in the queue.
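Roughly, the gating logic in that method looks like this (a sketch using the JDK's PriorityQueue as the min-heap in place of Guava's MinMaxPriorityQueue; the class and method names are hypothetical):

```java
import java.util.PriorityQueue;

public class TopKGate {
    private final int capacity;
    // Min-heap of the best boolean-query scores seen so far.
    private final PriorityQueue<Double> heap = new PriorityQueue<>();

    public TopKGate(int capacity) {
        this.capacity = capacity;
    }

    // Returns true if this boolean-query score is good enough to warrant
    // loading the stored vector and computing the exact similarity.
    public boolean offer(double boolScore) {
        if (heap.size() < capacity) {
            heap.add(boolScore);
            return true;
        }
        if (boolScore > heap.peek()) { // beats the current minimum
            heap.poll();
            heap.add(boolScore);
            return true;
        }
        return false; // caller just returns 0 for this doc
    }
}
```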
>
> But you can actually forget entirely that this ScoreFunction exists. It
> only contributes ~6% of the runtime.
> Even if I only use the BooleanQuery by itself, I still see the same
> behavior and bottlenecks.
>
> Thanks
> - AK
>
>
> On Tue, Jun 23, 2020 at 2:06 PM Michael Sokolov <msoko...@gmail.com>
> wrote:
>
>> You might consider using a TermInSetQuery in place of a BooleanQuery
>> for the hashes (since they are all in the same field).
>>
>> I don't really understand why you are seeing so much cost in the heap
>> - it sounds as if you have a single heap with mixed scores - those
>> generated by the BooleanQuery and those generated by the vector
>> scoring operation. Maybe you can comment a little more on the interaction
>> there - are there really two heaps? Do you override the standard
>> collector?
>>
>> On Tue, Jun 23, 2020 at 9:51 AM Alex K <aklib...@gmail.com> wrote:
>> >
>> > Hello all,
>> >
>> > I'm working on an Elasticsearch plugin (using Lucene internally) that
>> > allows users to index numerical vectors and run exact and approximate
>> > k-nearest-neighbors similarity queries.
>> > I'd like to get some feedback about my usage of BooleanQueries and
>> > TermQueries, and see if there are any optimizations or performance
>> > tricks for my use case.
>> >
>> > An example use case for the plugin is reverse image search. A user can
>> > store vectors representing images and run a nearest-neighbors query to
>> > retrieve the 10 vectors with the smallest L2 distance to a query vector.
>> > More detailed documentation here: http://elastiknn.klibisz.com/
>> >
>> > The main method for indexing the vectors is based on Locality Sensitive
>> > Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>.
>> > The general pattern is:
>> >
>> >    1. When indexing a vector, apply a hash function to it, producing
>> >    a set of discrete hashes. Usually there are anywhere from 100 to
>> >    1000 hashes. Similar vectors are more likely to share hashes
>> >    (i.e., similar vectors produce hash collisions).
>> >    2. Convert each hash to a byte array and store the byte array as
>> >    a Lucene Term at a specific field.
>> >    3. Store the complete vector (i.e. floating point numbers) in a
>> >    binary doc values field.
>> >
>> > In other words, I'm converting each vector into a bag of words,
>> > though the words have no semantic meaning.
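The byte conversion in step 2 above is just this (a minimal sketch; the real plugin's hashes come from the LSH function, and the resulting byte[] is what gets wrapped in a BytesRef and indexed as a Term — the class name here is hypothetical):

```java
import java.nio.ByteBuffer;

public class HashBytes {
    // Turn an int hash into the byte[] that backs the indexed Term.
    // ByteBuffer writes big-endian by default, so equal hashes always
    // produce identical byte arrays (and thus identical Terms).
    public static byte[] toBytes(int hash) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(hash).array();
    }
}
```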
>> >
>> > A query works as follows:
>> >
>> >    1. Given a query vector, apply the same hash function to produce
>> >    a set of hashes.
>> >    2. Convert each hash to a byte array and create a Term.
>> >    3. Build and run a BooleanQuery with a clause for each Term. Each
>> >    clause looks like this: `new BooleanClause(new ConstantScoreQuery(
>> >    new TermQuery(new Term(field, new BytesRef(hashValue.toByteArray)))),
>> >    BooleanClause.Occur.SHOULD)`.
>> >    4. As the BooleanQuery produces results, maintain a fixed-size
>> >    heap of its scores. For any score exceeding the min in the heap,
>> >    load its vector from the binary doc values, compute the exact
>> >    similarity, and update the heap. Otherwise the vector gets a
>> >    score of 0.
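The exact re-scoring in step 4 boils down to an L2 distance against the stored vector, converted to a similarity so that closer vectors score higher (a self-contained sketch; the class name and the 1/(1+d) score mapping are my own choices here, not necessarily what the plugin does):

```java
public class ExactL2 {
    // Exact L2 (Euclidean) distance between the query vector and a
    // stored vector loaded from binary doc values.
    public static double l2Distance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // One common way to turn a distance into a score: identical vectors
    // score 1.0, and the score decays toward 0 as distance grows.
    public static double score(float[] query, float[] stored) {
        return 1.0 / (1.0 + l2Distance(query, stored));
    }
}
```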
>> >
>> > When profiling my benchmarks with VisualVM, I've found the Elasticsearch
>> > search threads spend > 50% of the runtime in these two methods:
>> >
>> >    - org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of
>> >    runtime)
>> >    - org.apache.lucene.search.DisjunctionDISIApproximation.nextDoc
>> >    (~8% of runtime)
>> >
>> > So the time seems to be dominated by collecting and ordering the results
>> > produced by the BooleanQuery from step 3 above.
>> > The exact similarity computation is only about 15% of the runtime. If I
>> > disable it entirely, I still see the same bottlenecks in VisualVM.
>> > Reducing the number of hashes yields roughly linear scaling (i.e., 400
>> > hashes take ~2x longer than 200 hashes).
>> >
>> > The use case seems different from text search in that there's no
>> > semantic meaning to the terms, their length, their ordering, their
>> > stems, etc.
>> > I basically just need the index to be a rudimentary HashMap, and I only
>> > care about the scores for the top k results.
>> > With that in mind, I've made the following optimizations:
>> >
>> >    - Disabled tokenization on the FieldType (setTokenized(false))
>> >    - Disabled norms on the FieldType (setOmitNorms(true))
>> >    - Set similarity to BooleanSimilarity on the Elasticsearch
>> >    MappedFieldType
>> >    - Set index options to IndexOptions.Docs.
>> >    - Used the MoreLikeThis heuristic to pick a subset of terms. This
>> >    understandably only yields a speedup proportional to the number of
>> >    discarded terms.
>> >
>> > I'm using Elasticsearch version 7.6.2 with Lucene 8.4.0.
>> > The main query implementation is here:
>> > https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala
>> > The actual query that gets executed by Elasticsearch is instantiated
>> > on line 98:
>> > https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala#L98
>> > It's in Scala but all of the Java query classes should look familiar.
>> >
>> > Maybe there are some settings that I'm not aware of?
>> > Maybe I could optimize this by implementing a custom query or scorer?
>> > Maybe there's just no way to speed this up?
>> >
>> > I appreciate any input, examples, links, etc. :)
>> > Also, let me know if I can provide any additional details.
>> >
>> > Thanks,
>> > Alex Klibisz
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
