On Tue, 2020-06-23 at 09:50 -0400, Alex K wrote:
> I'm working on an Elasticsearch plugin (using Lucene internally) that
> allows users to index numerical vectors and run exact and approximate
> k-nearest-neighbors similarity queries.

Quite a coincidence. I'm looking into the same thing :-)

>   1. When indexing a vector, apply a hash function to it, producing
> a set of discrete hashes. Usually there are anywhere from 100 to 1000
> hashes.

Is it important to have "infinite" scaling with an inverted index, or is it
acceptable to have a (fast) sequential scan through all documents? If
the use case is to combine the nearest neighbour search with other
filters, so that the effective search-space is relatively small, you
could go directly to computing the Euclidean distance (or whatever you
use to calculate the exact similarity score).
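If you go that route, the exact part is trivial: just apply the measure to
the raw vectors of the candidates that survive the filters. A minimal
plain-Java sketch (no Lucene APIs assumed), using Euclidean distance:

    // Exact distance between a query vector and a candidate vector.
    static double euclideanDistance(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff; // squared difference per dimension
        }
        return Math.sqrt(sum);
    }

For a small candidate set this is just a tight loop over the raw vectors.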

>   4. As the BooleanQuery produces results, maintain a fixed-size
> heap of its scores. For any score exceeding the min in the heap, load
> its vector from the binary doc values, compute the exact similarity,
> and update the heap.

I did something quite similar for a non-Lucene-based proof of concept,
except that I delayed the exact similarity calculation and over-
collected on the heap.
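Very roughly, the over-collection looks like this (plain-Java sketch; the
Candidate class, the overCollectFactor knob and the exactScore callback
are mine, purely for illustration, not from either implementation):

    import java.util.*;
    import java.util.function.IntToDoubleFunction;

    public class OverCollect {

        /** Candidate doc id with its cheap approximate score. */
        static class Candidate {
            final int docId;
            final float approxScore;
            Candidate(int docId, float approxScore) {
                this.docId = docId;
                this.approxScore = approxScore;
            }
        }

        /**
         * Keeps overCollectFactor * k candidates by approximate score,
         * then re-scores only those survivors with the exact measure.
         */
        static List<Integer> topK(Iterable<Candidate> approxHits, int k,
                                  int overCollectFactor,
                                  IntToDoubleFunction exactScore) {
            int capacity = k * overCollectFactor;
            // Min-heap on approximate score: weakest candidate at the top.
            PriorityQueue<Candidate> heap = new PriorityQueue<>(
                capacity, Comparator.comparingDouble((Candidate c) -> c.approxScore));
            for (Candidate c : approxHits) {
                if (heap.size() < capacity) {
                    heap.add(c);
                } else if (c.approxScore > heap.peek().approxScore) {
                    heap.poll();
                    heap.add(c);
                }
            }
            // Delayed exact similarity: only the survivors are re-scored.
            List<Candidate> survivors = new ArrayList<>(heap);
            survivors.sort(Comparator.comparingDouble(
                (Candidate c) -> exactScore.applyAsDouble(c.docId)).reversed());
            List<Integer> result = new ArrayList<>();
            for (int i = 0; i < Math.min(k, survivors.size()); i++) {
                result.add(survivors.get(i).docId);
            }
            return result;
        }
    }

The point of over-collecting is that the cheap score is only used to prune,
so a few mis-rankings among the survivors do not matter: the final order
comes from the exact measure.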

Fleshing that out: Instead of producing similarity hashes, I extracted
the top-X strongest signals (entries in the vector) and stored them as
indexes into the raw vector, so the top-3 signals from [10, 3, 6, 12,
1, 20] are [0, 3, 5]. The query was similar to your "match as many as
possible", just with indexes instead of hashes.

>    - org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of
> runtime)

This sounds strange. How large is your queue? Object-based priority
queues tend to become slow when they get large (100K+ values).

> Maybe I could optimize this by implementing a custom query or scorer?

My plan for a better implementation is to use an autoencoder to produce
a condensed representation of the raw vector for a document. In order
to do so, a network must be trained on (ideally) the full corpus, so it
will require a bootstrap process and will probably work poorly if
incoming vectors differ substantially in nature from the existing ones
(at least until the autoencoder is retrained and the condensed
representations are reindexed). As our domain is an ever-growing image
collection with fairly well-defined types of images (portraits, line
drawings, maps...), and new types are introduced rarely, this is
acceptable for us.

Back to Lucene, the condensed representation is expected to be a bitmap
where the (coarse) similarity between two representations is simply the
number of set bits at the same locations: An AND and a POPCNT of the
bitmaps.
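On the JVM that is just a loop over the long[] words, with Long.bitCount
normally compiling down to a POPCNT instruction. Sketch:

    // Coarse similarity: number of bit positions set in both bitmaps.
    static int coarseSimilarity(long[] a, long[] b) {
        int matches = 0;
        for (int i = 0; i < a.length; i++) {
            matches += Long.bitCount(a[i] & b[i]); // AND + POPCNT per 64-bit word
        }
        return matches;
    }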

This does imply a sequential pass of all potential documents, which
means that it won't scale well. On the other hand each comparison is a
fast check with very low memory overhead, so I hope it will work for
the few million images that we need it for.

- Toke Eskildsen, Royal Danish Library


