Hi Tommaso, thanks for the input and links! I'll add your paper to my
literature review.

So far I've seen very promising results from modifying the TermInSetQuery.
It was pretty simple to keep a map of `doc id -> matched term count` and
then only evaluate the exact similarity on the top k doc ids.
On a small benchmark, I was able to drop the time for 1000 queries from 45
seconds to 14 seconds.
Now the bottleneck is back in my own code, which I'm happy with because I
can optimize that more easily.
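For the curious, the counting-and-rescoring step looks roughly like this (a simplified sketch, not the actual plugin code; the class and method names are made up):

```java
import java.util.Map;
import java.util.PriorityQueue;

public class CountingRescorer {
    /**
     * Given a map of doc id -> matched term count, return the k doc ids
     * with the highest counts; only these get the exact similarity pass.
     */
    public static int[] topKByTermCount(Map<Integer, Integer> matchedTermCounts, int k) {
        // Min-heap ordered by count, so the smallest count is evicted first.
        PriorityQueue<Map.Entry<Integer, Integer>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<Integer, Integer> e : matchedTermCounts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll();
        }
        return heap.stream().mapToInt(Map.Entry::getKey).toArray();
    }
}
```

The point is just that the heap stays at size k, so the expensive exact similarity only runs on k docs instead of every doc the BooleanQuery matches.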
Hopefully I can merge these changes in the next couple of days, and I'll
post the diff when I do.

- AK



On Thu, Jun 25, 2020 at 5:07 AM Tommaso Teofili <tommaso.teof...@gmail.com>
wrote:

> hi Alex,
>
> I had worked on a similar problem directly on Lucene (within the Anserini
> toolkit) using LSH fingerprints of tokenized feature vector values.
> You can find code at [1] and some information on the Anserini documentation
> page [2] and in a short preprint [3].
> As a side note, my current thinking is that it would be very cool if we
> could leverage Lucene's N-dimensional point support by properly reducing
> the dimensionality of the original vectors; however, that is hard to do
> without losing important information.
>
> My 2 cents,
> Tommaso
>
> [1] :
>
> https://github.com/castorini/anserini/tree/master/src/main/java/io/anserini/ann
> [2] :
>
> https://github.com/castorini/anserini/blob/master/docs/approximate-nearestneighbor.md
> [3] : https://arxiv.org/abs/1910.10208
>
>
>
>
>
> On Wed, 24 Jun 2020 at 19:47, Alex K <aklib...@gmail.com> wrote:
>
> > Hi Toke. Indeed a nice coincidence. It's an interesting and fun problem
> > space!
> >
> > My implementation isn't specific to any particular dataset or access
> > pattern (i.e. infinite vs. subset).
> > So far the plugin supports exact L1, L2, Jaccard, Hamming, and Angular
> > similarities with LSH variants for all but L1.
> > My exact implementation is generally faster than the approximate LSH
> > implementation, hence the thread.
> > You make a good point that this is valuable by itself if you're able to
> > filter down to a small subset of docs.
> > I put a lot of work into optimizing the vector serialization speed and
> > the exact query execution.
> > I imagine with my current implementation there is some breaking point
> > where LSH becomes faster than exact, but so far I've tested with ~1.2M
> > ~300-dimensional vectors and exact is still faster, especially when
> > parallelized across many shards.
> > So speeding up LSH is the current engineering challenge.
> >
> > Are you using Elasticsearch or Lucene directly?
> > If you're using ES and have the time, I'd love some feedback on my
> > plugin.
> > It sounds like you want to compute hamming similarity on your bitmaps?
> > If so that's currently supported.
> > There's an example here:
> > http://demo.elastiknn.klibisz.com/dataset/mnist-hamming?queryId=64121
> >
> > Also I've compiled a small literature review on some related research
> > here:
> > https://docs.google.com/document/d/14Z7ZKk9dq29bGeDDmBH6Bsy92h7NvlHoiGhbKTB0YJs/edit
> > *Fast and Exact NNS in Hamming Space on Full-Text Search Engines*
> > describes some clever tricks to speed up Hamming similarity.
> > *Large Scale Image Retrieval with Elasticsearch* describes the idea of
> > using the largest absolute magnitude values instead of the full vector.
> > Perhaps you've already read them but I figured I'd share.
> >
> > Cheers
> > - AK
> >
> >
> >
> > On Wed, Jun 24, 2020 at 8:44 AM Toke Eskildsen <t...@kb.dk> wrote:
> >
> > > On Tue, 2020-06-23 at 09:50 -0400, Alex K wrote:
> > > > I'm working on an Elasticsearch plugin (using Lucene internally) that
> > > > allows users to index numerical vectors and run exact and approximate
> > > > k-nearest-neighbors similarity queries.
> > >
> > > Quite a coincidence. I'm looking into the same thing :-)
> > >
> > > >   1. When indexing a vector, apply a hash function to it, producing
> > > > a set of discrete hashes. Usually there are anywhere from 100 to 1000
> > > > hashes.
> > >
> > > Is it important to have "infinite" scaling with inverted index or is it
> > > acceptable to have a (fast) sequential scan through all documents? If
> > > the use case is to combine the nearest neighbour search with other
> > > filters, so that the effective search-space is relatively small, you
> > > could go directly to computing the Euclidean distance (or whatever you
> > > use to calculate the exact similarity score).
> > >
> > > >   4. As the BooleanQuery produces results, maintain a fixed-size
> > > > heap of its scores. For any score exceeding the min in the heap, load
> > > > its vector from the binary doc values, compute the exact similarity,
> > > > and update the heap.
> > >
> > > I did something quite similar for a non-Lucene based proof of concept,
> > > except that I delayed the exact similarity calculation and over-
> > > collected on the heap.
> > >
> > > Fleshing that out: Instead of producing similarity hashes, I extracted
> > > the top-X strongest signals (entries in the vector) and stored them as
> > > indexes from the raw vector, so the top-3 signals from [10, 3, 6, 12,
> > > 1, 20] are [0, 3, 5]. The query was similar to your "match as many as
> > > possible", just with indexes instead of hashes.
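A minimal sketch of that top-X index extraction, as I read it (hypothetical code, not the actual proof of concept, assuming a plain double[] vector):

```java
import java.util.Comparator;
import java.util.stream.IntStream;

public class TopSignals {
    /** Return the indexes of the x largest entries, in ascending index order. */
    public static int[] topIndexes(double[] vector, int x) {
        return IntStream.range(0, vector.length)
                .boxed()
                .sorted(Comparator.comparingDouble(i -> -vector[i])) // strongest first
                .limit(x)
                .sorted() // back to ascending index order
                .mapToInt(Integer::intValue)
                .toArray();
    }
}
```

With the vector from the example, [10, 3, 6, 12, 1, 20] and x = 3, this yields [0, 3, 5].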
> > >
> > > >    - org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of
> > > > runtime)
> > >
> > > This sounds strange. How large is your queue? Object-based priority
> > > queues tend to become slow when they get large (100K+ values).
> > >
> > > > Maybe I could optimize this by implementing a custom query or scorer?
> > >
> > > My plan for a better implementation is to use an autoencoder to produce
> > > a condensed representation of the raw vector for a document. In order
> > > to do so, a network must be trained on (ideally) the full corpus, so it
> > > will require a bootstrap process and will probably work poorly if
> > > incoming vectors differ substantially in nature from the existing ones
> > > (at least until the autoencoder is retrained and the condensed
> > > representations are reindexed). As our domain is an always-growing
> > > image collection with fairly well-defined types of images (portraits,
> > > line drawings, maps...) and since new types are introduced rarely,
> > > this is acceptable for us.
> > >
> > > Back to Lucene, the condensed representation is expected to be a bitmap
> > > where the (coarse) similarity between two representations is simply the
> > > number of set bits at the same locations: An AND and a POPCNT of the
> > > bitmaps.
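That coarse similarity is cheap to express in Java; an illustrative sketch over long[] bitmaps (JIT-compiled Long.bitCount typically maps to the POPCNT instruction):

```java
public class BitmapSimilarity {
    /** Number of bit positions set in both bitmaps: AND each word, then popcount. */
    public static int sharedBits(long[] a, long[] b) {
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            count += Long.bitCount(a[i] & b[i]);
        }
        return count;
    }
}
```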
> > >
> > > This does imply a sequential pass of all potential documents, which
> > > means that it won't scale well. On the other hand each comparison is a
> > > fast check with very low memory overhead, so I hope it will work for
> > > the few million images that we need it for.
> > >
> > > - Toke Eskildsen, Royal Danish Library
> > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
> >
>
