Hi Alex,

I worked on a similar problem directly in Lucene (within the Anserini
toolkit) using LSH fingerprints of tokenized feature vector values.
You can find the code at [1], along with some information on the Anserini
documentation page [2] and in a short preprint [3].
As a side note, my current thinking is that it would be very cool if we
could leverage Lucene's N-dimensional point support by properly reducing the
dimensionality of the original vectors; however, that is hard to do without
losing important information.
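To give a rough idea of the fingerprinting approach, here is a simplified sketch of sign-based random-projection LSH. This is not the actual Anserini code; the class name and the NUM_TABLES / BITS_PER_TABLE parameters are made up for the example:

```java
import java.util.Random;

// Simplified sketch of LSH fingerprinting: project a vector onto random
// hyperplanes and pack groups of sign bits into discrete hash tokens that
// can be indexed as terms. Parameters are illustrative only.
public class LshSketch {
    static final int NUM_TABLES = 8;      // number of hash tokens emitted
    static final int BITS_PER_TABLE = 16; // sign bits packed into each token

    // One random projection vector per bit, fixed by a seed so that
    // indexing and querying agree on the same hash family.
    static float[][] projections(int dims, long seed) {
        Random rnd = new Random(seed);
        float[][] p = new float[NUM_TABLES * BITS_PER_TABLE][dims];
        for (float[] row : p)
            for (int d = 0; d < dims; d++) row[d] = (float) rnd.nextGaussian();
        return p;
    }

    // Produce NUM_TABLES discrete hashes; identical vectors always collide,
    // and nearby vectors collide with high probability.
    static int[] fingerprint(float[] vec, float[][] proj) {
        int[] tokens = new int[NUM_TABLES];
        for (int t = 0; t < NUM_TABLES; t++) {
            int token = 0;
            for (int b = 0; b < BITS_PER_TABLE; b++) {
                float dot = 0f;
                float[] row = proj[t * BITS_PER_TABLE + b];
                for (int d = 0; d < vec.length; d++) dot += row[d] * vec[d];
                token = (token << 1) | (dot >= 0 ? 1 : 0);
            }
            tokens[t] = token;
        }
        return tokens;
    }
}
```

Each token can then be indexed as an ordinary term, so candidate retrieval becomes a plain term-matching query.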

My 2 cents,
Tommaso

[1] :
https://github.com/castorini/anserini/tree/master/src/main/java/io/anserini/ann
[2] :
https://github.com/castorini/anserini/blob/master/docs/approximate-nearestneighbor.md
[3] : https://arxiv.org/abs/1910.10208

On Wed, 24 Jun 2020 at 19:47, Alex K <aklib...@gmail.com> wrote:

> Hi Toke. Indeed a nice coincidence. It's an interesting and fun problem
> space!
>
> My implementation isn't specific to any particular dataset or access
> pattern (i.e. infinite vs. subset).
> So far the plugin supports exact L1, L2, Jaccard, Hamming, and Angular
> similarities with LSH variants for all but L1.
> My exact implementation is generally faster than the approximate LSH
> implementation, hence the thread.
> You make a good point that this is valuable by itself if you're able to
> filter down to a small subset of docs.
> I put a lot of work into optimizing the vector serialization speed and the
> exact query execution.
> I imagine with my current implementation there is some breaking point where
> LSH becomes faster than exact, but so far I've tested with ~1.2M
> ~300-dimensional vectors and exact is still faster, especially when
> parallelized across many shards.
> So speeding up LSH is the current engineering challenge.
>
> Are you using Elasticsearch or Lucene directly?
> If you're using ES and have the time, I'd love some feedback on my plugin.
> It sounds like you want to compute Hamming similarity on your bitmaps?
> If so that's currently supported.
> There's an example here:
> http://demo.elastiknn.klibisz.com/dataset/mnist-hamming?queryId=64121
>
> Also I've compiled a small literature review on some related research here:
>
> https://docs.google.com/document/d/14Z7ZKk9dq29bGeDDmBH6Bsy92h7NvlHoiGhbKTB0YJs/edit
> *Fast and Exact NNS in Hamming Space on Full-Text Search Engines* describes
> some clever tricks to speed up Hamming similarity.
> *Large Scale Image Retrieval with Elasticsearch* describes the idea of
> using the largest absolute magnitude values instead of the full vector.
> Perhaps you've already read them but I figured I'd share.
>
> Cheers
> - AK
>
>
>
> On Wed, Jun 24, 2020 at 8:44 AM Toke Eskildsen <t...@kb.dk> wrote:
>
> > On Tue, 2020-06-23 at 09:50 -0400, Alex K wrote:
> > > I'm working on an Elasticsearch plugin (using Lucene internally) that
> > > allows users to index numerical vectors and run exact and approximate
> > > k-nearest-neighbors similarity queries.
> >
> > Quite a coincidence. I'm looking into the same thing :-)
> >
> > >   1. When indexing a vector, apply a hash function to it, producing
> > > a set of discrete hashes. Usually there are anywhere from 100 to 1000
> > > hashes.
> >
> > Is it important to have "infinite" scaling with an inverted index or is
> > it acceptable to have a (fast) sequential scan through all documents? If
> > the use case is to combine the nearest neighbour search with other
> > filters, so that the effective search-space is relatively small, you
> > could go directly to computing the Euclidean distance (or whatever you
> > use to calculate the exact similarity score).
> >
> > >   4. As the BooleanQuery produces results, maintain a fixed-size
> > > heap of its scores. For any score exceeding the min in the heap, load
> > > its vector from the binary doc values, compute the exact similarity,
> > > and update the heap.
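The heap-based re-ranking described in that step could be sketched roughly as follows. This is a minimal illustration, not the plugin's code: the exactSimilarity callback stands in for loading the vector from binary doc values and scoring it, and it assumes the approximate score correlates with the exact one well enough to act as a gate:

```java
import java.util.PriorityQueue;

// Sketch of the re-ranking step: keep a fixed-size min-heap of exact
// scores, and only pay for the exact similarity computation when the
// approximate score beats the current worst entry in the heap.
public class TopKReranker {
    final int k;
    final PriorityQueue<double[]> heap; // entries are {exactScore, docId}

    TopKReranker(int k) {
        this.k = k;
        this.heap = new PriorityQueue<>((a, b) -> Double.compare(a[0], b[0]));
    }

    void collect(int docId, double approxScore,
                 java.util.function.IntToDoubleFunction exactSimilarity) {
        // Cheap gate: skip the exact computation unless this candidate
        // could possibly enter the top-k (assumes the approximate score
        // is comparable to, or an upper bound on, the exact score).
        if (heap.size() == k && approxScore <= heap.peek()[0]) return;
        double exact = exactSimilarity.applyAsDouble(docId);
        if (heap.size() < k) {
            heap.add(new double[]{exact, docId});
        } else if (exact > heap.peek()[0]) {
            heap.poll();
            heap.add(new double[]{exact, docId});
        }
    }
}
```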
> >
> > I did something quite similar for a non-Lucene based proof of concept,
> > except that I delayed the exact similarity calculation and over-
> > collected on the heap.
> >
> > Fleshing that out: Instead of producing similarity hashes, I extracted
> > the top-X strongest signals (entries in the vector) and stored them as
> > indexes from the raw vector, so the top-3 signals from [10, 3, 6, 12,
> > 1, 20] are [0, 3, 5]. The query was similar to your "match as many as
> > possible", just with indexes instead of hashes.
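The index-extraction step described above can be sketched like this (an illustration of the idea only, reproducing the [10, 3, 6, 12, 1, 20] → [0, 3, 5] example from the text):

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of the signal-extraction idea: keep only the indexes of the
// top-X strongest entries of each vector and index those as terms.
public class TopSignals {
    static int[] topIndexes(double[] vector, int x) {
        Integer[] idx = new Integer[vector.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort indexes by descending value, keep the first x, then sort
        // the survivors so the output is a stable, comparable term set.
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> vector[i]).reversed());
        int[] top = new int[x];
        for (int i = 0; i < x; i++) top[i] = idx[i];
        Arrays.sort(top);
        return top;
    }
}
```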
> >
> > >    - org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of
> > > runtime)
> >
> > This sounds strange. How large is your queue? Object-based priority
> > queues tend to become slow when they get large (100K+ values).
> >
> > > Maybe I could optimize this by implementing a custom query or scorer?
> >
> > My plan for a better implementation is to use an autoencoder to produce
> > a condensed representation of the raw vector for a document. In order
> > to do so, a network must be trained on (ideally) the full corpus, so it
> > will require a bootstrap process and will probably work poorly if
> > incoming vectors differ substantially in nature from the existing ones
> > (at least until the autoencoder is retrained and the condensed
> > representations are reindexed). As our domain is an ever-growing
> > image collection with fairly well-defined types of images (portraits, line
> > drawings, maps...) and since new types are introduced rarely, this is
> > acceptable for us.
> >
> > Back to Lucene, the condensed representation is expected to be a bitmap
> > where the (coarse) similarity between two representations is simply the
> > number of set bits at the same locations: An AND and a POPCNT of the
> > bitmaps.
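With the bitmaps stored as long[] arrays, that comparison is just a few instructions per word; a minimal sketch (Long.bitCount is intrinsified to a single POPCNT instruction on modern x86 JVMs):

```java
// Sketch of the coarse similarity: bitmaps stored as long[] words,
// score = number of bit positions set in both bitmaps (AND + POPCNT).
public class BitmapSimilarity {
    static int sharedBits(long[] a, long[] b) {
        int score = 0;
        for (int i = 0; i < a.length; i++) {
            score += Long.bitCount(a[i] & b[i]);
        }
        return score;
    }
}
```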
> >
> > This does imply a sequential pass of all potential documents, which
> > means that it won't scale well. On the other hand each comparison is a
> > fast check with very low memory overhead, so I hope it will work for
> > the few million images that we need it for.
> >
> > - Toke Eskildsen, Royal Danish Library
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>
