On Tue, 2020-06-23 at 09:50 -0400, Alex K wrote:
> I'm working on an Elasticsearch plugin (using Lucene internally) that
> allows users to index numerical vectors and run exact and approximate
> k-nearest-neighbors similarity queries.
Quite a coincidence. I'm looking into the same thing :-)

> 1. When indexing a vector, apply a hash function to it, producing
> a set of discrete hashes. Usually there are anywhere from 100 to 1000
> hashes.

Is it important to have "infinite" scaling with an inverted index, or
is it acceptable to have a (fast) sequential scan through all
documents? If the use case is to combine the nearest neighbour search
with other filters, so that the effective search space is relatively
small, you could go directly to computing the Euclidean distance (or
whatever you use to calculate the exact similarity score). A sketch of
that exact scoring step is at the end of this mail.

> 4. As the BooleanQuery produces results, maintain a fixed-size
> heap of its scores. For any score exceeding the min in the heap, load
> its vector from the binary doc values, compute the exact similarity,
> and update the heap.

I did something quite similar for a non-Lucene-based proof of concept,
except that I delayed the exact similarity calculation and
over-collected on the heap. Fleshing that out: instead of producing
similarity hashes, I extracted the top-X strongest signals (entries in
the vector) and stored them as indexes into the raw vector, so the
top-3 signals from [10, 3, 6, 12, 1, 20] are [0, 3, 5] (second sketch
at the end). The query was similar to your "match as many as
possible", just with indexes instead of hashes.

> - org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of
> runtime)

This sounds strange. How large is your queue? Object-based priority
queues tend to become slow when they get large (100K+ values).

> Maybe I could optimize this by implementing a custom query or scorer?

My plan for a better implementation is to use an autoencoder to
produce a condensed representation of the raw vector for a document.
In order to do so, a network must be trained on (ideally) the full
corpus, so it will require a bootstrap process and will probably work
poorly if incoming vectors differ substantially in nature from the
existing ones (at least until the autoencoder is retrained and the
condensed representations are reindexed). As our domain is an
ever-growing image collection with fairly well-defined types of images
(portraits, line drawings, maps...) and since new types are introduced
rarely, this is acceptable for us.

Back to Lucene: the condensed representation is expected to be a
bitmap, where the (coarse) similarity between two representations is
simply the number of set bits at the same locations: an AND and a
POPCNT of the bitmaps (third sketch at the end). This does imply a
sequential pass over all potential documents, which means that it
won't scale well. On the other hand, each comparison is a fast check
with very low memory overhead, so I hope it will work for the few
million images that we need it for.

- Toke Eskildsen, Royal Danish Library
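PS: Sketches of the three things referenced above. First the exact
scoring step, assuming float[] vectors and Euclidean distance; the
class and method names are placeholders, not anything from the actual
plugin:

  // Exact similarity for candidates that survive the coarse filtering.
  // Plain Euclidean distance; smaller means more similar.
  public final class ExactSimilarity {
      public static double euclideanDistance(float[] a, float[] b) {
          double sum = 0.0;
          for (int i = 0; i < a.length; i++) {
              double diff = a[i] - b[i];
              sum += diff * diff;
          }
          return Math.sqrt(sum);
      }
  }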
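The top-X signal extraction could look roughly like this (again, the
names are illustrative only):

  import java.util.Arrays;
  import java.util.Comparator;
  import java.util.stream.IntStream;

  public final class TopSignals {
      // Indexes of the x largest entries in the raw vector,
      // e.g. top-3 of [10, 3, 6, 12, 1, 20] -> [0, 3, 5].
      public static int[] topIndexes(double[] vector, int x) {
          return IntStream.range(0, vector.length)
                  .boxed()
                  .sorted(Comparator.comparingDouble((Integer i) -> vector[i]).reversed())
                  .limit(x)
                  .sorted()   // ascending index order, as in the example
                  .mapToInt(Integer::intValue)
                  .toArray();
      }

      public static void main(String[] args) {
          double[] raw = {10, 3, 6, 12, 1, 20};
          System.out.println(Arrays.toString(topIndexes(raw, 3))); // [0, 3, 5]
      }
  }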
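And the coarse bitmap comparison. This assumes the condensed
representations are stored as plain long[] words of equal length, which
is just how I picture it at the moment:

  public final class BitmapSimilarity {
      // AND the words together and count the surviving bits.
      // Long.bitCount is intrinsified to POPCNT on supporting hardware,
      // so each comparison stays at one AND plus one popcount per word.
      public static int sharedBits(long[] a, long[] b) {
          int shared = 0;
          for (int i = 0; i < a.length; i++) {
              shared += Long.bitCount(a[i] & b[i]);
          }
          return shared;
      }
  }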