On Wed, 2020-06-24 at 13:46 -0400, Alex K wrote:
> My implementation isn't specific to any particular dataset or access
> pattern (i.e. infinite vs. subset).

Without a clearly defined use case, I would say that the sequential scan
approach is not the right one: As these things go, someone will come
along and ask for scaling into the billions of images. "Someone" might be
my organization BTW: We do have a web archive and finding similar images
in that would be quite useful.

> Are you using Elasticsearch or Lucene directly?

Neither at the moment, as the driving project is on hold until fall (at
the earliest), and it was paused when I was about to switch from
prototyping (https://github.com/kb-dk/fairly-similar) to real
implementation. Hopefully I can twist another project in the direction of
using the same technology. If not, I'll just have to do it on my own
time :-)

I was hoping to use it with Solr, with an expectation of introducing the
necessary lower level mechanisms (AND & bitcount of binary content) at
the Lucene level. Failing that, maybe Lucene directly. Using
Elasticsearch is a bit of a challenge as we don't use it currently and it
would have to be added to Operation's support list.

> If you're using ES and have the time, I'd love some feedback on my
> plugin.

Sorry, not at the moment. Too many balls in the air before summer
vacation starts. I hope to find the time in August. Your post was just
too relevant to ignore.

> Also I've compiled a small literature review on some related research
> here:
> https://docs.google.com/document/d/14Z7ZKk9dq29bGeDDmBH6Bsy92h7NvlHoiGhbKTB0YJs/edit

You are clearly way ahead of us and I'll shamelessly piggyback on your
findings. I skimmed your notes and they look extremely useful.

> Fast and Exact NNS in Hamming Space on Full-Text Search Engines
> describes some clever tricks to speed up Hamming similarity.

The autoencoder approach produces bitmaps where each bit is a distinct
signal, so I guess comparison would be equivalent to binary Hamming
distance?

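To make the "AND & bitcount" primitive and the Hamming comparison a bit
more concrete, here is a minimal sketch. It assumes the bitmaps are
packed into long[] arrays of equal length. The class and method names are
made up for illustration and are not Lucene or Solr API.

```java
// Hypothetical helper, not part of Lucene: bitmaps are assumed to be
// packed 64 bits per long, with both arrays having the same length.
final class BitSimilarity {

    // AND + bitcount: number of positions where both bitmaps have a 1.
    static int sharedBits(long[] a, long[] b) {
        checkSameLength(a, b);
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            count += Long.bitCount(a[i] & b[i]);
        }
        return count;
    }

    // XOR + bitcount: number of positions where the bitmaps differ,
    // i.e. the binary Hamming distance.
    static int hammingDistance(long[] a, long[] b) {
        checkSameLength(a, b);
        int distance = 0;
        for (int i = 0; i < a.length; i++) {
            distance += Long.bitCount(a[i] ^ b[i]);
        }
        return distance;
    }

    private static void checkSameLength(long[] a, long[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "Bitmaps must have the same length: " + a.length + " vs " + b.length);
        }
    }
}
```

With x = {0b1011, 0} and y = {0b0110, 1}, sharedBits gives 1 and
hammingDistance gives 4. Whether the shared-bits count or the Hamming
distance is the better score for the autoencoder bitmaps is exactly the
open question above.
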
> Large Scale Image Retrieval with Elasticsearch describes the idea of
> using the largest absolute magnitude values instead of the full
> vector.

That approach was very promising in our local proof of concept.

> Perhaps you've already read them but I figured I'd share.

A few of them, but not all. And your notes on the articles are great.

Thanks,
Toke Eskildsen, Royal Danish Library