Re: Optimizing a boolean query for 100s of term clauses

Toke Eskildsen Thu, 25 Jun 2020 02:13:24 -0700

On Wed, 2020-06-24 at 13:46 -0400, Alex K wrote:
> My implementation isn't specific to any particular dataset or access
> pattern (i.e. infinite vs. subset).


Without a clearly defined use case, I would say that the sequential
scan approach is not the right one: As these things go, someone will
come along and ask for scaling into the billions of images. "Someone"
might be my organization BTW: We do have a web archive and finding
similar images in that would be quite useful.

> Are you using Elasticsearch or Lucene directly?

None at the moment, as the driving project is currently on hold until
fall (at the earliest), and it was paused when I was about to switch
from prototyping (https://github.com/kb-dk/fairly-similar) to real
implementation. Hopefully I can twist another project in the direction
of using the same technology. If not, I'll just have to do it on my own
time :-)

I was hoping to use it with Solr, with an expectation of introducing
the necessary lower-level mechanisms (AND & bitcount of binary content)
at the Lucene level. Failing that, maybe Lucene directly. Using
Elasticsearch is a bit of a challenge as we don't do it currently and
it would require it to be added to Operation's support list.
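
To make the "AND & bitcount" idea concrete, here is a minimal Python
sketch, assuming each document stores a fixed-length bitmap as raw
bytes. The function name and the byte layout are illustrative
assumptions; this is not Lucene or Solr code.

```python
def and_bitcount(a: bytes, b: bytes) -> int:
    """Count the bit positions that are set in both bitmaps."""
    if len(a) != len(b):
        raise ValueError("bitmaps must have the same length")
    # Interpret each bitmap as one big integer, AND them, count set bits.
    shared = int.from_bytes(a, "big") & int.from_bytes(b, "big")
    return bin(shared).count("1")


query = bytes([0b11110000, 0b00001111])
candidate = bytes([0b10100000, 0b00000101])
score = and_bitcount(query, candidate)  # higher means more shared signals
```

A higher count means the two bitmaps share more set bits, so it can be
used directly as a similarity score.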

> If you're using ES and have the time, I'd love some feedback on my
> plugin.

Sorry, not at the moment. Too many balls in the air before summer
vacation starts. I hope to find the time in August. Your post was just
too relevant to ignore.

> Also I've compiled a small literature review on some related research
> here: 
> https://docs.google.com/document/d/14Z7ZKk9dq29bGeDDmBH6Bsy92h7NvlHoiGhbKTB0YJs/edit

You are clearly way ahead of us and I'll shamelessly piggyback on your
findings. I skimmed your notes and they look extremely useful.

> Fast and Exact NNS in Hamming Space on Full-Text Search Engines
> describes some clever tricks to speed up Hamming similarity.

The autoencoder approach produces bitmaps where each bit is a distinct
signal, so I guess comparison would be equivalent to binary Hamming
distance?
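
For reference, binary Hamming distance is the number of positions where
two equal-length bitmaps differ, i.e. the bitcount of their XOR. A
small Python sketch, assuming bitmaps are stored as bytes (the function
name is illustrative):

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions where the two bitmaps differ."""
    if len(a) != len(b):
        raise ValueError("bitmaps must have the same length")
    # XOR leaves a 1 exactly where the inputs disagree.
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return bin(diff).count("1")


distance = hamming_distance(bytes([0b11110000]), bytes([0b10100101]))
```

A lower distance means more similar bitmaps. Unlike a plain AND
bitcount, positions where both bits are 0 also count as agreement here.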

> Large Scale Image Retrieval with Elasticsearch describes the idea of
> using the largest absolute magnitude values instead of the full
> vector.

That approach was very promising in our local proof of concept.
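
As a rough illustration of the idea, the Python sketch below keeps only
the k dimensions with the largest absolute values and turns each into a
term that a search engine could index. The "index plus sign" token
format is an assumption made for this example; it is not taken from the
article or from our proof of concept.

```python
from typing import List, Sequence


def top_k_tokens(vector: Sequence[float], k: int) -> List[str]:
    """Return tokens for the k dimensions with the largest |value|."""
    # Rank dimensions by descending magnitude; ties go to the lower index.
    ranked = sorted(range(len(vector)), key=lambda i: (-abs(vector[i]), i))
    kept = sorted(ranked[:k])
    # Hypothetical token format: dimension index followed by the sign.
    return [f"{i}{'+' if vector[i] >= 0 else '-'}" for i in kept]


tokens = top_k_tokens([0.1, -0.9, 0.3, 0.8, -0.05], k=2)
```

Two images then count as similar when their token sets overlap, which
maps onto an ordinary boolean query over term clauses.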

> Perhaps you've already read them but I figured I'd share.

A few of them, but not all. And your notes on the articles are great.

Thanks,
Toke Eskildsen, Royal Danish Library



