Re: Optimizing a boolean query for 100s of term clauses

2020-06-25 Thread Alex K
Hi Tommaso, thanks for the input and links! I'll add your paper to my literature review. So far I've seen very promising results from modifying the TermInSetQuery. It was pretty simple to keep a map of `doc id -> matched term count` and then only evaluate the exact similarity on the top k doc ids.
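The two-stage approach Alex describes (count matched terms per doc, then evaluate exact similarity only on the top k doc ids) can be sketched as below. This is a minimal illustration, not the actual TermInSetQuery modification: the postings map and dense doc vectors are simplified stand-ins for Lucene's internals.

```python
import heapq
from collections import Counter

def two_stage_search(query_terms, postings, doc_vectors, query_vec, k=10):
    # Stage 1: cheap counting -- one increment per posting hit.
    counts = Counter()
    for term in query_terms:
        for doc_id in postings.get(term, ()):
            counts[doc_id] += 1
    # Keep only the k docs that matched the most query terms.
    candidates = heapq.nlargest(k, counts, key=counts.get)
    # Stage 2: exact similarity, paid only for the k candidates.
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return sorted(candidates, key=lambda d: dot(doc_vectors[d], query_vec),
                  reverse=True)
```

The appeal is that stage 1 is a single integer increment per posting, so the expensive exact-similarity computation is bounded by k rather than by the number of candidate documents.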

Re: Optimizing a boolean query for 100s of term clauses

2020-06-25 Thread Toke Eskildsen
On Wed, 2020-06-24 at 13:46 -0400, Alex K wrote:
> My implementation isn't specific to any particular dataset or access
> pattern (i.e. infinite vs. subset).
Without a clearly defined use case, I would say that the sequential scan approach is not the right one: as these things go, someone will

Re: Optimizing a boolean query for 100s of term clauses

2020-06-25 Thread Tommaso Teofili
Hi Alex, I worked on a similar problem directly in Lucene (within the Anserini toolkit), using LSH fingerprints of tokenized feature-vector values. You can find the code at [1], and some information on the Anserini documentation page [2] and in a short preprint [3]. As a side note, my current thinking is
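The LSH-fingerprint idea Tommaso mentions hashes a feature vector into a small set of tokens that can be indexed and matched as ordinary terms. A minimal random-hyperplane sketch of the idea follows; this is a generic illustration, not the Anserini implementation, and `band_size` and the token format are arbitrary choices:

```python
import random

def lsh_fingerprint(vec, hyperplanes, band_size=4):
    """Sign of vec's dot product with each hyperplane gives one bit;
    bands of bits become indexable tokens (terms)."""
    bits = [int(sum(a * b for a, b in zip(vec, h)) >= 0) for h in hyperplanes]
    return ["band%d_%s" % (i // band_size, "".join(map(str, bits[i:i + band_size])))
            for i in range(0, len(bits), band_size)]

# 8 random hyperplanes in 3-d -> 8 bits -> 2 tokens per vector.
random.seed(0)
hyperplanes = [[random.gauss(0, 1) for _ in range(3)] for _ in range(8)]
```

Vectors separated by a small angle tend to share band tokens, so a plain boolean term query over the tokens retrieves approximate neighbors using the ordinary inverted index.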

SynonymGraphFilter & fuzzy terms

2020-06-25 Thread Petra Staub
I am using the classic query parser in combination with the SynonymGraphFilter. This works fine (synonyms get expanded), but I noticed that it is not possible to generate fuzzy query terms, i.e. to query expanded synonym terms with an edit distance. Is there a possibility to achieve this? Example: S