Hi Tommaso, thanks for the input and links! I'll add your paper to my
literature review.
So far I've seen very promising results from modifying the TermInSetQuery.
It was pretty simple to keep a map of `doc id -> matched term count` and
then only evaluate the exact similarity on the top k doc ids.
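A minimal sketch of that idea, in plain Python rather than Lucene code. It does not use or modify TermInSetQuery. The in-memory inverted index, the function names, and the choice of Jaccard as the "exact similarity" are all assumptions made for illustration.

```python
from collections import Counter
import heapq


def candidates_by_term_count(index, query_terms, k):
    """Return up to k doc ids ranked by how many query terms they match."""
    counts = Counter()  # doc id -> matched term count
    for term in set(query_terms):
        for doc_id in index.get(term, ()):
            counts[doc_id] += 1
    # Highest count first; ties broken by the smaller doc id.
    return heapq.nsmallest(k, counts, key=lambda d: (-counts[d], d))


def jaccard(a, b):
    """Exact set similarity (assumed here; any exact measure would do)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search(index, docs, query_terms, k, n):
    """Score only the top-k candidates exactly and return the best n."""
    cands = candidates_by_term_count(index, query_terms, k)
    scored = [(jaccard(query_terms, docs[d]), d) for d in cands]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return scored[:n]


# Toy corpus (hypothetical data) and its inverted index.
docs = {0: ["a", "b", "c"], 1: ["a", "b", "x", "y"], 2: ["z"], 3: ["a", "b", "c", "d"]}
index = {}
for doc_id, terms in docs.items():
    for t in terms:
        index.setdefault(t, []).append(doc_id)

print(search(index, docs, ["a", "b", "c"], k=2, n=2))
```

The exact similarity is computed for only k documents instead of every document that shares a term with the query, which is where the saving comes from.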
On Wed, 2020-06-24 at 13:46 -0400, Alex K wrote:
> My implementation isn't specific to any particular dataset or access
> pattern (i.e. infinite vs. subset).
Without a clearly defined use case, I would say that the sequential
scan approach is not the right one: as these things go, someone will
Hi Alex,
I worked on a similar problem directly on Lucene (within the Anserini
toolkit) using LSH fingerprints of tokenized feature vector values.
You can find the code at [1] and some information on the Anserini documentation
page [2] and in a short preprint [3].
As a side note, my current thinking is
I am using the classic query parser in combination with the SynonymGraphFilter.
This works fine (synonyms get expanded), but I noticed that it is not possible
to generate fuzzy query terms, i.e. to query the expanded synonym terms with an
edit distance.
Is there a possibility to achieve this?
Example: S