Hello,

We added deduplication to Solr's QueryComponent by overriding it in a few
places. We use minhashes for fuzzy deduplication, but it will work with
any kind of signature field.

We did this:
* override createMainQuery() so we can add a parameter that selects the
signature field, and a parameter (factor) that controls how many extra
results to fetch. The latter covers the case where you ask for the first
10 rows but most are duplicates, so you would otherwise get back only 5
or so.

* override mergeIds(), where all results are collected. Here we use a
custom ShardDoc object to hold the minhash field. We assume that everything
is sorted by score, so duplicates will arrive next to each other. When
collecting results in the PriorityQueue, we only need to compare the
signature of the current document with that of the previously processed one.
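
To make the merge step more concrete, here is a minimal standalone sketch of the adjacent-signature comparison. It does not use any Solr classes: SignedDoc is a hypothetical stand-in for our custom ShardDoc, and AdjacentDedup.dedup() is an illustrative name, not something that exists in Solr. It assumes the documents from all shards are already merged and sorted by score.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a custom ShardDoc that also carries the signature.
final class SignedDoc {
    final String id;
    final float score;
    final String signature; // may be null if the document has no signature

    SignedDoc(String id, float score, String signature) {
        this.id = id;
        this.score = score;
        this.signature = signature;
    }
}

final class AdjacentDedup {
    // Expects docs merged from all shards and sorted by score, best first.
    // Keeps at most `rows` docs, dropping a doc when its signature equals
    // the signature of the doc processed immediately before it.
    static List<SignedDoc> dedup(List<SignedDoc> sortedByScore, int rows) {
        List<SignedDoc> kept = new ArrayList<>();
        String previous = null;
        for (SignedDoc doc : sortedByScore) {
            if (kept.size() == rows) {
                break;
            }
            // Docs without a signature are never treated as duplicates.
            boolean duplicate = doc.signature != null && doc.signature.equals(previous);
            if (!duplicate) {
                kept.add(doc);
            }
            previous = doc.signature;
        }
        return kept;
    }
}
```

Note that this only catches duplicates that end up next to each other. A document with the same signature further down the list, separated by a different document, is kept.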

This has some obvious flaws: it can be hard to get a full result set as
requested. For our case, we assume (and know from measurements) that users
almost never page deeper than the first page, so we keep a low
overRequestFactor. Only rarely does this leave the initial result set
almost empty because duplicates were removed. I don't know what happens
when you page very deeply; it would probably break something.
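
As a rough illustration of the trade-off with the factor, here is a small sketch of the arithmetic. The class and method names are made up for this mail, and it assumes the factor is simply multiplied with the requested rows, which may differ in detail from what you would implement.

```java
final class OverRequest {
    // Number of rows to ask the shards for, given the rows the user wants
    // and an over-request factor >= 1.0.
    static int rowsToFetch(int rows, double overRequestFactor) {
        if (rows < 0 || overRequestFactor < 1.0) {
            throw new IllegalArgumentException("rows >= 0 and factor >= 1.0 required");
        }
        return (int) Math.ceil(rows * overRequestFactor);
    }

    // Largest share of duplicates among the fetched rows that still leaves
    // a full page: keeping (1 - d) of rows * f must give at least rows,
    // so d <= 1 - 1/f.
    static double maxDuplicateShare(double overRequestFactor) {
        if (overRequestFactor < 1.0) {
            throw new IllegalArgumentException("factor >= 1.0 required");
        }
        return 1.0 - 1.0 / overRequestFactor;
    }
}
```

So with a factor of 2 you fetch 20 rows for a page of 10, and the page stays full as long as no more than half of the fetched rows are dropped as duplicates.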

It also has some benefits: it is quite fast, as it only compares the
signatures of a few top-N results. It can be fuzzy, using minhashes. It
does not depend on documents being routed to specific shards.
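
For the fuzzy part, one common way to compare two minhash signatures is to take the fraction of positions where they agree as an estimate of the Jaccard similarity, and call two documents duplicates above some threshold. The sketch below shows that idea only. The names and the threshold are hypothetical, and it is not necessarily how the comparison has to be done inside the component.

```java
final class MinHashSimilarity {
    // Estimates Jaccard similarity as the share of positions where the
    // two signatures hold the same minimum hash value.
    static double estimate(long[] a, long[] b) {
        if (a.length == 0 || a.length != b.length) {
            throw new IllegalArgumentException("signatures must be non-empty and equally long");
        }
        int equal = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                equal++;
            }
        }
        return (double) equal / a.length;
    }

    // Hypothetical duplicate test: similar enough means duplicate.
    static boolean nearDuplicate(long[] a, long[] b, double threshold) {
        return estimate(a, b) >= threshold;
    }
}
```

With an exact signature field the same check collapses to a plain equality test.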

On Tue, 24 Sep 2024 at 10:50, Dan Rosher <dan.ros...@fruugo.com> wrote:

> Hello Everyone,
>
> We have 3 shards, with skus linked to merchants. We don't currently, but
> could co-locate skus for a specific merchant on the same shard with
> document routing, and then dedup similar skus for the same merchant. But
> similar skus that should be deduped can appear for different merchants
> (and then on different shards), and I know that the collapse filter
> doesn't work across shards.
>
> Has anyone else come across a similar issue (I doubt this is unique)? I
> was wondering how they dealt with deduping documents on different shards.
>
> Many thanks,
> Dan
