Hello,

We added deduplication to Solr's QueryComponent by overriding it in a few places. We use minhashes for fuzzy deduplication, but it will work with any kind of signature field.

We did this:

* Override createMainQuery() so we can add one parameter that names the signature field, and a second parameter (a factor) that controls how many extra results to fetch. The factor covers the case where you ask for the first 10 rows but most of them are duplicates, so you would otherwise end up with only 5 results.
* Override mergeIds(), where all results are collected. Here we use a custom ShardDoc object to hold the minhash field. We assume that everything is sorted by score, so duplicates arrive one after another. When collecting results in the PriorityQueue, we only need to compare the signature of the current document with that of the previously processed one.

This has some obvious flaws: it can be hard to return a full result set of the requested size. In our case we assume (and know from measurements) that users almost never page beyond the first page, so we keep a low overRequestFactor. Only rarely does this leave the first page almost empty because most of the initial results were removed as duplicates. I don't know what happens when you page very deeply; it would probably break something.

It also has some benefits: it is quite fast, since it only compares the signatures of the top N results; it can be fuzzy when using minhashes; and it does not depend on documents being routed to specific shards.

On Tue 24 Sep 2024 at 10:50, Dan Rosher <dan.ros...@fruugo.com> wrote:

> Hello Everyone,
>
> We have 3 shards, with skus linked to merchants. We don't currently, but
> could, co-locate skus for a specific merchant on the same shard with
> document routing, and then dedup similar skus for the same merchant. But
> similar skus that should be deduped can appear for different merchants
> (and then on different shards), and I know that the collapse filter
> doesn't work across shards.
>
> Has anyone else come across a similar issue (I doubt this is unique)? I
> was wondering how they dealt with deduping documents on different shards.
>
> Many thanks,
> Dan
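
For anyone who wants to try the idea outside Solr first, here is a minimal standalone sketch of the merge-time check described in the reply: over-fetch by a factor, walk the results in score order, and drop a document when its signature is close enough to the previously processed one. It is not our actual override. SignedDoc, AdjacentDedup, similarity(), rowsToFetch() and the threshold parameter are hypothetical names made up for this example and are not Solr APIs; the real change sits inside mergeIds() and works on ShardDoc objects as they go into the PriorityQueue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the custom ShardDoc: id, score and minhash signature.
final class SignedDoc {
    final String id;
    final float score;
    final long[] signature;

    SignedDoc(String id, float score, long[] signature) {
        this.id = id;
        this.score = score;
        this.signature = signature;
    }
}

final class AdjacentDedup {
    // Estimated Jaccard similarity: the fraction of minhash slots that agree.
    static double similarity(long[] a, long[] b) {
        if (a.length == 0 || a.length != b.length) {
            return 0.0;
        }
        int same = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                same++;
            }
        }
        return (double) same / a.length;
    }

    // Number of rows to request so that dropped duplicates can be back-filled.
    static int rowsToFetch(int rows, double overRequestFactor) {
        return (int) Math.ceil(rows * overRequestFactor);
    }

    // Input must already be sorted by score, descending. Each document is compared
    // only with the previously processed one (an assumption of this sketch), so
    // duplicates that are not adjacent in the ranking are not detected.
    static List<SignedDoc> dedup(List<SignedDoc> sortedByScore, int rows, double threshold) {
        List<SignedDoc> kept = new ArrayList<>();
        SignedDoc previous = null;
        for (SignedDoc doc : sortedByScore) {
            if (kept.size() >= rows) {
                break;
            }
            boolean duplicate = previous != null
                    && similarity(previous.signature, doc.signature) >= threshold;
            if (!duplicate) {
                kept.add(doc);
            }
            previous = doc;
        }
        return kept;
    }
}
```

With rows=10 and an overRequestFactor of 1.5 you would request 15 rows and then call dedup() with rows=10. Setting the threshold to 1.0 turns this into exact-signature matching, which is how it would work with a non-fuzzy signature field.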