Unfortunately, in my architecture I cannot rely on a database or on an updated/created time field. There is a potentially infinite stream of documents with a possibly huge amount of duplication, so avoiding the indexing of duplicate documents should, I suppose, improve performance.
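For what it's worth, one way to skip duplicates client-side without a database is to keep a signature of each document's identity fields and drop anything already seen before it reaches Solr. Below is a minimal sketch of that idea; the field names ("title", "body") and the in-memory set are assumptions for illustration, and for a truly infinite stream the set would need to be replaced by something bounded (e.g. an LRU cache or a Bloom filter, at the cost of occasional misses or false positives):

```python
import hashlib

def content_hash(doc: dict, fields=("title", "body")) -> str:
    """Hash the fields that define document identity (field names are assumed)."""
    h = hashlib.sha256()
    for f in fields:
        h.update(str(doc.get(f, "")).encode("utf-8"))
        h.update(b"\x1f")  # separator so ("ab","c") != ("a","bc")
    return h.hexdigest()

def dedupe_stream(docs, seen=None):
    """Yield only documents whose content hash has not been seen before.

    NOTE: an unbounded set; fine for a sketch, not for an infinite stream.
    """
    seen = set() if seen is None else seen
    for doc in docs:
        sig = content_hash(doc)
        if sig not in seen:
            seen.add(sig)
            yield doc

docs = [
    {"title": "a", "body": "x"},
    {"title": "a", "body": "x"},  # duplicate, filtered out
    {"title": "b", "body": "y"},
]
unique = list(dedupe_stream(docs))  # 2 documents survive
```

This keeps the duplicate detection in indexing code, in the spirit of Dave's suggestion, while avoiding the need for a timestamp field.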
On Fri, 5 Aug 2022 at 01:10, Dave <hastings.recurs...@gmail.com> wrote:

> ——
>
> At this point it would be interesting to see how this Processor would
> increase the indexing performance when you have many duplicates
>
> When it comes to indexing performance with duplicates, there isn't any
> difference from indexing a new document: the original is marked as
> deleted and the new one replaces it. An update isn't a real thing; the
> first operation is trivially fast, the second is as fast as indexing,
> and Solr will manage the segments as needed when it determines to do
> so. Your best bet is to manage this in your indexing code. Have an
> updated/created time field and, when indexing, only process the
> documents that fit your automated schedule against such fields. With a
> database this takes about five minutes to write into your indexer, and
> I can promise it will be faster than trying to use a built-in Solr
> operation to figure it out for you.
>
> If I'm wrong I would love to know, but indexing code logic will always
> be faster than relying on a built-in server function for these sorts
> of things.
>
> > On Aug 4, 2022, at 6:41 PM, Vincenzo D'Amore <v.dam...@gmail.com> wrote:
> >
> > At this point it would be interesting to see how this Processor would
> > increase the indexing performance when you have many duplicates

--
Vincenzo D'Amore