Unfortunately, in my architecture I cannot rely on a database or on an
updated/created time field. There is a potentially infinite stream of
documents with a possibly huge amount of duplication, so avoiding the
indexing of the duplicate documents should (I suppose) improve performance.
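A minimal sketch of that client-side deduplication idea: hash the document content and skip anything already seen before it ever reaches Solr. The field names and the in-memory set are assumptions; for a truly unbounded stream you would need a Bloom filter or an external key store instead of a plain set.

```python
import hashlib

# In-memory record of content hashes already indexed (assumption:
# for an infinite stream this would be a Bloom filter / external store).
seen = set()

def should_index(doc: dict) -> bool:
    """Return True only the first time a given document content is seen."""
    # Hash the fields that define "sameness" (field names are assumptions).
    key = hashlib.sha256(
        (doc.get("title", "") + "|" + doc.get("body", "")).encode("utf-8")
    ).hexdigest()
    if key in seen:
        return False
    seen.add(key)
    return True
```

This mirrors what Solr's SignatureUpdateProcessorFactory does server-side, but moves the cost out of the update chain entirely.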

On Fri, 5 Aug 2022 at 01:10, Dave <hastings.recurs...@gmail.com> wrote:

> ——
>
> At this point it would be interesting to see how this Processor would
> increase the indexing performance when you have many duplicates
>
> - When it comes to indexing performance with duplicates, there isn’t any
> difference from a new document: the original is marked as deleted and the
> new one replaces it. Update isn’t a real thing; the first operation is
> pretty much a joke speed-wise, the second is as fast as indexing, and Solr
> will manage the segments as needed when it determines to do so. Your best
> bet is to manage this code-wise: have an updated/created time field and,
> when indexing, only run on the documents that fit your automated schedule
> against such fields. With a database this takes about 5 minutes to write
> into your indexer, and I can promise it will be faster than trying to use
> a built-in Solr operation to figure it out for you.
>
> If I’m wrong I would love to know, but indexing code logic will always be
> faster than relying on a built-in server function for these sorts of
> things.
>
> > On Aug 4, 2022, at 6:41 PM, Vincenzo D'Amore <v.dam...@gmail.com> wrote:
> >
> >
> > At this point it would be interesting to see how this Processor would
> > increase the indexing performance when you have many duplicates
>
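The incremental approach Dave describes above can be sketched as follows. It assumes the documents carry an `updated` timestamp field and come from a queryable source; both the field name and the row shape are assumptions, since the original poster's architecture has neither.

```python
import datetime

def docs_to_index(rows, last_run: datetime.datetime):
    """Yield only the rows created/updated since the previous indexing run."""
    for row in rows:
        # 'updated' is the assumed name of the created/updated time field.
        if row["updated"] > last_run:
            yield row
```

In a database-backed setup the same filter would normally be pushed into the query itself (e.g. `WHERE updated > :last_run`), so only the changed rows are ever fetched and sent to Solr.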
-- 
Vincenzo D'Amore
