Hi All,

I'm indexing ~10,000 documents per day, but since I'm getting a lot of exact duplicates (100% identical document content), I want to check the content before indexing.

My idea is to create a checksum of each document's content and store it with the document in the index. Before indexing a new document, I would compare its checksum with the ones already in the index.
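
A minimal sketch of this idea, assuming a SHA-256 digest as the checksum and a plain in-memory dictionary standing in for the real index. The names `content_checksum` and `DedupIndex` are hypothetical and only for illustration; they are not part of any indexing library.

```python
import hashlib


def content_checksum(text: str) -> str:
    """Return the hex SHA-256 digest of the document content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class DedupIndex:
    """Hypothetical stand-in for a real index: documents keyed by checksum."""

    def __init__(self) -> None:
        self._docs = {}  # checksum -> document content

    def add_if_new(self, text: str) -> bool:
        """Store the document unless identical content is already present."""
        checksum = content_checksum(text)
        if checksum in self._docs:
            return False  # exact duplicate, skip indexing
        self._docs[checksum] = text
        return True

    def __len__(self) -> int:
        return len(self._docs)


index = DedupIndex()
incoming = ["first document", "second document", "first document"]
results = [index.add_if_new(doc) for doc in incoming]
print(results)
```

In a real index, the checksum would presumably be stored as an untokenized field on each document and looked up with an exact-match query before adding, rather than kept in memory as it is here.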

Is that a good idea? Does anyone have experience with this method? Are there any tools available?

Thank you and kind regards

Hannes
