Hi All,
I'm indexing ~10000 documents per day, but since I'm getting a lot of
real duplicates (100% identical document content), I want to check the
content before indexing.
My idea is to create a checksum of each document's content and store it
with the document in the index. Before indexing a new document, I will
compare the new document's checksum with the ones already in the index.
Is that a good idea? Does anyone have experience with that method? Are
any tools available?
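
To make the idea concrete, here is a minimal sketch of what I have in
mind. It is only an illustration under some assumptions: the names
content_checksum and DedupIndexer are made up, SHA-256 is just one
possible checksum, and an in-memory set stands in for the lookup
against the real index.

```python
import hashlib


def content_checksum(text: str) -> str:
    """Return the SHA-256 hex digest of the document content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class DedupIndexer:
    """Hypothetical stand-in for the real index (assumption, not a real API)."""

    def __init__(self) -> None:
        self.seen = set()      # checksums of documents already indexed
        self.documents = []    # the "index": one dict per stored document

    def add(self, text: str) -> bool:
        """Index the document unless identical content is already present.

        Returns True if the document was added, False if it was skipped.
        """
        checksum = content_checksum(text)
        if checksum in self.seen:
            return False  # exact duplicate, do not index again
        self.seen.add(checksum)
        # Store the checksum alongside the content, as described above.
        self.documents.append({"checksum": checksum, "content": text})
        return True


indexer = DedupIndexer()
for doc in ["first document", "second document", "first document"]:
    status = "indexed" if indexer.add(doc) else "skipped duplicate"
    print(status, "-", doc)
```

In a real setup the set lookup would presumably become a query on the
stored checksum field. Note that this only catches byte-for-byte
identical content; any difference in whitespace or encoding produces a
different checksum.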
Thank you and kind regards
Hannes