Checking for duplicates inside index

Hannes Carl Meyer Mon, 22 May 2006 14:47:05 -0700

Hi All,

I'm indexing ~10000 documents per day, but since I'm getting a lot of real duplicates (100% the same document content), I want to check the content before indexing...

My idea is to create a checksum of each document's content and store it with the document inside the index. Before indexing a new document, I will compare the new document's checksum with the ones in the index.
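
To make the idea concrete, here is a rough sketch in Python. It is only an illustration under some assumptions: SHA-256 is one possible choice of checksum, the in-memory set stands in for looking up a stored checksum field in the real index, and the names `content_checksum` and `DedupIndexer` are made up for this example.

```python
import hashlib


def content_checksum(content: str) -> str:
    """Return a hex SHA-256 digest of the document content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class DedupIndexer:
    """Toy stand-in for an index that stores one checksum per document."""

    def __init__(self):
        # Stands in for querying a stored checksum field in the real index.
        self.checksums = set()
        # Stands in for the indexed documents themselves.
        self.documents = []

    def add(self, content: str) -> bool:
        """Index the content unless an identical document is already stored."""
        digest = content_checksum(content)
        if digest in self.checksums:
            return False  # exact duplicate, skip indexing
        self.checksums.add(digest)
        self.documents.append({"checksum": digest, "content": content})
        return True


indexer = DedupIndexer()
print(indexer.add("first document"))   # True
print(indexer.add("first document"))   # False
```

This only catches byte-for-byte identical content; documents that differ by even one character get a different checksum.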

Is that a good idea? Does anyone have experience with that method? Are there any tools available?


Thank you and kind regards

Hannes

