Hi David,

>> I would like to poll the community's opinion on good strategies for
>> identifying duplicate documents in a lucene index.

Do you mean 100% duplicates or some kind of similarity?

>> Obviously the brute force method of pairwise compares would take forever.
>> I have tried grouping sentences using their hashCodes() and then doing a
>> pairwise compare between sentences that have the same hashCode, but even
>> with a 1GB heap I ran out of memory after comparing 200k sentences.

If you are only after 100% duplicates, you are on the right track with the
hash code. You could encode the hash code of each sentence into the index by
adding it as a separate field - your analyzer must index numbers for this!
Then iterate over all terms of that field, retrieving a document enumerator
for each term; wherever a term matches more than one document, do the
pairwise comparison as usual. This way, you should never need to compare more
than a few documents.
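Roughly like the sketch below - untested, and only meant to illustrate the
idea. The field names "sentenceHash" and "sentence" are made up for the
example; I assume the hash was indexed as a single untokenized term per
document and that the sentence text is stored so it can be re-checked, and I
use a small set per hash bucket instead of an explicit pairwise loop:

import java.io.IOException;
import java.util.BitSet;
import java.util.HashSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class DuplicateFinder {

    /**
     * Flags every document whose "sentence" text is identical to that of
     * an earlier document sharing the same "sentenceHash" term; the first
     * occurrence of each sentence stays unflagged.
     */
    public static BitSet findDuplicates(IndexReader reader) throws IOException {
        BitSet duplicates = new BitSet(reader.maxDoc());
        TermEnum terms = reader.terms(new Term("sentenceHash", ""));
        try {
            do {
                Term term = terms.term();
                if (term == null || !"sentenceHash".equals(term.field())) {
                    break;                            // ran past the hash field
                }
                if (terms.docFreq() > 1) {            // shared hash: collision or duplicate
                    HashSet seen = new HashSet();     // sentence texts already kept
                    TermDocs docs = reader.termDocs(term);
                    try {
                        while (docs.next()) {
                            int doc = docs.doc();
                            String sentence = reader.document(doc).get("sentence");
                            if (sentence == null) {
                                continue;             // text not stored, cannot verify
                            }
                            if (!seen.add(sentence)) {
                                duplicates.set(doc);  // identical to an earlier sentence
                            }
                        }
                    } finally {
                        docs.close();
                    }
                }
            } while (terms.next());
        } finally {
            terms.close();
        }
        return duplicates;
    }
}

Since the BitSet marks only the redundant copies, you could invert it and
return it from the bits() method of a small Filter subclass to keep the
duplicates out of your search results.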
All the best,

Karsten

--
Dr.-Ing. Karsten Konrad
Research & Development

DACOS Software GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken

http://www.dacos.com
Tel: ++49/ (0) 681 - 302 64834
Fax: ++49/ (0) 681 - 302 64827

-----Original Message-----
From: Dave Kor [mailto:[EMAIL PROTECTED]
Sent: Sunday, June 12, 2005 16:38
To: java-user@lucene.apache.org
Subject: Ideas Needed - Finding Duplicate Documents

Hi,

I would like to poll the community's opinion on good strategies for
identifying duplicate documents in a lucene index.

You see, I have an index containing roughly 25 million lucene documents. My
task requires me to work at sentence level, so each lucene document actually
contains exactly one sentence. The issue I have right now is that certain
sentences are sometimes duplicated, and I'd like to be able to identify the
duplicates as a BitSet so that I can filter them away in my search.

Obviously the brute force method of pairwise compares would take forever. I
have tried grouping sentences using their hashCodes() and then doing a
pairwise compare between sentences that have the same hashCode, but even with
a 1GB heap I ran out of memory after comparing 200k sentences.

Any other ideas?

Regards,
Dave Kor.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------