Dave, Can you define exactly what you consider "duplicate sentences"? Is it the same exact string, or the same words in the same order, or the same words in any order, etc?
If you can normalize each sentence first, so two "duplicate" sentences are always the exact same string, then you should be able to fit a hash of each sentence in RAM. By hash, I mean a cryptographic-strength hash, such as MD5. The big difference between an MD5 hash and a hashCode() hash is you won't get collisions with the MD5 hash. (More accurately, the probability of two different objects having the same MD5 hash is astronomically low). Each MD5 takes 16 bytes, so 16 * 25M sentences = 400MB of memory. You can probably get away with a smaller hash in your case, maybe half of an MD5 (64 bits), saving you RAM. You could also store this MD5 in your lucene index as a keyword field (after converting the 16-byte MD5 value into a 32-byte hex string, for example), providing a fast way to find duplicates at search time. If you can give more details on your requirements, people in this list can probably come up with some pretty good solutions. -chris On 6/12/05, Dave Kor <[EMAIL PROTECTED]> wrote: > Hi, > > I would like to poll the community's opinion on good strategies for > identifying > duplicate documents in a lucene index. > > You see, I have an index containing roughly 25 million lucene documents. My > task > requires me to work at sentence level so each lucene document actually > contains > exactly one sentence. The issue I have right now is that sometimes, certain > sentences are duplicated and I'ld like to be able to identify them as a BitSet > so that I can filter away these duplicates in my search. > > Obviously the brute force method of pairwise compares would take forever. I > have > tried grouping sentences using their hashCodes() and then do a pairwise > compare > between sentences that has the same hashCode, but even with a 1GB heap I ran > out of memory after comparing 200k sentences. > > Any other ideas? > > > Regards > Dave Kor. > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]