Thanks for the quick reply, Chris. Yes, when I say "duplicate" sentences, they are exact copies of the same string.
The MD5 hash is a good idea; I wish I had thought of it earlier, as it would
have saved me a lot of trouble. Right now it is not feasible to re-index,
because indexing is a very slow and CPU-intensive task for me. I'm adding
part-of-speech, chunk, named-entity and coreference information as I index,
which means it takes 4 separate servers and 4-5 days of processing to create a
new index. And as far as I know, you can't change the index once it's created.
Am I correct? Any other ideas that don't require me to re-index the whole
thing?

Quoting Chris Lamprecht <[EMAIL PROTECTED]>:

> Dave,
>
> Can you define exactly what you consider "duplicate sentences"? Is it
> the same exact string, or the same words in the same order, or the
> same words in any order, etc.?
>
> If you can normalize each sentence first, so two "duplicate" sentences
> are always the exact same string, then you should be able to fit a hash
> of each sentence in RAM. By hash, I mean a cryptographic-strength hash,
> such as MD5. The big difference between an MD5 hash and a hashCode()
> hash is that you won't get collisions with the MD5 hash. (More
> accurately, the probability of two different objects having the same
> MD5 hash is astronomically low.) Each MD5 takes 16 bytes, so
> 16 bytes * 25M sentences = 400MB of memory. You can probably get away
> with a smaller hash in your case, maybe half of an MD5 (64 bits),
> saving you RAM. You could also store this MD5 in your Lucene index as a
> keyword field (after converting the 16-byte MD5 value into a 32-byte
> hex string, for example), providing a fast way to find duplicates at
> search time.
>
> If you can give more details on your requirements, people on this list
> can probably come up with some pretty good solutions.
>
> -chris
>
> On 6/12/05, Dave Kor <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I would like to poll the community's opinion on good strategies for
> > identifying duplicate documents in a Lucene index.
> >
> > You see, I have an index containing roughly 25 million Lucene
> > documents. My task requires me to work at the sentence level, so
> > each Lucene document actually contains exactly one sentence. The
> > issue I have right now is that sometimes certain sentences are
> > duplicated, and I'd like to be able to identify them as a BitSet so
> > that I can filter away these duplicates in my search.
> >
> > Obviously the brute-force method of pairwise comparisons would take
> > forever. I have tried grouping sentences by their hashCode() and
> > then doing a pairwise compare between sentences that have the same
> > hashCode, but even with a 1GB heap I ran out of memory after
> > comparing 200k sentences.
> >
> > Any other ideas?
> >
> > Regards,
> > Dave Kor.
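For concreteness, here is a minimal sketch (in Java, against the Lucene 1.x
API of the time) of a no-re-index variant of Chris's idea: make one pass over
the existing index, hash each sentence, and mark repeats in a BitSet. It
assumes the sentence text was stored in the index under a field called
"sentence" (the field name is illustrative; if the text was only indexed, not
stored, this won't work), and it uses the 64-bit half-MD5 Chris mentioned to
halve the RAM cost.

import java.security.MessageDigest;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class DuplicateMarker {

    // One pass over the existing index: hash each stored sentence and
    // set a bit for every document whose sentence was already seen.
    public static BitSet findDuplicates(String indexPath) throws Exception {
        IndexReader reader = IndexReader.open(indexPath);
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            Set<Long> seen = new HashSet<Long>();      // 64-bit half-MD5s seen so far
            BitSet duplicates = new BitSet(reader.maxDoc());
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) continue;
                Document doc = reader.document(i);
                String sentence = doc.get("sentence"); // illustrative field name;
                if (sentence == null) continue;        // must be a *stored* field
                byte[] digest = md5.digest(sentence.getBytes("UTF-8"));
                // Fold the first 8 of the 16 digest bytes into a long
                // (the "half of an MD5" trick to halve the RAM cost).
                long key = 0L;
                for (int b = 0; b < 8; b++) {
                    key = (key << 8) | (digest[b] & 0xffL);
                }
                if (!seen.add(Long.valueOf(key))) {
                    duplicates.set(i);                 // an earlier doc had this sentence
                }
            }
            return duplicates;
        } finally {
            reader.close();
        }
    }
}

One caveat: a HashSet of 25M Longs still carries real per-entry overhead, so
on a 1GB heap you may need to swap it for a more compact structure (e.g. a
sorted long[] of hashes). The overall shape stays the same, and nothing here
modifies or rewrites the index itself.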