Thanks for the quick reply, Chris. Yes, when I say "duplicate" sentences, they are exact copies of the same string.
The MD5 hash is a good idea; I wish I had thought of it earlier, as it would
have saved me a lot of trouble. Right now it is not feasible to re-index,
because indexing is a very slow and CPU-intensive task for me. I'm adding
part-of-speech, chunk, named-entity and coreference information as I index,
which means it takes 4 separate servers and 4-5 days of processing to create a
new index. And as far as I know, you can't change the index once it's created.
Am I correct? Any other ideas that don't require me to re-index the whole
thing?

Quoting Chris Lamprecht <[EMAIL PROTECTED]>:

> Dave,
>
> Can you define exactly what you consider "duplicate sentences"? Is it
> the same exact string, or the same words in the same order, or the
> same words in any order, etc.?
>
> If you can normalize each sentence first, so two "duplicate" sentences
> are always the exact same string, then you should be able to fit a hash
> of each sentence in RAM. By hash, I mean a cryptographic-strength hash,
> such as MD5. The big difference between an MD5 hash and a hashCode()
> hash is that you won't get collisions with the MD5 hash. (More
> accurately, the probability of two different objects having the same
> MD5 hash is astronomically low.) Each MD5 takes 16 bytes, so
> 16 bytes * 25M sentences = 400MB of memory. You can probably get away
> with a smaller hash in your case, maybe half of an MD5 (64 bits),
> saving you RAM. You could also store this MD5 in your Lucene index as a
> keyword field (after converting the 16-byte MD5 value into a 32-byte
> hex string, for example), providing a fast way to find duplicates at
> search time.
>
> If you can give more details on your requirements, people on this list
> can probably come up with some pretty good solutions.
>
> -chris
>
> On 6/12/05, Dave Kor <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I would like to poll the community's opinion on good strategies for
> > identifying duplicate documents in a Lucene index.
> >
> > You see, I have an index containing roughly 25 million Lucene
> > documents. My task requires me to work at the sentence level, so
> > each Lucene document actually contains exactly one sentence. The
> > issue I have right now is that sometimes certain sentences are
> > duplicated, and I'd like to be able to identify them as a BitSet so
> > that I can filter away these duplicates in my search.
> >
> > Obviously the brute-force method of pairwise comparisons would take
> > forever. I have tried grouping sentences by their hashCode() and
> > then doing a pairwise compare between sentences that have the same
> > hashCode, but even with a 1GB heap I ran out of memory after
> > comparing 200k sentences.
> >
> > Any other ideas?
> >
> > Regards,
> > Dave Kor.
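For concreteness, here is a minimal sketch (in Java, against the Lucene 1.x
API of the time) of a no-re-index variant of Chris's idea: make one pass over
the existing index, hash each sentence, and mark repeats in a BitSet. It
assumes the sentence text was stored in the index under a field called
"sentence" (the field name is illustrative; if the text was only indexed, not
stored, this won't work), and it uses the 64-bit half-MD5 Chris mentioned to
halve the RAM cost.

import java.security.MessageDigest;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class DuplicateMarker {

    // One pass over the existing index: hash each stored sentence and
    // set a bit for every document whose sentence was already seen.
    public static BitSet findDuplicates(String indexPath) throws Exception {
        IndexReader reader = IndexReader.open(indexPath);
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            Set<Long> seen = new HashSet<Long>();      // 64-bit half-MD5s seen so far
            BitSet duplicates = new BitSet(reader.maxDoc());
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) continue;
                Document doc = reader.document(i);
                String sentence = doc.get("sentence"); // illustrative field name;
                if (sentence == null) continue;        // must be a *stored* field
                byte[] digest = md5.digest(sentence.getBytes("UTF-8"));
                // Fold the first 8 of the 16 digest bytes into a long
                // (the "half of an MD5" trick to halve the RAM cost).
                long key = 0L;
                for (int b = 0; b < 8; b++) {
                    key = (key << 8) | (digest[b] & 0xffL);
                }
                if (!seen.add(Long.valueOf(key))) {
                    duplicates.set(i);                 // an earlier doc had this sentence
                }
            }
            return duplicates;
        } finally {
            reader.close();
        }
    }
}

One caveat: a HashSet of 25M Longs still carries real per-entry overhead, so
on a 1GB heap you may need to swap it for a more compact structure (e.g. a
sorted long[] of hashes). The overall shape stays the same, and nothing here
modifies or rewrites the index itself.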