Have you tried comparing TermVectors?
I would expect them, or some adjustment of them, to let the comparison focus on the "important terms" (say, 100-200 terms per document) and thus allow a much more reasonable computation.
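Something along these lines, perhaps (an untested sketch against the TermFreqVector API; it assumes the field was indexed with term vectors enabled, and "sentence" as the field name is just a guess):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Jaccard overlap between the term sets of two documents' term vectors.
// A score near 1.0 flags near-duplicate sentences worth an exact compare.
public class TermVectorOverlap {
    public static double overlap(IndexReader reader, int docA, int docB)
            throws IOException {
        TermFreqVector a = reader.getTermFreqVector(docA, "sentence");
        TermFreqVector b = reader.getTermFreqVector(docB, "sentence");
        if (a == null || b == null) return 0.0;

        Set termsA = new HashSet(Arrays.asList(a.getTerms()));
        String[] termsB = b.getTerms();
        int common = 0;
        for (int i = 0; i < termsB.length; i++) {
            if (termsA.contains(termsB[i])) common++;
        }
        int union = termsA.size() + termsB.length - common;
        return union == 0 ? 1.0 : (double) common / union;
    }
}

Only pairs scoring above some threshold would then need a full string comparison.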

paul


On 12 June 05, at 16:37, Dave Kor wrote:

Hi,

I would like to poll the community's opinion on good strategies for identifying
duplicate documents in a Lucene index.

You see, I have an index containing roughly 25 million Lucene documents. My task requires me to work at the sentence level, so each Lucene document contains exactly one sentence. The issue I have right now is that some sentences are duplicated, and I'd like to identify the duplicates as a BitSet
so that I can filter them away in my search.
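(The filtering side seems straightforward once I have that BitSet; I'm thinking of something like the sketch below, a Filter whose bits() returns the documents a search may match:)

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;
import java.io.IOException;
import java.util.BitSet;

// Given a BitSet of duplicate doc ids, flip it so that only the first
// occurrence of each sentence survives; set bits = documents allowed.
public class DropDuplicatesFilter extends Filter {
    private final BitSet duplicates;

    public DropDuplicatesFilter(BitSet duplicates) {
        this.duplicates = duplicates;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet keep = new BitSet(reader.maxDoc());
        keep.set(0, reader.maxDoc());   // start with everything allowed
        keep.andNot(duplicates);        // then drop the duplicates
        return keep;
    }
}

The filter would then be handed to IndexSearcher.search(query, filter).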

Obviously the brute-force method of pairwise comparisons would take forever. I have tried grouping sentences by their hashCode() and then doing a pairwise compare between sentences that have the same hashCode, but even with a 1GB heap I ran
out of memory after comparing 200k sentences.
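For reference, my grouping pass looked roughly like this ("sentence" is the stored field holding the text); I suspect the HashMap over millions of entries is where the memory goes:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.document.Document;
import java.util.*;

// Pass 1: bucket doc ids by the sentence's hashCode().
// Pass 2: within each bucket, compare stored sentences pairwise and
// mark every occurrence after the first as a duplicate.
public class HashBucketDedup {
    public static BitSet duplicates(IndexReader reader) throws Exception {
        Map buckets = new HashMap();   // Integer(hashCode) -> List of doc ids
        for (int doc = 0; doc < reader.maxDoc(); doc++) {
            if (reader.isDeleted(doc)) continue;
            String s = reader.document(doc).get("sentence");
            if (s == null) continue;
            Integer key = new Integer(s.hashCode());
            List ids = (List) buckets.get(key);
            if (ids == null) buckets.put(key, ids = new ArrayList());
            ids.add(new Integer(doc));
        }

        BitSet dupes = new BitSet(reader.maxDoc());
        for (Iterator it = buckets.values().iterator(); it.hasNext();) {
            List ids = (List) it.next();
            if (ids.size() < 2) continue;            // unique hash, skip
            for (int i = 0; i < ids.size(); i++) {
                int a = ((Integer) ids.get(i)).intValue();
                String sa = reader.document(a).get("sentence");
                for (int j = i + 1; j < ids.size(); j++) {
                    int b = ((Integer) ids.get(j)).intValue();
                    if (dupes.get(b)) continue;      // already marked
                    String sb = reader.document(b).get("sentence");
                    if (sa.equals(sb)) dupes.set(b); // keep first, mark rest
                }
            }
        }
        return dupes;
    }
}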

Any other ideas?


Regards
Dave Kor.



