Re: Checking for duplicates inside index

2006-05-24 Thread Andrzej Bialecki
Hannes Carl Meyer wrote: Ken Krugler wrote: On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote: > I'm indexing ~1 documents per day but since I'm getting a lot of real duplicates (100% the same document content) I want to check the content before indexing... > My idea is t

Re: Checking for duplicates inside index

2006-05-24 Thread Hannes Carl Meyer
Ken Krugler wrote: On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote: > I'm indexing ~1 documents per day but since I'm getting a lot of real duplicates (100% the same document content) I want to check the content before indexing... > My idea is to create a checksum of th

Re: Checking for duplicates inside index

2006-05-23 Thread Yonik Seeley
On 5/23/06, Jimmy the Geek <[EMAIL PROTECTED]> wrote: Or any other suggestions on good ways to prevent duplicates? I am indexing with a field that has a unique ID, so it should be fairly straightforward... Solr does this efficiently: http://www.mail-archive.com/java-user@lucene.apache.org/msg05

RE: Checking for duplicates inside index

2006-05-23 Thread Jimmy the Geek
hat this feature is nice to have. > > Eugene > > > -Original Message- > From: Omar Didi [mailto:[EMAIL PROTECTED] > Sent: Monday, May 22, 2006 6:47 PM > To: java-user@lucene.apache.org > Subject: RE: Checking for duplicates inside index > > you have two

Re: Checking for duplicates inside index

2006-05-22 Thread Ken Krugler
On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote: > I'm indexing ~1 documents per day but since I'm getting a lot of real duplicates (100% the same document content) I want to check the content before indexing... > My idea is to create a checksum of the document's content an

RE: Checking for duplicates inside index

2006-05-22 Thread Eugene Tuan
e.org Subject: RE: Checking for duplicates inside index you have two choices that I can think of: 1- before adding a document, check that it doesn't already exist in the index. You can do this by querying on a unique field, if you have one. 2- you can index all your documents, and once the indexing is do

RE: Checking for duplicates inside index

2006-05-22 Thread Omar Didi
t can help with this) if your index doesn't have a unique key, you need to add one like the one you suggested. -Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: Monday, May 22, 2006 6:05 PM To: java-user@lucene.apache.org Subject: Re: Checking for duplicates inside

Re: Checking for duplicates inside index

2006-05-22 Thread karl wettin
On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote: > I'm indexing ~1 documents per day but since I'm getting a lot of real duplicates (100% the same document content) I want to check the content before indexing... > My idea is to create a checksum of the document's content a

Checking for duplicates inside index

2006-05-22 Thread Hannes Carl Meyer
Hi All, I'm indexing ~1 documents per day but since I'm getting a lot of real duplicates (100% the same document content) I want to check the content before indexing... My idea is to create a checksum of the document's content and store it within the document inside the index, before indexing
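
The checksum idea in the post above could look roughly like the sketch below. It is only an illustration under stated assumptions, not the poster's code and not the Lucene API: SHA-256 is an assumed choice of checksum (the post does not name one), and `DedupIndex`, `content_checksum`, and `add_if_new` are hypothetical names, with a plain in-memory dictionary standing in for the real index lookup on a stored checksum field.

```python
import hashlib


def content_checksum(text: str) -> str:
    """Return a hex digest of the document content.

    SHA-256 is an assumption; any stable checksum would serve the same role.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class DedupIndex:
    """Hypothetical in-memory stand-in for an index keyed by content checksum."""

    def __init__(self):
        # Maps checksum -> document content. A real index would store the
        # checksum as a field and look it up with a query instead.
        self._docs = {}

    def add_if_new(self, text: str) -> bool:
        """Add the document unless identical content is already present.

        Returns True if the document was added, False if it was a duplicate.
        """
        checksum = content_checksum(text)
        if checksum in self._docs:
            return False
        self._docs[checksum] = text
        return True

    def __len__(self) -> int:
        return len(self._docs)


index = DedupIndex()
incoming = ["alpha report", "beta report", "alpha report"]
results = [index.add_if_new(doc) for doc in incoming]
```

Here the third document has exactly the same content as the first, so it is skipped. This only catches byte-for-byte identical content, which matches the "100% the same document content" case described in the post; near-duplicates would need a different technique.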