subject:"Checking for duplicates inside index"

Re: Checking for duplicates inside index

2006-05-24 Thread Andrzej Bialecki

Hannes Carl Meyer wrote: Ken Krugler wrote: On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote: > I'm indexing ~1 documents per day but since I'm getting a lot of real duplicates (100% the same document content) I want to check the content before indexing... > My idea is t

Re: Checking for duplicates inside index

2006-05-24 Thread Hannes Carl Meyer

Ken Krugler wrote: On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote: > I'm indexing ~1 documents per day but since I'm getting a lot of real duplicates (100% the same document content) I want to check the content before indexing... > My idea is to create a checksum of th

Re: Checking for duplicates inside index

2006-05-23 Thread Yonik Seeley

On 5/23/06, Jimmy the Geek <[EMAIL PROTECTED]> wrote: Or any other suggestions on good ways to prevent duplicates? I am indexing with a field that has a unique ID, so it should be fairly straightforward... Solr does this efficiently: http://www.mail-archive.com/java-user@lucene.apache.org/msg05

RE: Checking for duplicates inside index

2006-05-23 Thread Jimmy the Geek

hat this feature is nice to have. > > Eugene > > > -Original Message- > From: Omar Didi [mailto:[EMAIL PROTECTED] > Sent: Monday, May 22, 2006 6:47 PM > To: java-user@lucene.apache.org > Subject: RE: Checking for duplicates inside index > > you have two

Re: Checking for duplicates inside index

2006-05-22 Thread Ken Krugler

On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote: > I'm indexing ~1 documents per day but since I'm getting a lot of real duplicates (100% the same document content) I want to check the content before indexing... > My idea is to create a checksum of the documents content an

RE: Checking for duplicates inside index

2006-05-22 Thread Eugene Tuan

e.org Subject: RE: Checking for duplicates inside index you have two choices that I can think of: 1- before adding a document, check that it doesn't already exist in the index. you can do this by querying on a unique field if you have one. 2- you can index all your documents, and once the indexing is do

RE: Checking for duplicates inside index

2006-05-22 Thread Omar Didi

t can help with this) if your index doesn't have a unique key, you need to add one like the one you suggested. -Original Message- From: karl wettin [mailto:[EMAIL PROTECTED] Sent: Monday, May 22, 2006 6:05 PM To: java-user@lucene.apache.org Subject: Re: Checking for duplicates inside

Re: Checking for duplicates inside index

2006-05-22 Thread karl wettin

On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote: > > I'm indexing ~1 documents per day but since I'm getting a lot of > real duplicates (100% the same document content) I want to check the > content before indexing... > > My idea is to create a checksum of the documents content a

Checking for duplicates inside index

2006-05-22 Thread Hannes Carl Meyer

Hi All, I'm indexing ~1 documents per day but since I'm getting a lot of real duplicates (100% the same document content) I want to check the content before indexing... My idea is to create a checksum of the documents content and store it within document inside the index, before indexing
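The checksum idea from the original post could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the thread: the class name `ChecksumDeduper`, the choice of SHA-256, and the in-memory `HashSet` are all hypothetical. In a real Lucene setup the set would be replaced by a lookup on a checksum field stored with each document.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class ChecksumDeduper {
    // Assumption: this set stands in for the index. With Lucene, the
    // equivalent check would be a query on a stored checksum field.
    private final Set<String> seenChecksums = new HashSet<>();

    /** Returns the hex-encoded SHA-256 digest of the document content. */
    public static String checksum(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Records the checksum and returns true if the content has not been
     * seen before; returns false for an exact duplicate.
     */
    public boolean addIfNew(String content) {
        return seenChecksums.add(checksum(content));
    }

    public static void main(String[] args) {
        ChecksumDeduper deduper = new ChecksumDeduper();
        String[] incoming = {"first document", "second document", "first document"};
        for (String doc : incoming) {
            if (deduper.addIfNew(doc)) {
                System.out.println("index: " + doc);
            } else {
                System.out.println("skip duplicate: " + doc);
            }
        }
    }
}
```

A checksum like this only catches documents whose content is byte-for-byte identical, which matches the "100% the same document content" case described in the post; near-duplicates would need a different technique.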


9 matches
