Hannes Carl Meyer wrote:
Ken Krugler wrote:
On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:
> I'm indexing ~1 documents per day but since I'm getting a lot of
> real duplicates (100% the same document content) I want to check the
> content before indexing...
> My idea is to create a checksum of th
On 5/23/06, Jimmy the Geek <[EMAIL PROTECTED]> wrote:
Or any other suggestions on good ways to prevent duplicates? I am
indexing with a field that has a unique ID, so it should be fairly
straightforward...
Solr does this efficiently:
http://www.mail-archive.com/java-user@lucene.apache.org/msg05
hat this feature is nice to have.
>
> Eugene
>
>
> -----Original Message-----
> From: Omar Didi [mailto:[EMAIL PROTECTED]
> Sent: Monday, May 22, 2006 6:47 PM
> To: java-user@lucene.apache.org
> Subject: RE: Checking for duplicates inside index
>
> you have two
On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:
> I'm indexing ~1 documents per day but since I'm getting a lot of
> real duplicates (100% the same document content) I want to check the
> content before indexing...
> My idea is to create a checksum of the documents content an
e.org
Subject: RE: Checking for duplicates inside index
you have two choices that I can think of:
1- before adding a document, check that it doesn't already exist in the index. You
can do this by querying on a unique field if you have one.
2- you can index all your documents, and once the indexing is do
t can help with this)
if your index doesn't have a unique key, you need to add one like the one you
suggested.
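
A minimal sketch of the first choice (check before adding), assuming a plain in-memory map as a stand-in for the index. The names `UniqueKeyIndex` and `addIfAbsent` are hypothetical and are not part of Lucene; a real index would run a query on the unique field instead of a map lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an index keyed by a unique field (not Lucene's API).
class UniqueKeyIndex {
    // Maps the unique key of each stored document to its content.
    private final Map<String, String> docsByKey = new HashMap<>();

    // Adds the document only if no document with the same key is stored.
    // Returns true if the document was added, false if it was a duplicate.
    boolean addIfAbsent(String uniqueKey, String content) {
        if (docsByKey.containsKey(uniqueKey)) {
            return false; // a document with this key already exists, so skip it
        }
        docsByKey.put(uniqueKey, content);
        return true;
    }

    // Number of documents currently stored.
    int size() {
        return docsByKey.size();
    }
}
```

Calling `addIfAbsent` twice with the same key stores the document once and reports the second call as a duplicate.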
-----Original Message-----
From: karl wettin [mailto:[EMAIL PROTECTED]
Sent: Monday, May 22, 2006 6:05 PM
To: java-user@lucene.apache.org
Subject: Re: Checking for duplicates inside
On Mon, 2006-05-22 at 23:42 +0200, Hannes Carl Meyer wrote:
>
> I'm indexing ~1 documents per day but since I'm getting a lot of
> real duplicates (100% the same document content) I want to check the
> content before indexing...
>
> My idea is to create a checksum of the documents content a
Hi All,
I'm indexing ~1 documents per day but since I'm getting a lot of
real duplicates (100% the same document content) I want to check the
content before indexing...
My idea is to create a checksum of the document's content and store it
within the document inside the index, before indexing
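
The checksum part of this idea could be sketched roughly as below. This is only an illustration under assumptions: the message does not say which checksum is meant, so SHA-256 is picked here, and the class and method names (`ContentChecksum`, `sha256Hex`) are hypothetical. The resulting hex string is what would be stored in a field of the document and looked up before indexing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes a content checksum usable as a duplicate key.
class ContentChecksum {
    // Returns the SHA-256 digest of the content as a lowercase hex string.
    // SHA-256 is an assumption; the original message names no algorithm.
    static String sha256Hex(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Two documents with 100% identical content produce the same checksum, so an exact-match lookup on that field is enough to detect them; documents that differ by even one character get different checksums and are not caught.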