Erick:

Thanks for your suggestion. I think another solution would be to keep a list of keywords that uniquely identify each document in a database, and to search for those keywords before adding a new document. Since querying a database is fast, this probably wouldn't cost much time, but it would require maintaining a database while indexing. I was just wondering whether Lucene offers an interface for identifying duplicates. I would think identifying duplicate URLs is a common need when indexing the web.

Best Wishes.

----- Original Message -----
From: "Erick Erickson" <erickerick...@gmail.com>
To: <java-user@lucene.apache.org>
Sent: Saturday, March 07, 2009 10:58 PM
Subject: Re: Search while indexing
> First, you'll probably want to search the user list archive for this issue,
> as it's been discussed and you'll find more information than I can remember
> off the top of my head. That said:
>
> 1> changes to an index are not visible until you reopen the reader. You
>    probably have to flush the writer in the meantime. And this will
>    be costly to do for every document.
>
> 2> How do you identify duplicates? If it's a short enough signature,
>    you could consider keeping an in-memory list and check that
>    while indexing. If you needed to update your index you could
>    simply use TermEnum/TermDocs to read all the values into
>    memory before adding to it.
>
> 3> You could consider using some kind of calculated signature of
>    the whole file for your key, but that may not suit your app.
>
> Best
> Erick
>
> On Sat, Mar 7, 2009 at 12:21 AM, sonfon <son...@gmail.com> wrote:
>
>> Dear All,
>> I'm considering building an index for my application with Lucene.
>> However, the document sources I'm going to index contain many duplicates,
>> so before adding a document to an IndexWriter, I'd like to search the index
>> first to see whether a copy of the same document has already been added. I
>> used an IndexSearcher to search the same Dir while the IndexWriter was
>> writing to it. However, it seems that the IndexSearcher returned no results,
>> though I'm sure duplicate copies had already been indexed. After the
>> indexing procedure finishes, I do get the search results, so I'm sure the
>> code itself isn't wrong.
>> Could anyone offer some help? Some example code would be appreciated.
>> Best Wishes.
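
For reference, point 2> in the quoted reply (keep an in-memory list of signatures and check it while indexing) could be sketched roughly as below. This is only a sketch under assumptions, not Lucene API: `DuplicateFilter`, `markIfNew`, and `signature` are hypothetical names, SHA-1 is just one possible choice of signature, and the real `IndexWriter.addDocument` call is left as a comment so that the example runs with the JDK alone.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: remembers a signature for every key already seen. */
class DuplicateFilter {
    private final Set<String> seen = new HashSet<>();

    /** Hex-encoded SHA-1 of the key (a URL, or the whole document text). */
    static String signature(String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Returns true the first time a key is offered, false for repeats. */
    boolean markIfNew(String key) {
        return seen.add(signature(key));
    }
}

class DuplicateFilterDemo {
    public static void main(String[] args) {
        DuplicateFilter filter = new DuplicateFilter();
        String[] urls = {
            "http://example.com/a",
            "http://example.com/b",
            "http://example.com/a" // duplicate of the first entry
        };
        for (String url : urls) {
            if (filter.markIfNew(url)) {
                // Assumption: real code would build a Document here and
                // call writer.addDocument(doc) on the open IndexWriter.
                System.out.println("index " + url);
            } else {
                System.out.println("skip  " + url);
            }
        }
    }
}
```

Because the set lives in memory, the check does not depend on reopening a reader for every document. When adding to an existing index, the set would first have to be filled from the keys already stored there, as point 2> describes.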