But I think you have a problem here with searching the lucene index and deleting duplicate titles. Say you have the following titles: title one title one is a nice file title one is a really nice file
Further assume you're about to add a duplicate "title one" Searching on "title one" will give you three hits and you want one, assuming you're using an analyzer that breaks on whitespace. Are you indexing something (perhaps not shown in your snippet) that uniquely identifies each title? What you may have to do is index another field UN_TOKENIZED that contains the entire title, or some clever representation of the title that's more compact in order to handle this issue. BTW, instead of searching with a query, it might be faster to use TermEnum on your unique field. If TermEnum finds a term like the one you're about to add, you already have the title and can either delete the one already in the index (see TermDocs) or just not add the current one.... I've had cases where we update documents already in the index, so I needed a FIFO clean-up step. What I did was just index *everything*, without trying to determine whether a document was already in the index. Then, as a post-index step, use TermDocs/TermEnum to make a single pass through the index deleting all but the last instance of a document. If you elect to do this, the trick is to construct your term like new Term("field", "") to enumerate ALL the terms... Best Erick On 3/18/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
Markus, What you were thinking is fine - search and, if found, delete first, then add. Lucene allows duplicate and offers no automated way for avoiding them. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share ----- Original Message ---- From: Markus Franz <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Sunday, March 18, 2007 3:10:36 PM Subject: Eliminate duplicates Hello! An excerpt from my Lucene use: Document document = new Document(); document.add(new Field("title", item.title, Field.Store.YES, = Field.Index.TOKENIZED)); document.add(new Field("desc", item.desc, Field.Store.YES, = Field.Index.TOKENIZED)); writer.addDocument(document); writer.optimize(); My problem is: I can add the same entry twice times. How can I avoid = this by checking the title? (Of course I thought of doing a search query = for the title before inserting, but isn't there a more cool way?) Markus --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]