Re: How to not overwrite a Document if it 'already exists'?

Antony Bowesman Tue, 05 May 2009 16:25:39 -0700

Michael McCandless wrote:

> Lucene doesn't provide any way to do this, except opening a reader.
>
> Opening a reader is not "that" expensive if you use it for this
> purpose.  EG neither norms nor FieldCache will be loaded if you just
> enumerate the term docs.

Thanks for that info. These indexes will be large, in the tens of millions of documents. idfield is unique and is 29 bytes. I guess that's still a lot of data to trawl through to get to the term.

> But, you can let Lucene do the same thing for you by just always using
> updateDocument, which'll remove the old doc if it's present.

That's precisely what I don't want to occur. I have two forms of a Document, which represent mail items. One 'full' version containing all index and stored data, which represents a searchable mail item and one 'base', which is simply a marker Document which represents a mail in a forwarded mail chain, with just a couple of stored fields containing the mail meta data.

Under normal circumstances there are no problems as mails arrive in sequence and are never handled twice, but there is one case, during a reindex op, when the arrival of those mails can come out of sequence, i.e. a full mail is indexed first, but that mail is later processed as part of a forwarded mail chain of another mail.

It is the second time that mail is handled as a base mail that I do not want it to overwrite the full version.
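To make the rule concrete, here is a rough sketch of the behaviour described above. It is not Lucene code: a plain in-memory map stands in for the index, and the names MailIndexModel, Kind, index and kindOf are hypothetical, chosen only for illustration. It also assumes that a full version may replace an existing base marker, which is an assumption and not something stated above.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the desired behaviour; a HashMap stands in for the index. */
class MailIndexModel {
    /** The two forms of a mail Document described above. */
    enum Kind { BASE, FULL }

    // idfield value -> kind of Document currently stored for that id
    private final Map<String, Kind> docs = new HashMap<>();

    /**
     * Stores a Document unless doing so would replace a FULL one with a BASE one.
     * Returns true if the Document was written, false if it was skipped.
     */
    boolean index(String id, Kind kind) {
        Kind existing = docs.get(id);
        if (existing == Kind.FULL && kind == Kind.BASE) {
            return false; // keep the searchable full version
        }
        // New id, or an existing entry that may be replaced (assumption:
        // base -> full and same-kind replacement behave like an update).
        docs.put(id, kind);
        return true;
    }

    /** Returns the kind stored for the id, or null if the id is unknown. */
    Kind kindOf(String id) {
        return docs.get(id);
    }
}
```

In the out-of-sequence reindex case, indexing "mail-1" as FULL and then as BASE leaves the FULL entry in place, whereas an unconditional update would have replaced it with the marker.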

Would it be technically difficult to support something like this in the IndexWriter API and, if not, would it end up being more efficient than using a reader/terms to check this?


Antony




