Michael McCandless wrote:
Lucene doesn't provide any way to do this, except opening a reader.

Opening a reader is not "that" expensive if you use it for this
purpose.  EG neither norms nor FieldCache will be loaded if you just
enumerate the term docs.

Thanks for that info. These indexes will be large, in the tens of millions of documents. The id field is unique and is 29 bytes long. I guess that's still a lot of data to trawl through to get to the term.

But, you can let Lucene do the same thing for you by just always using
updateDocument, which'll remove the old doc if it's present.

That's precisely what I don't want to occur. I have two forms of a Document, both of which represent mail items. The 'full' version contains all indexed and stored data and represents a searchable mail item. The 'base' version is simply a marker Document that represents a mail in a forwarded mail chain, with just a couple of stored fields containing the mail metadata.

Under normal circumstances there are no problems, as mails arrive in sequence and are never handled twice. There is one case, though, during a reindex operation, when mails can arrive out of sequence: a full mail is indexed first, but that mail is later processed as part of the forwarded mail chain of another mail.

When that mail is handled the second time, as a base mail, I do not want it to overwrite the full version.
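
To make the rule concrete, here is a rough sketch of the behaviour I am after. It is only a model: a HashMap stands in for the index, keyed by the unique id field, and the names MailIndexModel, Form, index and stored are made up for illustration, not Lucene API. It also assumes that a full mail arriving after a base marker should replace the marker, which is what updateDocument would do anyway.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for the index, keyed by the unique id field.
// This models the write rule only; it is not Lucene API.
class MailIndexModel {
    enum Form { BASE, FULL }

    private final Map<String, Form> docsById = new HashMap<>();

    /**
     * Applies the write rule and returns true if the index changed.
     * FULL always replaces whatever is stored (assumed to behave like updateDocument).
     * BASE is written only when no document with that id exists yet.
     */
    boolean index(String id, Form form) {
        if (form == Form.BASE && docsById.containsKey(id)) {
            return false; // keep the existing document, whether full or base
        }
        docsById.put(id, form);
        return true;
    }

    // Returns the form currently stored for this id, or null if absent.
    Form stored(String id) {
        return docsById.get(id);
    }
}
```

So indexing id "a" as FULL and then as BASE leaves the FULL version in place, while indexing "b" as BASE and then as FULL ends with the FULL version. The containsKey check is the part that currently needs a reader against the real index.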

Would it be technically difficult to support something like this in the IndexWriter API and, if not, would it end up being more efficient than using a reader/terms to check this?

Antony




