Michael McCandless wrote:
> Lucene doesn't provide any way to do this, except opening a reader.
> Opening a reader is not "that" expensive if you use it for this
> purpose. EG neither norms nor FieldCache will be loaded if you just
> enumerate the term docs.
Thanks for that info. These indexes will be large, in the tens of millions of
documents. The id field is unique and is 29 bytes long, so I guess that's still
a lot of data to trawl through to get to the term.
> But, you can let Lucene do the same thing for you by just always using
> updateDocument, which'll remove the old doc if it's present.
That's precisely what I don't want to occur. I have two forms of Document, both
representing mail items. The 'full' version contains all indexed and stored
data and represents a searchable mail item. The 'base' version is simply a
marker Document representing a mail in a forwarded mail chain, with just a
couple of stored fields containing the mail metadata.
Under normal circumstances there are no problems as mails arrive in sequence and
are never handled twice, but there is one case, during a reindex op, when the
arrival of those mails can come out of sequence, i.e. a full mail is indexed
first, but that mail is later processed as part of a forwarded mail chain of
another mail.
When that mail is handled the second time, as a base mail, I do not want it to
overwrite the full version.
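To make the rule concrete, here is a minimal sketch of the behaviour I am after. It is plain Java with an in-memory map standing in for the index, so it is not Lucene API; the names MailIndexGuard, Kind and addOrKeep are hypothetical, made up for illustration. In a real index the lookup would be the reader/term check discussed above, and the write would be an updateDocument or addDocument call.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of "a base mail must never overwrite an existing mail". */
class MailIndexGuard {

    /** The two forms of Document described above. */
    enum Kind { BASE, FULL }

    // Stand-in for the index: unique id -> kind of Document currently stored.
    // Assumption: in Lucene this lookup would be a term check on the id field.
    private final Map<String, Kind> index = new HashMap<>();

    /**
     * Writes the incoming mail unless it is a base mail and some version
     * (full or base) is already present for that id.
     *
     * @return true if the document was written, false if it was skipped
     */
    boolean addOrKeep(String id, Kind incoming) {
        Kind existing = index.get(id);
        if (incoming == Kind.BASE && existing != null) {
            // A marker document carries nothing new; keep what is there.
            return false;
        }
        // A full mail always wins, replacing a base marker if one exists.
        index.put(id, incoming);
        return true;
    }

    /** Returns the kind currently stored for the id, or null if absent. */
    Kind kindOf(String id) {
        return index.get(id);
    }

    public static void main(String[] args) {
        MailIndexGuard guard = new MailIndexGuard();
        // Out-of-sequence case during a reindex: full mail first, base later.
        System.out.println(guard.addOrKeep("mail-1", Kind.FULL));
        System.out.println(guard.addOrKeep("mail-1", Kind.BASE));
        System.out.println(guard.kindOf("mail-1"));
    }
}
```

The point of the sketch is only the precedence: a full mail may replace a base marker, but a base marker is written only when nothing exists for that id.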
Would it be technically difficult to support something like this in the
IndexWriter API and, if not, would it end up being more efficient than using a
reader/terms to check this?
Antony