Daniel Noll wrote:
> On Wednesday 12 March 2008 19:36:57 Michael McCandless wrote:
>> OK, I think very likely this is the issue: when IndexWriter hits an
>> exception while processing a document, the portion of the document
>> already indexed is left in the index, and then its docID is marked
>> for deletion. You can see these deletions in your infoStream:
>>
>>   flush 0 buffered deleted terms and 30 deleted docIDs on 20 segments
>>
>> This means you have deletions in your index, by docID, and so when
>> you optimize the docIDs are then compacted.
> Aha. Under 2.2, a failure would result in nothing being added to the
> text index, so this would explain the problem. It would also explain
> why smaller data sets are less likely to cause the problem (it's less
> likely for there to be an error in them).
Yes.
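
To make it concrete, here's a self-contained sketch against the
2.3-era API (the always-failing analyzer is contrived, just to
simulate a document that dies during tokenization; it's not anything
from your code):

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.*;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.RAMDirectory;

    public class PartialDocDemo {
      public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();

        // Contrived analyzer that always fails mid-document:
        Analyzer failing = new Analyzer() {
          public TokenStream tokenStream(String field, Reader in) {
            return new TokenStream() {
              public Token next() throws IOException {
                throw new IOException("simulated tokenizer failure");
              }
            };
          }
        };

        IndexWriter writer = new IndexWriter(dir, failing, true);
        Document doc = new Document();
        doc.add(new Field("body", "some text",
                          Field.Store.NO, Field.Index.TOKENIZED));
        try {
          writer.addDocument(doc);
        } catch (IOException e) {
          // 2.3 keeps the partially indexed document and marks its
          // docID for deletion; 2.2 left the index untouched.
        }
        writer.close();

        IndexReader reader = IndexReader.open(dir);
        // Expect maxDoc > numDocs: the failed add consumed a docID.
        System.out.println("maxDoc=" + reader.maxDoc()
                           + " numDocs=" + reader.numDocs());
        reader.close();
      }
    }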
> Workarounds?
>
> - flush() after any IOException from addDocument() (overhead?)
What exceptions are you actually hitting (is it really an
IOException)? I thought something was going wrong in retrieving or
tokenizing the document.
I don't think flush() helps, because it just flushes the pending
deletes as well.
> - use ++ to determine the next document ID instead of
>   index.getWriter().docCount() (out of sync after an error, but fixes
>   itself on optimize())
I think this would work, but you're definitely still in the realm of
"guessing how Lucene assigns docIDs under the hood" so it's risky
over time. Likely this is the highest performance option.
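
If I'm reading the idea right, it's roughly this sketch (the "uid"
field, the map, and the counter are made-up names, not your code):

    void indexAll(IndexWriter writer, Iterator docs, Map uidToDocID)
        throws IOException {
      // Count only successful adds; after optimize() compacts away
      // the deleted docIDs left by failed adds, the surviving docIDs
      // line up with this counter again.
      int nextCompactedID = 0;
      while (docs.hasNext()) {
        Document doc = (Document) docs.next();
        try {
          writer.addDocument(doc);
          uidToDocID.put(doc.get("uid"), new Integer(nextCompactedID++));
        } catch (IOException e) {
          // Don't increment: the failed add burned a docID inside
          // Lucene, so the mapping is only right again once
          // optimize() removes that deleted slot.
        }
      }
      writer.optimize();  // compacts docIDs to match the counter
    }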
But, when a normal merge of segments with deletions completes, your
docIDs will shift. In trunk we now explicitly compute the docID
shifting that happens after a merge, because we don't always flush
pending deletes when flushing added docs, but this is all done
privately to IndexWriter.
I'm a little confused: you said optimize() introduces the problem,
but it sounds like optimize() should be fixing the problem, because
it compacts all docIDs to match what you were "guessing" outside of
Lucene. Can you post the full stack trace of the exceptions you're
hitting?
> - Use a field for a separate ID (slower later when reading the index)
Looks too slow based on your results.
Can you pre-load the UID into the FieldCache? There were also
discussions recently about adding "column-stride" fields to Lucene,
basically a faster FieldCache (to load initially), which would apply
here I think.
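
Something like this sketch, I mean (it assumes you indexed the UID as
an untokenized, integer-valued field named "uid"; the names are just
for illustration):

    IndexReader reader = IndexReader.open(dir);
    // One pass over the term index at warm-up; the array is cached
    // for this reader's lifetime, so each later docID -> UID lookup
    // is just an array read, not a stored-field fetch.
    int[] uids = FieldCache.DEFAULT.getInts(reader, "uid");
    int uidForHit = uids[docID];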
> - ???
Trunk has a new expungeDeletes method which should be lower cost than
optimize, but not necessarily that much lower cost.
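
For example (a sketch against trunk as of this writing; "dir" and
"analyzer" stand in for your existing index and analyzer):

    // Merges only the segments that currently have deletions,
    // compacting docIDs without rewriting every segment the way
    // optimize() does.
    IndexWriter writer = new IndexWriter(dir, analyzer, false);
    writer.expungeDeletes();
    writer.close();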
Mike