Daniel Noll wrote:
> On Wednesday 12 March 2008 19:36:57 Michael McCandless wrote:
>> OK, I think very likely this is the issue: when IndexWriter hits an
>> exception while processing a document, the portion of the document
>> already indexed is left in the index, and then its docID is marked
>> for deletion. You can see these deletions in your infoStream:
>>
>>   flush 0 buffered deleted terms and 30 deleted docIDs on 20 segments
>>
>> This means you have deletions in your index, by docID, and so when
>> you optimize the docIDs are then compacted.
> Aha. Under 2.2, a failure would result in nothing being added to the
> text index, so this would explain the problem. It would also explain
> why smaller data sets are less likely to cause the problem (it's less
> likely for there to be an error in them).
Yes.
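
To make it concrete, here's a self-contained sketch against the
2.3-era API (the always-failing analyzer is contrived, just to
simulate a document that dies during tokenization; it's not anything
from your code):

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.*;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.RAMDirectory;

    public class PartialDocDemo {
      public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();

        // Contrived analyzer that always fails mid-document:
        Analyzer failing = new Analyzer() {
          public TokenStream tokenStream(String field, Reader in) {
            return new TokenStream() {
              public Token next() throws IOException {
                throw new IOException("simulated tokenizer failure");
              }
            };
          }
        };

        IndexWriter writer = new IndexWriter(dir, failing, true);
        Document doc = new Document();
        doc.add(new Field("body", "some text",
                          Field.Store.NO, Field.Index.TOKENIZED));
        try {
          writer.addDocument(doc);
        } catch (IOException e) {
          // 2.3 keeps the partially indexed document and marks its
          // docID for deletion; 2.2 left the index untouched.
        }
        writer.close();

        IndexReader reader = IndexReader.open(dir);
        // Expect maxDoc > numDocs: the failed add consumed a docID.
        System.out.println("maxDoc=" + reader.maxDoc()
                           + " numDocs=" + reader.numDocs());
        reader.close();
      }
    }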
> Workarounds?
>
> - flush() after any IOException from addDocument() (overhead?)
What exceptions are you actually hitting (is it really an
IOException)? I thought something was going wrong in retrieving or
tokenizing the document.
I don't think flush() helps, because it just flushes the pending
deletes as well.
> - use ++ to determine the next document ID instead of
>   index.getWriter().docCount() (out of sync after an error, but fixes
>   itself on optimize())
I think this would work, but you're definitely still in the realm of
"guessing how Lucene assigns docIDs under the hood" so it's risky
over time. Likely this is the highest performance option.
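
If I'm reading the idea right, it's roughly this sketch (the "uid"
field, the map, and the counter are made-up names, not your code):

    void indexAll(IndexWriter writer, Iterator docs, Map uidToDocID)
        throws IOException {
      // Count only successful adds; after optimize() compacts away
      // the deleted docIDs left by failed adds, the surviving docIDs
      // line up with this counter again.
      int nextCompactedID = 0;
      while (docs.hasNext()) {
        Document doc = (Document) docs.next();
        try {
          writer.addDocument(doc);
          uidToDocID.put(doc.get("uid"), new Integer(nextCompactedID++));
        } catch (IOException e) {
          // Don't increment: the failed add burned a docID inside
          // Lucene, so the mapping is only right again once
          // optimize() removes that deleted slot.
        }
      }
      writer.optimize();  // compacts docIDs to match the counter
    }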
But, when a normal merge of segments with deletions completes, your
docIDs will shift. In trunk we now explicitly compute the docID
shifting that happens after a merge, because we don't always flush
pending deletes when flushing added docs, but this is all done
privately to IndexWriter.
I'm a little confused: you said optimize() introduces the problem,
but it sounds like optimize() should be fixing the problem, because
it compacts all docIDs to match what you were "guessing" outside of
Lucene. Can you post the full stack trace of the exceptions you're
hitting?
> - Use a field for a separate ID (slower later when reading the index)
Looks too slow based on your results.
Can you pre-load the UID into the FieldCache? There were also
discussions recently about adding "column-stride" fields to Lucene,
basically a faster FieldCache (to load initially), which would apply
here I think.
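
Something like this sketch, I mean (it assumes you indexed the UID as
an untokenized, integer-valued field named "uid"; the names are just
for illustration):

    IndexReader reader = IndexReader.open(dir);
    // One pass over the term index at warm-up; the array is cached
    // for this reader's lifetime, so each later docID -> UID lookup
    // is just an array read, not a stored-field fetch.
    int[] uids = FieldCache.DEFAULT.getInts(reader, "uid");
    int uidForHit = uids[docID];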
> - ???
Trunk has a new expungeDeletes method which should be lower cost than
optimize, but not necessarily that much lower cost.
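
For example (a sketch against trunk as of this writing; "dir" and
"analyzer" stand in for your existing index and analyzer):

    // Merges only the segments that currently have deletions,
    // compacting docIDs without rewriting every segment the way
    // optimize() does.
    IndexWriter writer = new IndexWriter(dir, analyzer, false);
    writer.expungeDeletes();
    writer.close();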
Mike