On Thu, Jan 2, 2014 at 7:53 PM, Derek Lewis <de...@lewisd.com> wrote:
> Sorry for the delay responding.  Holidays and all that. :)

No problem.

> The retry approach did work, our process finished in the end.  At some
> point, I suppose we'll just live with the chance this might happen and dump
> a bunch of exceptions into the log, if the effort to fix it is too high.
> Being pragmatic and all.

Fair enough :)  I do think retry is a valid approach.

> You are correct that preventing the duplicate indexing is hard.  We do have
> things in place to try to prevent it, emphasis on the "try".  Occasionally,
> things go wrong and we get a small number of duplicates, but on at least one
> occasion that number was anything but small. ;)
>
> I'm as sure as I can be that there were no merges running, since we're
> locking that directory before running this process. All our things that
> index use that same lock, so unless merges happen in a background thread
> within Lucene, rather than the calling thread that's adding new documents
> to the index, there should be no merges going on outside of this lock.  In
> that case, calling waitForMerges shouldn't have any effect.

Merging does run in a background thread by default
(ConcurrentMergeScheduler), so a merge started earlier could still be
running when you "lock that directory".

I don't think IndexWriter kicks off merges on init today, but it's
free to (it's an impl detail).

Net/net one should not rely on when merges might happen...
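If you want to make that explicit rather than rely on timing, you can either drain merges before releasing the lock, or run merges on the indexing thread so none can be left running.  A rough sketch (assumes an already-open IndexWriter named "writer", plus an "analyzer"; the Version constant depends on your release):

```java
// Option 1: before handing the directory to the external process,
// block until all in-flight background merges finish, then commit.
writer.waitForMerges();
writer.commit();

// Option 2: SerialMergeScheduler runs merges on the indexing thread
// itself, so once addDocument/commit returns, no merge is still running.
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
iwc.setMergeScheduler(new SerialMergeScheduler());
```

Option 2 trades indexing throughput for determinism, so it's usually only worth it if the lock-then-hand-off pattern is central to your design.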

> I know you've mentioned the infoStream a couple times :) But I don't think
> turning it on would be a good idea, in our case.  We've only had this
> problem crop up once, so there's no guarantee at all that it'll happen
> again, and the infoStream logging would be a lot of data with all the
> indexing we're doing.  Unfortunately, I just don't think it's feasible.

In fact infoStream doesn't generate THAT much data: it doesn't log for
every added doc.  Only when segment changes happen (a flush, a merge,
deletes applied, etc.).  And it can be very useful in a post-mortem to
figure out what happened when something goes wrong.
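For reference, turning it on is a one-liner on the config; a sketch ("directory" and "analyzer" are placeholders, and the PrintStream target could just as well be a log file instead of stdout):

```java
// Sketch: route infoStream output to a PrintStream.  Lucene only writes
// on segment-level events (flushes, merges, applied deletes), not per doc.
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
iwc.setInfoStream(System.out);
IndexWriter writer = new IndexWriter(directory, iwc);
```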

> Thanks very much for the suggestion about FilterIndexReader with
> addIndexes.  That sounds very promising.  I'm going to investigate doing
> our duplicate filtering that way instead.
>
> Thanks again for the help.  Cheers :)

You're welcome!
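In case it helps, the usual shape of that approach is to wrap the source reader so duplicates look deleted, then addIndexes the wrapper.  A very rough sketch (in 4.x the class is FilterAtomicReader; "dedupBits", marking which docs to keep, is application-specific and hypothetical here, and should also be ANDed with the wrapped reader's own live docs):

```java
// Sketch: docs whose bit is false in dedupBits look deleted to
// addIndexes, so they are simply dropped during the copy.
class DedupReader extends FilterAtomicReader {
  private final Bits dedupBits;  // false = duplicate, skip it

  DedupReader(AtomicReader in, Bits dedupBits) {
    super(in);
    this.dedupBits = dedupBits;
  }

  @Override
  public Bits getLiveDocs() {
    return dedupBits;
  }
  // Note: numDocs() should also be overridden to reflect the
  // "deleted" duplicates if anything downstream relies on it.
}

// destWriter.addIndexes(new DedupReader(srcReader, dedupBits));
```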

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org