Re: IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager synchronization

Michael McCandless Wed, 28 Sep 2016 03:26:49 -0700

On Wed, Sep 28, 2016 at 3:05 AM, William Moss
<will.m...@airbnb.com.invalid> wrote:
> Thank you both for your quick reply!


You're welcome!

> * We actually tried the upgrade to 6.0 a few months back (when that was the
> newest) and were getting similar errors to the ones I'm seeing now. We were
> not able to track them down, which is part of the motivation for me asking
> all these questions. We'll get there though :-)

OK, we gotta get to the root cause.  Sounds like it happens in either version...

> * The last time we tested this (which I think was still post
> ConcurrentMergePolicy) we saw that the read speed would slowly degrade over
> time. My understanding was that forceMerge was very expensive, but would
> make reads faster once complete. Is this not correct?

It really depends on what queries you are running.  You should test
with your own use case and be certain that the massive expense of
forceMerge is worthwhile / necessary.  In general it's not worth it,
even if searches are a bit faster, except for indices that will never
change again.

> Also, we never
> attempted to tune the MergePolicy at all, so while we're on the subject, is
> there good documentation on how to do that? I'd much prefer to get away
> from calling forceMerge. If it's useful information, we've got a relatively
> small corpus, only ~2+M documents.

Just use the defaults :)  Tuning those settings is dangerous unless
you have a very specific problem to fix.

> * We want to be able to ensure that if a machine or JVM crashes we are in a
> coherent state. To that end, we need to call commit on Lucene and then
> commit back what we've read so far to Kafka. Calling commit is the only way
> to ensure this, right?

Correct: commit in Lucene, then notify Kafka what offset you had
indexed just before you called IW.commit.

But you may want to replicate the index across machines if you don't
want to have a single point of failure.  We recently added
near-real-time replication to Lucene for this use case ...
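
A minimal sketch of the commit-then-acknowledge ordering described above.
It assumes a single thread consumes from Kafka and indexes in offset
order. IndexCommitter, OffsetStore and CheckpointingIndexer are
hypothetical names used only for illustration; in a real application the
first would wrap IndexWriter.commit() (plus the taxonomy writer's commit)
and the second would wrap the Kafka consumer.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-ins so the sketch compiles without Lucene or Kafka.
// A real adapter would call IndexWriter.commit() and wrap any IOException.
interface IndexCommitter { void commit(); }
// A real adapter would hand the offset to the Kafka consumer.
interface OffsetStore { void commitOffset(long offset); }

final class CheckpointingIndexer {
    private final IndexCommitter index;
    private final OffsetStore offsets;
    // Offset of the last record handed to the index writer; -1 means none yet.
    private final AtomicLong lastIndexedOffset = new AtomicLong(-1);

    CheckpointingIndexer(IndexCommitter index, OffsetStore offsets) {
        this.index = index;
        this.offsets = offsets;
    }

    // Call after the record at this offset has been added to the index.
    void recordIndexed(long offset) {
        lastIndexedOffset.set(offset);
    }

    // Makes the index durable first, then acknowledges the offset that was
    // indexed just before the commit. Returns that offset, or -1 if nothing
    // has been indexed.
    long checkpoint() {
        long offset = lastIndexedOffset.get(); // capture BEFORE committing
        if (offset < 0) {
            return -1;
        }
        index.commit();               // 1. durable on the Lucene side
        offsets.commitOffset(offset); // 2. only then tell Kafka
        return offset;
    }
}
```

If the process dies between the two steps, the worst case is that some
records are re-read and indexed again after restart, so indexing should be
idempotent (for example, updateDocument keyed on a unique id). Kafka's own
convention is to commit the next offset to read, so a real OffsetStore
would likely pass offset + 1.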

> * To make sure I understand how maybeRefresh works, ignoring whether or not
> we commit for a second, if I add a document via IndexWriter, it will not be
> reflected in IndexSearchers I get by calling acquire on SearcherAndTaxonomy
> until I call maybeRefresh?

Correct.
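
The visibility rule can be illustrated with a toy model. This is not
Lucene code: ToyWriter and ToyManager are hypothetical classes that only
mimic the behaviour confirmed above, where refresh() stands in for
maybeRefresh and acquire() for acquiring a searcher from the manager.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Toy model only; these are not Lucene classes.
final class ToyWriter {
    private final List<String> docs = new ArrayList<>();

    synchronized void addDocument(String doc) {
        docs.add(doc);
    }

    // Immutable point-in-time copy of everything added so far.
    synchronized List<String> snapshot() {
        return Collections.unmodifiableList(new ArrayList<>(docs));
    }
}

final class ToyManager {
    private final ToyWriter writer;
    private final AtomicReference<List<String>> current;

    ToyManager(ToyWriter writer) {
        this.writer = writer;
        this.current = new AtomicReference<>(writer.snapshot());
    }

    // Readers only ever see the snapshot published by the last refresh.
    List<String> acquire() {
        return current.get();
    }

    // Publishes a new point-in-time view of the writer's contents.
    void refresh() {
        current.set(writer.snapshot());
    }
}
```

In this model a document added through ToyWriter stays invisible to
acquire() until refresh() runs, and a view acquired earlier never changes
underneath its reader.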

> Now, on to the concurrency issue. I was thinking a little more about this
> and I think the fundamental issue is that while IndexWriter and
> DirectoryTaxonomyWriter are each thread safe, together they are not. As
> suggested by the documentation, we use one instance each of IndexWriter,
> DirectoryTaxonomyWriter and SearcherTaxonomyManager. Imagine the following
> scenario:
> [Thread 1] Add document to DirectoryTaxonomyWriter
> [Thread 1] Add document to IndexWriter
> [Thread 1] Call commit on DirectoryTaxonomyWriter
> [Thread 2] Add document to DirectoryTaxonomyWriter
> [Thread 2] Add document to IndexWriter
> [Thread 1] Call commit on IndexWriter
> The on-disk representation should now contain things in the IndexWriter
> that are not contained in the DirectoryTaxonomyWriter, right?

Correct, I'm also confused about the commit order for this reason.
Let's see what Shai says.

However, that should not lead to a NoSuchFileException (NSFE).  At worst
it should lead to "ordinal is not known" (maybe as an
ArrayIndexOutOfBoundsException) from the taxonomy reader.
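
One possible way to rule out the interleaving in the quoted scenario is
to let adds run concurrently but make the pair of commits exclusive. The
sketch below is an assumption for illustration, not something proposed in
this thread: DocSink and CoordinatedWriters are hypothetical names, with
DocSink standing in for either writer. The commit order simply mirrors
the quoted scenario; whether that order is the right one is the open
question above.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for IndexWriter or DirectoryTaxonomyWriter.
interface DocSink {
    void add(String doc);
    void commit();
}

final class CoordinatedWriters {
    private final DocSink taxo;
    private final DocSink index;
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    CoordinatedWriters(DocSink taxo, DocSink index) {
        this.taxo = taxo;
        this.index = index;
    }

    // Many threads may add at once: the read lock is shared.
    void addDocument(String doc) {
        lock.readLock().lock();
        try {
            taxo.add(doc);  // taxonomy first, as in the quoted scenario
            index.add(doc);
        } finally {
            lock.readLock().unlock();
        }
    }

    // The write lock is exclusive, so no add can slip in between the two
    // commits; both commits therefore cover the same set of documents.
    void commit() {
        lock.writeLock().lock();
        try {
            taxo.commit();
            index.commit();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

The cost is that indexing threads stall for the duration of both commits,
which may or may not be acceptable depending on how often commit is called.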

> Assuming maybeRefresh looks at the state on disk when it's doing its
> update (if this is not true I don't understand why it was throwing
> NoSuchFileException) then it can be out of sync as well?

maybeRefresh (assuming new docs were indexed since you last called it)
will write new index files holding those indexed docs, and then open
them to do searching over them.

> I apparently never made a full copy of the stack trace. I'll attempt to
> regenerate it and post it here once I have it.

OK, we need to understand that.  It should not be happening ;)

Mike McCandless

http://blog.mikemccandless.com

