On Tue, Sep 27, 2016 at 7:05 AM, Shai Erera <ser...@gmail.com> wrote:
> Hmm ... the commit part of the two indexes is always tricky. The javadocs
> are correct because the order of indexing is as follows: when you index a
> document with facets, the facets are first added to the taxonomy index and
> only then the document is indexed in IW.
>
> Therefore if you concurrently index and commit, then committing TIW first
> ensures that all "known" facets up to this point are committed. Then when
> you commit IW, the documents in there are guaranteed to have their facet
> ordinals already in the committed TIW (which may at this point include more
> facets than are indexed in IW, but that's OK).
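
As a side illustration of the indexing order described above (facet into the taxonomy first, then the document into the main index), here is a toy Java model of what each commit order captures when an update lands between the two commits. Everything in it (CommitOrderSketch, ToyTaxonomy, ToyIndex, indexOneDoc, dangling, run) is a hypothetical stand-in written for this sketch, not Lucene API, and it assumes exactly one update arrives between the two commits.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only: these classes are hypothetical stand-ins, not Lucene APIs. */
final class CommitOrderSketch {

    /** Stand-in for a taxonomy writer: ordinals are added live, commit() snapshots them. */
    static final class ToyTaxonomy {
        private final Set<Integer> live = new HashSet<>();
        private Set<Integer> committed = new HashSet<>();

        int add() {
            int ord = live.size();
            live.add(ord);
            return ord;
        }

        void commit() { committed = new HashSet<>(live); }

        Set<Integer> committed() { return committed; }
    }

    /** Stand-in for the main index: each document is just the ordinal it references. */
    static final class ToyIndex {
        private final List<Integer> live = new ArrayList<>();
        private List<Integer> committed = new ArrayList<>();

        void addDocument(int ord) { live.add(ord); }

        void commit() { committed = new ArrayList<>(live); }

        List<Integer> committed() { return committed; }
    }

    /** Indexing order from the thread: facet to the taxonomy first, then the document. */
    static void indexOneDoc(ToyTaxonomy taxo, ToyIndex index) {
        int ord = taxo.add();
        index.addDocument(ord);
    }

    /** Counts committed documents whose ordinal is missing from the committed taxonomy. */
    static int dangling(ToyTaxonomy taxo, ToyIndex index) {
        int n = 0;
        for (int ord : index.committed()) {
            if (!taxo.committed().contains(ord)) {
                n++;
            }
        }
        return n;
    }

    /** Index one doc, do the first commit, index one more doc, do the second commit. */
    static int run(boolean taxonomyFirst) {
        ToyTaxonomy taxo = new ToyTaxonomy();
        ToyIndex index = new ToyIndex();
        indexOneDoc(taxo, index);
        if (taxonomyFirst) { taxo.commit(); } else { index.commit(); }
        indexOneDoc(taxo, index); // the update that arrives between the two commits
        if (taxonomyFirst) { index.commit(); } else { taxo.commit(); }
        return dangling(taxo, index);
    }

    public static void main(String[] args) {
        System.out.println("taxonomy first, dangling refs: " + run(true));
        System.out.println("index first, dangling refs:    " + run(false));
    }
}
```

In this model, run(true) (taxonomy committed first) reports one committed document whose ordinal is missing from the committed taxonomy, while run(false) reports none. It only plays through one interleaving under the stated assumptions; it does not show how DirectoryTaxonomyWriter or IndexWriter behave internally.
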
Hmm, but if you commit TIW first and IW after, isn't it possible that,
after the TIW commit finishes, I index a few more documents into IW that
add new taxonomy nodes/labels/ordinals, and then when I call IW.commit
those last few documents reference taxonomy nodes that do not exist in
the TIW commit point?

Mike McCandless

http://blog.mikemccandless.com

>> On Tue, Sep 27, 2016 at 2:08 AM, William Moss
>> <will.m...@airbnb.com.invalid> wrote:
>> > We're using Lucene 5.2.0 (I know it's old, we're in the process of
>> > upgrading) to handle searching over our listings here at Airbnb.
>>
>> 6.2.1 is a compelling upgrade because of more efficient indexing and
>> searching of numerics (among many other things!)...
>>
>> > I've been
>> > digging into our realtime indexing code and how we use Lucene and I
>> > wanted to check a few assumptions around synchronization, since we
>> > see some periodic exceptions[1] that I can't quite explain.
>> >
>> > First, a tiny bit of background
>> > 1. We use facets and therefore are writing realtime updates using both
>> > an IndexWriter and DirectoryTaxonomyWriter.
>> > 2. We have multiple update threads, consuming messages (from Kafka) and
>> > updating the index.
>> > 3. Once we process a batch of messages, we call commit (first on
>> > DirectoryTaxonomyWriter then on IndexWriter).
>>
>> I see TaxonomyWriter's javadocs say that is the correct order, but I
>> would have expected the opposite, if you are concurrently indexing
>> documents.
>>
>> > 4. We use SearcherTaxonomyManager to manage instances of IndexSearcher.
>> > 5. We periodically call forceMerge on our IndexWriter (to improve
>> > performance).
>>
>> This is dubious: if your index continues to receive changes, you
>> should skip forceMerge and let Lucene's natural merging run at
>> defaults. forceMerge is an incredibly costly operation and it's
>> unclear you get that much speedup at search time.
>>
>> > So, now to a few questions:
>> > 1. My understanding is the right way to handle a DirectoryTaxonomyWriter
>> > and an IndexWriter is to call commit on DirectoryTaxonomyWriter before
>> > IndexWriter. Is this correct? Since we're using multiple threads, we need
>> > to synchronize these calls within the process regardless, but curious to
>> > understand the design.
>>
>> You should not have to block index updates while committing, if you
>> don't need/want to.
>>
>> If you don't block updates, I would think you need to commit the
>> DirectoryTaxonomyWriter second so that any new nodes in the taxonomy
>> tree, referenced by the main index, are guaranteed to be present in
>> the DirectoryTaxonomyWriter's commit.
>>
>> Maybe Shai can shed some more light here...
>>
>> > 2. What about calls to maybeRefresh on SearcherTaxonomyManager? Do those
>> > need to be synchronized with the commit calls to either IndexWriter or
>> > DirectoryTaxonomyWriter?
>>
>> No.
>>
>> Commit can be a costly, slow operation (calling fsync on N files), and
>> it's designed internally in IndexWriter to not block operations like
>> merging and refreshing.
>>
>> > Do we need to call it every time we call
>> > commit? The comment suggests we call it "periodically," but I'm not
>> > clear on how often that should be or what conditions trigger the index
>> > to change in a way that this would be required.
>>
>> You don't have to call refresh on every commit. When you call it is
>> entirely up to you.
>>
>> Commit makes changes durable on disk, so an OS crash, power loss,
>> etc., won't lose those changes (a bad disk WILL lose them of course).
>>
>> Refresh makes changes visible for searching.
>>
>> The two ops are entirely separate.
>>
>> Some apps call commit periodically and never refresh, others call
>> refresh periodically and never commit :) It's your call.
>>
>> > 3. Lastly, what about forceMerge? Is there any worry there or can that
>> > just safely happen in the background? Is there any need to call commit
>> > afterward? Or does forceMerge effectively do that?
>>
>> Force merge does not call commit itself.
>>
>> If you do force merge, then it is a good idea to both commit and
>> refresh afterwards, as this will let Lucene free up resources (files,
>> file descriptors) held by the old un-merged segments.
>>
>> > Presumably, we would not
>> > see the new index until maybeRefresh was called the next time?
>>
>> Exactly.
>>
>> > Sorry, that was a lot of questions, would love help on any and all of
>> > them.
>>
>> No worries, keep them coming!
>>
>> > [1] When calling maybeRefresh, we've seen errors that look like:
>> > java.nio.file.NoSuchFileException: <snip>/6/_vj1.cfe
>>
>> Need the full stack trace / context here to understand what's happening...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
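
The distinction drawn earlier in the thread, that commit makes changes durable while refresh makes them visible to searches and the two are separate, can be played through in a small toy model. The class and its methods (CommitVsRefreshSketch, add, commit, refresh, search, crashAndReopen) are hypothetical names invented for this sketch and are not Lucene API; the model also assumes that reopening after a crash exposes exactly the last committed state.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: a hypothetical stand-in, not a Lucene class. */
final class CommitVsRefreshSketch {
    private final List<String> buffered = new ArrayList<>(); // every change accepted so far
    private int durableCount = 0; // how many changes would survive a crash
    private int visibleCount = 0; // how many changes searches can currently see

    void add(String doc) {
        buffered.add(doc);
    }

    /** Models commit: everything so far becomes durable; visibility is untouched. */
    void commit() {
        durableCount = buffered.size();
    }

    /** Models refresh: everything so far becomes searchable; durability is untouched. */
    void refresh() {
        visibleCount = buffered.size();
    }

    /** Returns a copy of the documents a search would currently see. */
    List<String> search() {
        return new ArrayList<>(buffered.subList(0, visibleCount));
    }

    /** Models a crash and restart: uncommitted changes are lost, searches see the commit. */
    void crashAndReopen() {
        buffered.subList(durableCount, buffered.size()).clear();
        visibleCount = durableCount;
    }

    public static void main(String[] args) {
        CommitVsRefreshSketch s = new CommitVsRefreshSketch();
        s.add("a");
        s.commit();          // "a" is durable but not yet searchable
        System.out.println("after commit only:  " + s.search());
        s.add("b");
        s.refresh();         // "a" and "b" are searchable, only "a" is durable
        System.out.println("after refresh:      " + s.search());
        s.crashAndReopen();  // "b" was never committed, so it is gone
        System.out.println("after crash/reopen: " + s.search());
    }
}
```

In this model a search sees nothing after the commit alone, sees both documents after the refresh, and sees only the committed document after the simulated crash. It is a sketch of the durability/visibility split as described in the thread, not of how IndexWriter or SearcherTaxonomyManager implement it.
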