Hmm, which Lucene version are you using? We recently beefed up the checking in this code, so you ought to be hitting an exception in newer versions.
But that being said, I think the bug is real: if you try to reopen from a newer NRT reader down to an older (commit point) reader then you can hit this. Can you open an issue and maybe post a test case showing it? Thanks.

Mike McCandless
http://blog.mikemccandless.com

On Tue, Sep 9, 2014 at 2:30 AM, Vitaly Funstein <vfunst...@gmail.com> wrote:
> I think I see the bug here, but maybe I'm wrong. Here's my theory:
>
> Suppose no segments at a particular commit point contain any deletes. Now,
> we also hold open an NRT reader into the index, which may end up with some
> deletes after the commit occurred. Then, according to the following
> conditional in StandardDirectoryReader, we get into the two-arg ctor
> of SegmentReader:
>
>   if (newReaders[i].getSegmentInfo().getDelGen() == infos.info(i).getDelGen()) {
>     // only DV updates
>     newReaders[i] = new SegmentReader(infos.info(i), newReaders[i],
>         newReaders[i].getLiveDocs(), newReaders[i].numDocs());
>   } else {
>     // both DV and liveDocs have changed
>     newReaders[i] = new SegmentReader(infos.info(i), newReaders[i]);
>   }
>
> That constructor looks like this:
>
>   SegmentReader(SegmentCommitInfo si, SegmentReader sr) throws IOException {
>     this(si, sr,
>         si.info.getCodec().liveDocsFormat().readLiveDocs(si.info.dir, si, IOContext.READONCE),
>         si.info.getDocCount() - si.getDelCount());
>   }
>
> At this point, the SegmentInfo we're trying to read live docs on is from the
> commit point, and if there weren't any deletes, then the following results
> in a null for the relevant file name, in Lucene40LiveDocsFormat.readLiveDocs():
>
>   String filename = IndexFileNames.fileNameFromGeneration(info.info.name,
>       DELETES_EXTENSION, info.getDelGen());
>   final BitVector liveDocs = new BitVector(dir, filename, context);
>
> This is where filename ends up being null, which gets passed all the way
> down to the File constructor.
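The null-filename path Vitaly describes can be sketched without Lucene at all. Below is a hedged, pure-JDK illustration: `mimicFileNameFromGeneration` and the `NO` constant are assumptions standing in for Lucene's `IndexFileNames.fileNameFromGeneration` behavior (a segment with no deletes has delGen == -1, so no live-docs file name exists), and the `File` constructor's documented NullPointerException on a null child matches the top frame of the reported stack trace.

```java
import java.io.File;

// Hypothetical mimic of the null-filename path, for illustration only --
// not Lucene's actual code.
public class LiveDocsNameSketch {
    static final long NO = -1; // assumed: delGen of a segment with no deletes

    // Assumed simplification of IndexFileNames.fileNameFromGeneration:
    // no generation means no file, so the name comes back null.
    static String mimicFileNameFromGeneration(String base, String ext, long gen) {
        if (gen == NO) {
            return null;
        }
        return base + "_" + Long.toString(gen, Character.MAX_RADIX) + "." + ext;
    }

    public static void main(String[] args) {
        // Commit-point segment with no deletes: the live-docs name is null.
        String filename = mimicFileNameFromGeneration("_0", "del", NO);
        System.out.println("filename = " + filename);

        // Passing that null down as File's child argument throws NPE,
        // as in "at java.io.File.<init>" at the top of the trace below.
        try {
            new File("someIndexDir", filename);
        } catch (NullPointerException e) {
            System.out.println("NPE from File ctor");
        }
    }
}
```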
>
> In a nutshell, I think the bug is that it is assumed that the segments from
> the commit point have deletes, when they may not, yet the original
> SegmentReader for the segment we are trying to reuse does.
>
> What I am not quite clear on yet is how we arrive at this point, because the
> test that causes the exception doesn't do any deletes/updates... could
> deletes occur as a result of a segment merge? This might explain the
> sporadic nature of this exception, since merge timings aren't deterministic.
>
> On Mon, Sep 8, 2014 at 11:45 AM, Vitaly Funstein <vfunst...@gmail.com> wrote:
>> UPDATE:
>>
>> After making the changes we discussed to enable sharing of SegmentReaders
>> between the NRT reader and a commit point reader, specifically calling
>> through to DirectoryReader.openIfChanged(DirectoryReader, IndexCommit), I
>> am seeing this exception, sporadically:
>>
>> Caused by: java.lang.NullPointerException
>>   at java.io.File.<init>(File.java:305)
>>   at org.terracotta.shaded.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:80)
>>   at org.terracotta.shaded.lucene.codecs.lucene40.BitVector.<init>(BitVector.java:327)
>>   at org.terracotta.shaded.lucene.codecs.lucene40.Lucene40LiveDocsFormat.readLiveDocs(Lucene40LiveDocsFormat.java:90)
>>   at org.terracotta.shaded.lucene.index.SegmentReader.<init>(SegmentReader.java:131)
>>   at org.terracotta.shaded.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:194)
>>   at org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:326)
>>   at org.terracotta.shaded.lucene.index.StandardDirectoryReader$2.doBody(StandardDirectoryReader.java:320)
>>   at org.terracotta.shaded.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:702)
>>   at org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromCommit(StandardDirectoryReader.java:315)
>>   at org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:278)
>>   at org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:260)
>>   at org.terracotta.shaded.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:183)
>>
>> Looking at the source quickly, it appears the child argument to the File
>> ctor is null; is it somehow possible that the segment infos in the commit
>> point weren't fully written out on a prior commit? Sounds unlikely,
>> yet disturbing... but nothing else has changed in my code, i.e. the way
>> commits are performed and indexes are reopened.
>>
>> On Fri, Aug 29, 2014 at 2:03 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
>>> On Thu, Aug 28, 2014 at 5:38 PM, Vitaly Funstein <vfunst...@gmail.com> wrote:
>>> > On Thu, Aug 28, 2014 at 1:25 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
>>> >>
>>> >> The segments_N file can be different, that's fine: after that, we then
>>> >> re-use SegmentReaders when they are in common between the two commit
>>> >> points. Each segments_N file refers to many segments...
>>> >>
>>> > Yes, you are totally right - I didn't follow the code far enough the first
>>> > time around. :) This is an excellent idea, actually - I can probably
>>> > arrange maintained commit points as an MRU data structure (e.g.
>>> > LinkedHashMap with access order), and simply grab the most recently opened
>>> > reader to pass in when obtaining a new one from the new commit point - to
>>> > maximize segment reader reuse.
>>>
>>> That's great!
>>>
>>> >> You can set it (min and max) as high as you want; the only hard
>>> >> requirement is that max >= 2*(min-1), I believe.
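The MRU arrangement Vitaly sketches can be built directly on an access-ordered LinkedHashMap. This is a hypothetical illustration, not code from the thread: the `CommitPointReaders` class is assumed, and a plain String stands in for the actual DirectoryReader handle.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of an MRU cache of per-commit-point readers. String is a stand-in
// for the real reader type; keyed by commit generation.
public class CommitPointReaders {
    // accessOrder=true: iteration runs least-recently to most-recently used
    private final LinkedHashMap<Long, String> byGeneration =
            new LinkedHashMap<>(16, 0.75f, true);

    void put(long generation, String reader) {
        byGeneration.put(generation, reader);
    }

    String get(long generation) {
        return byGeneration.get(generation); // also marks it most recently used
    }

    // The most recently used reader is the best candidate to pass to
    // openIfChanged, to maximize SegmentReader reuse across commit points.
    String mostRecentlyUsed() {
        String last = null;
        for (Map.Entry<Long, String> e : byGeneration.entrySet()) {
            last = e.getValue();
        }
        return last;
    }
}
```

Opening a reader for a new commit point would then look roughly like `DirectoryReader.openIfChanged(cache.mostRecentlyUsed(), commit)`, so segments shared with the most recently touched reader are not reopened.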
>>> >>
>>> > Looks like this is used inside Lucene41PostingsFormat, which simply passes
>>> > in those defaults - so you are effectively saying the minimum (and
>>> > therefore, maximum) block size can be raised to reduce the size of the terms
>>> > index inside those TreeMap nodes?
>>>
>>> Yes, but it then increases cost at search time to locate a given term,
>>> because more scanning is required once we seek to the block that
>>> might have the term.
>>>
>>> This reduces the size of the FST, but if RAM is being used by
>>> something else inside BT, it won't help. But from your screen shot it
>>> looked like it was almost entirely the FST, which is what I would expect.
>>>
>>> >> > We are already using a customized codec though, so perhaps adding
>>> >> > this to the codec is okay and transparent?
>>> >>
>>> >> Hmmm :) Customized in what manner?
>>> >>
>>> > We need the ability to turn off stored fields compression, so there
>>> > is one codec in case the system is configured that way. The other one
>>> > exists for compression on, but there I tweaked the stored fields format for
>>> > bias toward decompression, as well as a smaller chunk size - based on some
>>> > empirical observations in executed tests. I am guessing I'll just add
>>> > another customization to both that deals with the block sizing for the
>>> > postings format, and see what difference that makes...
>>>
>>> Ahh, OK. Yes, just add this custom terms index block sizing too.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
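Mike's constraint from the thread - max >= 2*(min-1) - can be checked up front when wiring custom block sizes into a codec. A minimal sketch of just that validation follows; the `checkBlockSizes` helper is an assumption for illustration, and in Lucene 4.x the validated values would be handed to the `Lucene41PostingsFormat(minTermBlockSize, maxTermBlockSize)` constructor.

```java
// Sketch: validate custom terms-index block sizes before building a codec.
// The rule enforced is the one stated in the thread: max >= 2*(min-1).
public class BlockSizeCheck {
    static void checkBlockSizes(int minBlock, int maxBlock) {
        if (minBlock <= 1) {
            throw new IllegalArgumentException("minBlock must be > 1: " + minBlock);
        }
        if (maxBlock < 2 * (minBlock - 1)) {
            throw new IllegalArgumentException(
                "maxBlock must be >= 2*(minBlock-1): min=" + minBlock + " max=" + maxBlock);
        }
    }

    public static void main(String[] args) {
        checkBlockSizes(25, 48);   // the 4.x defaults satisfy the rule: 48 >= 2*24
        checkBlockSizes(128, 256); // larger blocks: smaller terms-index FST,
                                   // but more in-block scanning at search time
    }
}
```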