Hmm, which Lucene version are you using?  We recently beefed up the
checking in this code, so you ought to be hitting an exception in
newer versions.

But that being said, I think the bug is real: if you try to reopen
from a newer NRT reader down to an older (commit point) reader then
you can hit this.

Can you open an issue and maybe post a test case showing it?  Thanks.
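
I'd expect something like this rough, untested sketch (stock 4.x APIs; the
version constant, analyzer and field names here are arbitrary, not from your
report) to tickle it:

  import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
  import org.apache.lucene.document.*;
  import org.apache.lucene.index.*;
  import org.apache.lucene.store.*;
  import org.apache.lucene.util.Version;

  public class TestReopenNRTToCommit {
    public static void main(String[] args) throws Exception {
      Directory dir = new RAMDirectory();
      IndexWriter w = new IndexWriter(dir,
          new IndexWriterConfig(Version.LUCENE_47,
              new WhitespaceAnalyzer(Version.LUCENE_47)));

      // 1) Create a commit point whose segments have no deletes:
      for (int i = 0; i < 10; i++) {
        Document doc = new Document();
        doc.add(new StringField("id", Integer.toString(i), Field.Store.NO));
        w.addDocument(doc);
      }
      w.commit();
      IndexCommit commit = DirectoryReader.listCommits(dir).get(0);

      // 2) Open an NRT reader that sees deletes made after that commit:
      DirectoryReader nrt = DirectoryReader.open(w, true);
      w.deleteDocuments(new Term("id", "7"));
      DirectoryReader nrt2 = DirectoryReader.openIfChanged(nrt, w, true);

      // 3) "Reopen down" from the NRT reader to the older commit point;
      //    per this thread, this is where the NPE should surface:
      DirectoryReader old = DirectoryReader.openIfChanged(nrt2, commit);
      System.out.println("reopened: " + old);
    }
  }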

Mike McCandless

http://blog.mikemccandless.com


On Tue, Sep 9, 2014 at 2:30 AM, Vitaly Funstein <vfunst...@gmail.com> wrote:
> I think I see the bug here, but maybe I'm wrong. Here's my theory:
>
> Suppose no segments at a particular commit point contain any deletes. Now,
> suppose we also hold open an NRT reader into the index, which picks up some
> deletes after the commit occurred. Then the delGens differ, so according to
> the following conditional in StandardDirectoryReader, we fall into the
> two-arg ctor of SegmentReader:
>
>     if (newReaders[i].getSegmentInfo().getDelGen() == infos.info(i).getDelGen()) {
>       // only DV updates
>       newReaders[i] = new SegmentReader(infos.info(i), newReaders[i],
>           newReaders[i].getLiveDocs(), newReaders[i].numDocs());
>     } else {
>       // both DV and liveDocs have changed
>       newReaders[i] = new SegmentReader(infos.info(i), newReaders[i]);
>     }
>
> That constructor looks like this:
>
>   SegmentReader(SegmentCommitInfo si, SegmentReader sr) throws IOException {
>     this(si, sr,
>          si.info.getCodec().liveDocsFormat().readLiveDocs(si.info.dir, si, IOContext.READONCE),
>          si.info.getDocCount() - si.getDelCount());
>   }
>
> At this point, the SegmentInfo we're trying to read live docs for is the
> one from the commit point, and if there weren't any deletes at that commit,
> the following in Lucene40LiveDocsFormat.readLiveDocs() produces a null for
> the relevant file name:
>
>     String filename = IndexFileNames.fileNameFromGeneration(info.info.name,
>         DELETES_EXTENSION, info.getDelGen());
>     final BitVector liveDocs = new BitVector(dir, filename, context);
>
> This is where filename ends up being null, which gets passed all the way
> down to the File constructor.
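>
> If I'm reading IndexFileNames.fileNameFromGeneration() right (paraphrasing
> from memory, so treat this as a sketch), a segment with no deletes has
> delGen == -1, and that is exactly the case where the method returns null
> instead of a file name:
>
>     if (gen == -1) {
>       return null; // no deletes -> no deletes file -> null "child" for File
>     }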
>
> In a nutshell, I think the bug is the assumption that the segment from the
> commit point has deletes, when it may have none, even though the original
> SegmentReader we are trying to reuse for that segment does.
>
> What I am not quite clear on yet is how we arrive at this point, because
> the test that triggers the exception doesn't do any deletes/updates...
> could deletes occur as a result of a segment merge? That might explain the
> sporadic nature of this exception, since merge timings aren't
> deterministic.
>
>
> On Mon, Sep 8, 2014 at 11:45 AM, Vitaly Funstein <vfunst...@gmail.com>
> wrote:
>
>> UPDATE:
>>
>> After making the changes we discussed to enable sharing of SegmentReaders
>> between the NRT reader and a commit point reader, specifically calling
>> through to DirectoryReader.openIfChanged(DirectoryReader, IndexCommit), I
>> am seeing this exception, sporadically:
>>
>> Caused by: java.lang.NullPointerException
>>         at java.io.File.<init>(File.java:305)
>>         at
>> org.terracotta.shaded.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:80)
>>         at
>> org.terracotta.shaded.lucene.codecs.lucene40.BitVector.<init>(BitVector.java:327)
>>         at
>> org.terracotta.shaded.lucene.codecs.lucene40.Lucene40LiveDocsFormat.readLiveDocs(Lucene40LiveDocsFormat.java:90)
>>         at
>> org.terracotta.shaded.lucene.index.SegmentReader.<init>(SegmentReader.java:131)
>>         at
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:194)
>>         at
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:326)
>>         at
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader$2.doBody(StandardDirectoryReader.java:320)
>>         at
>> org.terracotta.shaded.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:702)
>>         at
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromCommit(StandardDirectoryReader.java:315)
>>         at
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:278)
>>         at
>> org.terracotta.shaded.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:260)
>>         at
>> org.terracotta.shaded.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:183)
>>
>> Looking at the source quickly, it appears the child argument to the File
>> ctor is null; is it somehow possible that the segment infos for the commit
>> point weren't fully written out on a prior commit? Sounds unlikely, yet
>> disturbing... but nothing else has changed in my code, i.e. the way commits
>> are performed and indexes are reopened.
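>>
>> For reference, the reopen path I'm exercising is essentially this
>> (simplified; the variable names are mine, not from the actual code):
>>
>>   DirectoryReader nrtReader = DirectoryReader.open(writer, true);
>>   // ... later, step "down" to an older snapshot:
>>   DirectoryReader commitReader =
>>       DirectoryReader.openIfChanged(nrtReader, indexCommit);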
>>
>>
>> On Fri, Aug 29, 2014 at 2:03 AM, Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>>
>>> On Thu, Aug 28, 2014 at 5:38 PM, Vitaly Funstein <vfunst...@gmail.com>
>>> wrote:
>>> > On Thu, Aug 28, 2014 at 1:25 PM, Michael McCandless <
>>> > luc...@mikemccandless.com> wrote:
>>> >
>>> >>
>>> >> The segments_N file can be different, that's fine: after that, we then
>>> >> re-use SegmentReaders when they are in common between the two commit
>>> >> points.  Each segments_N file refers to many segments...
>>> >>
>>> >>
>>> > Yes, you are totally right - I didn't follow the code far enough the
>>> > first time around. :) This is an excellent idea, actually - I can
>>> > probably arrange maintained commit points as an MRU data structure
>>> > (e.g. LinkedHashMap with access order), and simply grab the most
>>> > recently opened reader to pass in when obtaining a new one from the
>>> > new commit point - to maximize segment reader reuse.
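>>> >
>>> > Roughly (hypothetical sketch; the names are made up):
>>> >
>>> >   // access-order LinkedHashMap behaves as an MRU cache: get() moves
>>> >   // the entry to the tail, so the last entry is the most recent
>>> >   Map<IndexCommit, DirectoryReader> readersByCommit =
>>> >       new LinkedHashMap<IndexCommit, DirectoryReader>(16, 0.75f, true);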
>>>
>>> That's great!
>>>
>>> >> You can set it (min and max) as high as you want; the only hard
>>> >> requirement is that max >= 2*(min-1), I believe.
>>> >>
>>> >
>>> > Looks like this is used inside Lucene41PostingsFormat, which simply
>>> > passes in those defaults - so you are effectively saying the minimum
>>> > (and therefore, maximum) block size can be raised to reduce the size
>>> > of the terms index inside those TreeMap nodes?
>>>
>>> Yes, but it then increases cost at search time to locate a given term,
>>> because more scanning is required once we seek to the block that might
>>> contain the term.
>>>
>>> This reduces the size of the FST, but if RAM is being used by something
>>> else inside BlockTree, it won't help. But from your screenshot it looked
>>> like it was almost entirely the FST, which is what I would expect.
>>>
>>> >> > We are already using a customized codec though, so perhaps adding
>>> >> > this to the codec is okay and transparent?
>>> >>
>>> >> Hmmm :)  Customized in what manner?
>>> >>
>>> >>
>>> > We need to have the ability to turn off stored fields compression, so
>>> > there is one codec in case the system is configured that way. The
>>> > other one exists for compression on, but there I tweaked the stored
>>> > fields format for bias toward decompression, as well as a smaller
>>> > chunk size - based on some empirical observations in executed tests.
>>> > I am guessing I'll just add another customization to both that deals
>>> > with the block sizing for the postings format, and see what difference
>>> > that makes...
>>>
>>> Ahh, OK.  Yes, just add this custom terms index block sizing too.
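>>>
>>> Something like this, untested, in your codec setup (the 64/128 sizes
>>> are just an example; keep max >= 2*(min-1)):
>>>
>>>   // defaults are min=25/max=48; larger blocks -> smaller terms index
>>>   PostingsFormat postings = new Lucene41PostingsFormat(64, 128);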
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
