[ 
https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932519#action_12932519
 ] 

Nico Krijnen edited comment on LUCENE-2729 at 11/16/10 11:18 AM:
-----------------------------------------------------------------

We did a few more runs with a lot more logging and *we think we have pinpointed 
the cause of the issue*. The logs and a short analysis are attached: 
eof-extra-logging-4-analysis.txt

We added logging to track which files are opened for reading and writing (see 
the sketch after the list below). At the moment the 'read past EOF' happens, no 
files are still open for writing, so it does not look like a file is opened but 
never closed. However, we did find another interesting entry:

- Some time before the 'read past EOF' we see activity in the disk index on the 
segments_... file that is 'locked' by the zoie snapshot:
-- 'our' segments_... file is deleted (the logged dir listings show that it no 
longer exists after that);
-- 'our' segments_... file is written to. At this point it is recreated, but 
ends up as 0 kb, which later causes the 'read past EOF'.
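
As an illustration of the kind of file tracking meant above: a wrapper along 
these lines logs every open-for-read, open-for-write and delete on the index 
directory. This is a minimal sketch only, not our actual logging code; the 
class name LoggingFSDirectory is made up, and it assumes the Lucene 3.0.x 
SimpleFSDirectory API.

{code}
import java.io.File;
import java.io.IOException;

import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.SimpleFSDirectory;

/**
 * Logs every open-for-read, open-for-write and delete on the index directory.
 * Sketch only: assumes the 3.0.x FSDirectory, where the single-argument
 * openInput(String) delegates to openInput(String, int).
 */
public class LoggingFSDirectory extends SimpleFSDirectory {

    public LoggingFSDirectory(File path) throws IOException {
        super(path);
    }

    @Override
    public IndexInput openInput(String name, int bufferSize) throws IOException {
        System.out.println("openInput    " + name);
        return super.openInput(name, bufferSize);
    }

    @Override
    public IndexOutput createOutput(String name) throws IOException {
        System.out.println("createOutput " + name);
        return super.createOutput(name);
    }

    @Override
    public void deleteFile(String name) throws IOException {
        System.out.println("deleteFile   " + name);
        super.deleteFile(name);
    }
}
{code}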

We added more logging to find out when, and by whom, the 'generation' in 
SegmentInfos is set or changed.

The logs from that run seem to indicate that *at some point 
SegmentInfos.generation 'wraps' to a lower value and earlier generation numbers 
get re-used*.

In ZoieIndexDeletionPolicy, a Snapshot is initialized with a specific 
IndexCommit. Zoie does not take into account that Lucene may re-use segments 
filenames: nothing locks this file or otherwise keeps Lucene from re-using this 
specific segments filename.

As far as I can tell from the code in SegmentInfos, the generation counter is 
reset in SegmentInfos.read(Directory). Looking at the log files and singling 
out the *value of generation, in our case it never even gets above 118*. The 
segments filename of our Snapshot is 'segments_2q', which translates to a 
generation of 98. After a while, Lucene deletes this file on disk (because the 
name is re-used?) in a call to FSDirectory.deleteFile. However, our backup is 
still running at that point. Somehow the file is written to disk again a moment 
later, and on the next read we get the 'read past EOF' error.
It is not completely clear to me why the generation counter is reset on 
read(Directory). Is it because the current segments file is determined through 
FindSegmentsFile(Directory).run()?
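
To make the filename/generation relation concrete: the generation is simply the 
suffix of the segments_N filename parsed in base 36, and as far as I understand 
it the 'current' generation is re-derived from whatever segments_N files are 
actually present on disk. A minimal sketch of that reading follows; the class 
and method names below are made up, and the scan is a simplification of what 
FindSegmentsFile really does (it also consults segments.gen and retries).

{code}
import java.io.IOException;

import org.apache.lucene.store.Directory;

public class GenerationCheck {

    /** "segments_2q" -> 98: the generation is the filename suffix in base 36. */
    static long generationOf(String segmentsFileName) {
        return Long.parseLong(segmentsFileName.substring("segments_".length()),
                              Character.MAX_RADIX);
    }

    /** Simplified version of the 'take the highest segments_N on disk' scan. */
    static long latestGenerationOnDisk(Directory dir) throws IOException {
        long latest = -1;
        for (String file : dir.listAll()) {
            if (file.startsWith("segments_")) {
                latest = Math.max(latest, generationOf(file));
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        System.out.println(generationOf("segments_2q")); // prints 98
    }
}
{code}

If the newest segments_N files have disappeared from disk, such a scan simply 
drops back to an older value, which would look exactly like the generation 
'wrapping' that we see in the logs.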

Although this could be interpreted as a flaw in ZoieIndexDeletionPolicy, I 
think it should be worked around in Lucene (correct me if I'm wrong).

We can attempt some kind of 'fix' that skips generations corresponding to the 
segments filenames held by the '_currentSnapshots' variable in 
ZoieIndexDeletionPolicy, but neither the IndexDeletionPolicy API nor the 
SegmentInfos class really provides a nice way of doing that, and it does not 
feel like a proper fix.
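
To illustrate the direction we have in mind: a deletion policy could refuse to 
delete any commit whose segments file is referenced by a live snapshot, roughly 
like the sketch below (class and method names are made up; it assumes the 3.x 
IndexDeletionPolicy interface). Note that even this only protects the commit 
from deletion; it does not stop Lucene from later re-using the same generation, 
which is exactly the part the current API does not seem to expose.

{code}
import java.io.IOException;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

/** Keeps a commit alive while its segments_N file is referenced by a snapshot. */
public class SnapshotAwareDeletionPolicy implements IndexDeletionPolicy {

    // segments_N filenames of commits that are currently being backed up
    private final Set<String> liveSnapshots =
            Collections.synchronizedSet(new HashSet<String>());

    public void snapshotTaken(IndexCommit commit) {
        liveSnapshots.add(commit.getSegmentsFileName());
    }

    public void snapshotReleased(IndexCommit commit) {
        liveSnapshots.remove(commit.getSegmentsFileName());
    }

    public void onInit(List<? extends IndexCommit> commits) throws IOException {
        onCommit(commits);
    }

    public void onCommit(List<? extends IndexCommit> commits) throws IOException {
        // The last commit in the list is the most recent one; keep it, keep any
        // snapshotted commits, and delete the rest.
        IndexCommit newest = commits.get(commits.size() - 1);
        for (IndexCommit commit : commits) {
            if (commit != newest
                    && !liveSnapshots.contains(commit.getSegmentsFileName())) {
                commit.delete();
            }
        }
    }
}
{code}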

*Is there a specific reason for the generation to 'wrap around' or be re-used 
within hours?*
Looking at the code in SegmentInfos, I do not see any explicit logic or reason 
for re-using earlier generation numbers.

> Index corruption after 'read past EOF' under heavy update load and snapshot 
> export
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-2729
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2729
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 3.0.1, 3.0.2
>         Environment: Happens on both OS X 10.6 and Windows 2008 Server. 
> Integrated with zoie (using a zoie snapshot from 2010-08-06: 
> zoie-2.0.0-snapshot-20100806.jar).
>            Reporter: Nico Krijnen
>         Attachments: 2010-11-02 IndexWriter infoStream log.zip, 
> backup_force_failure2.log.zip, eof-extra-logging-4-analysis.txt, 
> eof-extra-logging-4.log.zip, LUCENE-2729-test1.patch
>
>
> We have a system running lucene and zoie. We use lucene as a content store 
> for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled 
> backups of the index. This works fine for small indexes and when there are 
> not a lot of changes to the index when the backup is made.
> On large indexes (about 5 GB to 19 GB), when a backup is made while the index 
> is being changed a lot (lots of document additions and/or deletions), we 
> almost always get a 'read past EOF' at some point, followed by lots of 'Lock 
> obtain timed out'.
> At that point we get lots of 0 kb files in the index, data gets lost, and the 
> index is unusable.
> When we stop our server, remove the 0 kb files and restart the server, the 
> index is operational again, but data has been lost.
> I'm not sure if this is a zoie or a lucene issue, so I'm posting it to both. 
> Hopefully someone has some ideas where to look to fix this.
> Some more details...
> Stack trace of the read past EOF and following Lock obtain timed out:
> {code}
> 78307 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
>     ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF
> java.io.IOException: read past EOF
>     at 
> org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154)
>     at 
> org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
>     at 
> org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37)
>     at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69)
>     at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245)
>     at 
> org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:166)
>     at 
> org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725)
>     at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987)
>     at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973)
>     at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162)
>     at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003)
>     at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203)
>     at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223)
>     at 
> proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
>     at 
> proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
>     at 
> proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
>     at 
> proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
> 579336 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] 
>     ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - 
>     Problem copying segments: Lock obtain timed out: 
>     org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: 
>     org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
>     at org.apache.lucene.store.Lock.obtain(Lock.java:84)
>     at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060)
>     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:957)
>     at 
> proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176)
>     at 
> proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228)
>     at 
> proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
>     at 
> proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
>     at 
> proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
>     at 
> proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
> {code}
> We get exactly the same behaviour on both OS X and on Windows. On both, zoie 
> is using a SimpleFSDirectory.
> We also use a SingleInstanceLockFactory (since our process is the only one 
> working with the index), but we get the same behaviour with a NativeFSLock.
> The snapshot backup is being made by calling:
> *proj.zoie.impl.indexing.ZoieSystem.exportSnapshot(WritableByteChannel)*
> Same issue in zoie JIRA:
> http://snaprojects.jira.com/browse/ZOIE-51
