[ https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934857#action_12934857 ]

Jan te Beest commented on LUCENE-2729:
--------------------------------------

Hi Michael,

I'm one of Nico's colleagues and I've taken over where he left off.

After a lot of testing and debugging, I have found a workaround for the issue 
and collected more logging. 
Hopefully this will help to find a real fix (be it in Zoie or Lucene) instead 
of the dirty hack I've put in the code right now.

Attached is a zip file with a full log (eof-logging-9.log) and the same log 
stripped down to the period where the backup is happening, i.e. between creating 
the Snapshot and the moment the 'read past EOF' occurs (stripped-log.log). 
The other two files are an svn diff against Lucene and a git patch against Zoie.

First let me try to explain what I think is happening.

{noformat}
11:50:07,449 [http-8080-exec-4] DEBUG proj.zoie.impl.indexing.internal.ZoieIndexDeletionPolicy - Init Snapshot segmentsFileName=segments_5e
{noformat}
Snapshot is initialized with IndexCommit(segments_5e)

It is not really apparent from this log, as I had to restart the server due to 
those "load manager has stopped" ZoieExceptions, but the behaviour is the same 
as in all my other test runs: all files are properly written to the .snapshot 
file, including the segments_5e file.

After the backup completes, the Snapshot is dereferenced.
{noformat}
12:43:06,988 [Finalizer] DEBUG proj.zoie.impl.indexing.internal.ZoieIndexDeletionPolicy - Snapshot(IndexFileDeleter.CommitPoint(segments_5e)).refCount=0
java.lang.Exception: 12:43:06.988[Finalizer]

        at com.ds.util.log.TraceUtil.stackTrace(TraceUtil.java:20)
        at proj.zoie.impl.indexing.internal.ZoieIndexDeletionPolicy$Snapshot.decRef(ZoieIndexDeletionPolicy.java:139)
        at proj.zoie.impl.indexing.internal.ZoieIndexDeletionPolicy$Snapshot.close(ZoieIndexDeletionPolicy.java:123)
        at proj.zoie.impl.indexing.internal.ZoieIndexDeletionPolicy$Snapshot.finalize(ZoieIndexDeletionPolicy.java:148)
        at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
        at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:83)
        at java.lang.ref.Finalizer.access$100(Finalizer.java:14)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:160)

12:43:06,991 [Finalizer] DEBUG proj.zoie.impl.indexing.internal.ZoieIndexDeletionPolicy - Snapshot(IndexFileDeleter.CommitPoint(segments_5e)), refcount is now -1
{noformat}
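
Note that the refcount goes from 0 to -1 here, which looks like one decRef() too 
many: the stack trace shows finalize() calling close() calling decRef(), so if 
close() was already called explicitly earlier, the finalizer decrements a second 
time. Purely as a sketch of the guard I have in mind (this is hypothetical, not 
Zoie's actual Snapshot code), making close() idempotent would rule that out:
{code}
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not Zoie's actual Snapshot class: if both
// close() and finalize() decrement the refcount, it can end up at -1
// as in the log above. An idempotent close() prevents that.
public class GuardedSnapshot implements Closeable {

  private final AtomicBoolean closed = new AtomicBoolean(false);

  private void decRef() {
    // release the underlying IndexCommit here
  }

  public void close() throws IOException {
    // only the first call decrements; later calls are no-ops
    if (closed.compareAndSet(false, true)) {
      decRef();
    }
  }

  @Override
  protected void finalize() throws Throwable {
    try {
      close();
    } finally {
      super.finalize();
    }
  }
}
{code}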

Now something truncates the segments_5e file to 0 bytes; I have not been able to 
figure out why this happens though. I'll try to get some proper logging 
about this.
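
One thing I plan to try for that logging (hypothetical, not part of the attached 
patches): a directory wrapper that prints a stack trace whenever a segments_* 
file is opened for writing. If I read FSDirectory correctly, createOutput deletes 
any existing file of that name and starts a fresh 0-byte one, so whatever 
truncates segments_5e should show up in such a trace:
{code}
import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.IndexFileNames;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.SimpleFSDirectory;

// Debugging wrapper (sketch): log a stack trace whenever a
// segments_* file is opened for writing.
public class TracingFSDirectory extends SimpleFSDirectory {

  public TracingFSDirectory(File path) throws IOException {
    super(path);
  }

  @Override
  public IndexOutput createOutput(String name) throws IOException {
    if (name.startsWith(IndexFileNames.SEGMENTS)) {
      new Exception("createOutput(" + name + ")").printStackTrace();
    }
    return super.createOutput(name);
  }
}
{code}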

Where the 'read past EOF' happens is quite clear at this point, however.

It occurs when a new IndexFileDeleter is initialized and the directory is 
scanned for segments_* files:
{noformat}
        if (fileName.startsWith(IndexFileNames.SEGMENTS)) {

          // This is a commit (segments or segments_N), and
          // it's valid (<= the max gen).  Load it, then
          // incref all files it refers to:
{noformat}

This piece of code yields the EOF error, which is no surprise given that the 
segments_* file is 0 bytes. Note that 'read past EOF' is a plain IOException, 
so the FileNotFoundException catch does not swallow it:
{noformat}
          SegmentInfos sis = new SegmentInfos();
          try {
            sis.read(directory, fileName);
          } catch (FileNotFoundException e) {
{noformat}

I added another check to the {{if (fileName.startsWith(IndexFileNames.SEGMENTS))}} 
condition. However, that does not solve the problem; it merely works around it.
{noformat}
        long fileLength = directory.fileLength(fileName);
        if (fileName.startsWith(IndexFileNames.SEGMENTS) && fileLength <= 0) {
          message("init: file " + fileName + " is invalid (has no length), skipping further handling");
        }

        if (fileName.startsWith(IndexFileNames.SEGMENTS) && fileLength > 0) {

          // This is a commit (segments or segments_N), and
          // it's valid (<= the max gen).  Load it, then
          // incref all files it refers to:
{noformat}

We don't see any more errors with this hack (in fact, I was able to do 10 
backups of a 21 GB index in a row). The segments_* files are deleted from the 
index directory after a while as well, so it's not like they have a positive 
refcount anywhere else (if I understand correctly).

So, somewhere between the Snapshot being closed and a new IndexFileDeleter 
being initialized, something is writing 0 bytes to that segments_* file.
As said before, I have not been able to find the place where this happens. Since 
it only happens under heavy load (I had 25 threads reading and writing the 
index besides backing up the index to the snapshot file), I'm guessing it may 
be some synchronization issue?
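
For what it's worth, my test boils down to something like the sketch below. This 
is a hypothetical harness using Lucene's stock SnapshotDeletionPolicy rather than 
Zoie's ZoieIndexDeletionPolicy (and a made-up path and document), but it shows the 
same snapshot-under-write pattern: concurrent committing writers, plus a loop that 
pins a commit, would back it up, releases it, and then checks the length of the 
pinned segments_N file.
{code}
import java.io.File;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
import org.apache.lucene.index.SnapshotDeletionPolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

// Hypothetical repro harness: concurrent committing writers plus a
// snapshot/release loop, checking the pinned segments_N file length
// after each release. Path and field contents are made up.
public class SnapshotStress {
  public static void main(String[] args) throws Exception {
    Directory dir = new SimpleFSDirectory(new File("/tmp/stress-index"));
    SnapshotDeletionPolicy sdp =
        new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
    final IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_30), sdp,
        IndexWriter.MaxFieldLength.UNLIMITED);

    final AtomicBoolean stop = new AtomicBoolean(false);
    Thread[] threads = new Thread[25];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread(new Runnable() {
        public void run() {
          try {
            while (!stop.get()) {
              Document doc = new Document();
              doc.add(new Field("body", "some text",
                  Field.Store.YES, Field.Index.ANALYZED));
              writer.addDocument(doc);
              writer.commit();
            }
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
      threads[i].start();
    }

    for (int round = 0; round < 10; round++) {
      IndexCommit commit = sdp.snapshot();  // pin this commit point
      String segments = commit.getSegmentsFileName();
      // a real backup would copy commit.getFileNames() here
      sdp.release();
      System.out.println(segments + " length after release: "
          + (dir.fileExists(segments) ? dir.fileLength(segments) : -1));
    }

    stop.set(true);
    for (Thread t : threads) {
      t.join();
    }
    writer.close();
    dir.close();
  }
}
{code}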

Perhaps this gives you an idea. If you want me to put in more logging or do 
something else to clear things up, just give me a hint.

Cheers,
Jan te Beest


> Index corruption after 'read past EOF' under heavy update load and snapshot 
> export
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-2729
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2729
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 3.0.1, 3.0.2
>         Environment: Happens on both OS X 10.6 and Windows 2008 Server. 
> Integrated with zoie (using a zoie snapshot from 2010-08-06: 
> zoie-2.0.0-snapshot-20100806.jar).
>            Reporter: Nico Krijnen
>         Attachments: 2010-11-02 IndexWriter infoStream log.zip, 
> backup_force_failure2.log.zip, eof-extra-logging-4-analysis.txt, 
> eof-extra-logging-4.log.zip, LUCENE-2729-test1.patch, 
> read-past-eof-debugging.zip
>
>
> We have a system running lucene and zoie. We use lucene as a content store 
> for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled 
> backups of the index. This works fine for small indexes and when there are 
> not a lot of changes to the index when the backup is made.
> On large indexes (about 5 GB to 19 GB), when a backup is made while the index 
> is being changed a lot (lots of document additions and/or deletions), we 
> almost always get a 'read past EOF' at some point, followed by lots of 'Lock 
> obtain timed out'.
> At that point we get lots of 0 kb files in the index, data gets lost, and the 
> index is unusable.
> When we stop our server, remove the 0kb files and restart our server, the 
> index is operational again, but data has been lost.
> I'm not sure if this is a zoie or a lucene issue, so I'm posting it to both. 
> Hopefully someone has some ideas where to look to fix this.
> Some more details...
> Stack trace of the read past EOF and following Lock obtain timed out:
> {code}
> 78307 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF
> java.io.IOException: read past EOF
>     at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154)
>     at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
>     at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37)
>     at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69)
>     at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245)
>     at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:166)
>     at org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725)
>     at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987)
>     at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973)
>     at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162)
>     at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003)
>     at proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203)
>     at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223)
>     at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
>     at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
>     at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
>     at proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
> 579336 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - Problem copying segments: Lock obtain timed out: org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock
>     at org.apache.lucene.store.Lock.obtain(Lock.java:84)
>     at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060)
>     at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:957)
>     at proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176)
>     at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228)
>     at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153)
>     at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134)
>     at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171)
>     at proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373)
> {code}
> We get exactly the same behaviour on both OS X and on Windows. On both, zoie 
> is using a SimpleFSDirectory.
> We also use a SingleInstanceLockFactory (since our process is the only one 
> working with the index), but we get the same behaviour with a NativeFSLock.
> The snapshot backup is being made by calling:
> *proj.zoie.impl.indexing.ZoieSystem.exportSnapshot(WritableByteChannel)*
> Same issue in zoie JIRA:
> http://snaprojects.jira.com/browse/ZOIE-51
