The ticket is GEODE-1969.

On Thu, Oct 6, 2016 at 4:01 PM, Darrel Schneider <dschnei...@pivotal.io> wrote:
> The call stack shows that Oplog.flush is calling "force." It only does
> this when the system property "gemfire.syncWrites" is set to "true." I
> think that is why you found this problem. I think all/most of our testing
> runs with this sys prop at the default of "false."
>
> Currently the code in Oplog.flush that calls "force" does it outside of
> the sync on "lock." If you move the "force" code up into this sync block,
> it should fix your problem.
>
> For example, old code:
>
>     synchronized (this.lock/* olf */) {
>       if (olf.RAFClosed) {
>         return;
>       }
>       ByteBuffer bb = olf.writeBuf;
>       if (bb != null && bb.position() != 0) {
>         bb.flip();
>         int flushed = 0;
>         do {
>           flushed += olf.channel.write(bb);
>         } while (bb.hasRemaining());
>         // update bytesFlushed after entire writeBuffer is flushed to fix bug
>         // 41201
>         olf.bytesFlushed += flushed;
>         bb.clear();
>       }
>     }
>     if (doSync) {
>       if (SYNC_WRITES) {
>         // Synch Meta Data as well as content
>         olf.channel.force(true);
>       }
>     }
>
> New code with the fix (the "force" block is moved inside the synchronized
> block):
>
>     synchronized (this.lock/* olf */) {
>       if (olf.RAFClosed) {
>         return;
>       }
>       ByteBuffer bb = olf.writeBuf;
>       if (bb != null && bb.position() != 0) {
>         bb.flip();
>         int flushed = 0;
>         do {
>           flushed += olf.channel.write(bb);
>         } while (bb.hasRemaining());
>         // update bytesFlushed after entire writeBuffer is flushed to fix bug
>         // 41201
>         olf.bytesFlushed += flushed;
>         bb.clear();
>       }
>       if (doSync) {
>         if (SYNC_WRITES) {
>           // Synch Meta Data as well as content
>           olf.channel.force(true);
>         }
>       }
>     }
>
> I will file a ticket on this. Let me know if this fix works for you. Also
> be aware that setting "gemfire.syncWrites" to "true" can really slow down
> the rate at which you can write to disk.
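The locking pattern Darrel describes is general: the flush-and-force path and the close path must hold the same monitor, so that a close cannot land between the buffered channel write and the force. The sketch below only illustrates that pattern; it is not Geode code, and GuardedWriter, its fields, and its methods are invented stand-ins for Oplog's lock, olf.RAFClosed, writeBuf, and channel.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // Minimal sketch of the fixed flush/close interaction (illustration only).
    class GuardedWriter {
        private final Object lock = new Object();      // stands in for Oplog's "lock"
        private final RandomAccessFile raf;
        private final FileChannel channel;
        private final ByteBuffer writeBuf = ByteBuffer.allocate(8192);
        private final boolean syncWrites;               // stands in for SYNC_WRITES
        private boolean closed;                         // stands in for olf.RAFClosed

        GuardedWriter(RandomAccessFile raf, boolean syncWrites) {
            this.raf = raf;
            this.channel = raf.getChannel();
            this.syncWrites = syncWrites;
        }

        // Buffer some bytes; they reach the OS only when flush() runs.
        void put(byte[] data) {
            synchronized (lock) {
                if (!closed) {
                    writeBuf.put(data);
                }
            }
        }

        void flush(boolean doSync) throws IOException {
            synchronized (lock) {
                if (closed) {
                    return;                             // file already closed; nothing to flush
                }
                writeBuf.flip();
                while (writeBuf.hasRemaining()) {
                    channel.write(writeBuf);
                }
                writeBuf.clear();
                if (doSync && syncWrites) {
                    // force() now runs under the same lock that close() takes,
                    // so the file cannot be closed between the write and the sync.
                    channel.force(true);
                }
            }
        }

        void close() throws IOException {
            synchronized (lock) {                       // same lock as flush()
                closed = true;
                raf.close();
            }
        }
    }

As Darrel notes, force() is only reached when gemfire.syncWrites is "true"; with the default of "false" the unsynchronized force was never executed, which is presumably why the default test runs never hit this window.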
>
> On Thu, Oct 6, 2016 at 2:47 PM, Kapil Goyal <goy...@vmware.com> wrote:
>
>> Hi Dan,
>>
>> We are doing put operations within transactions.
>>
>> Here is the stack trace:
>>
>> [warn 2016/09/29 06:17:16.297 UTC <Event Processor for GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher@55bded67: Exception during processing batch 448
>> com.gemstone.gemfire.internal.cache.wan.GatewaySenderException: com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher@55bded67: Exception during processing batch 448, caused by com.gemstone.gemfire.cache.CacheClosedException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
>>   at com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher.dispatchBatch(GatewaySenderEventCallbackDispatcher.java:173)
>>   at com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher.dispatchBatch(GatewaySenderEventCallbackDispatcher.java:83)
>>   at com.gemstone.gemfire.internal.cache.wan.AbstractGatewaySenderEventProcessor.processQueue(AbstractGatewaySenderEventProcessor.java:579)
>>   at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.run(SerialGatewaySenderEventProcessor.java:219)
>> Caused by: com.gemstone.gemfire.cache.CacheClosedException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
>>   at com.gemstone.gemfire.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:1299)
>>   at com.gemstone.gemfire.CancelCriterion.checkCancelInProgress(CancelCriterion.java:82)
>>   at com.gemstone.gemfire.internal.cache.TXManagerImpl.checkClosed(TXManagerImpl.java:606)
>>   at com.gemstone.gemfire.internal.cache.TXManagerImpl.begin(TXManagerImpl.java:279)
>>   at com.vmware.nsx.management.container.dao.gemfire.GemFireTxLogDao.processTxLog(GemFireTxLogDao.java:119)
>>   at com.vmware.nsx.management.container.dao.gemfire.TxLogAsyncEventListener.processEvents(TxLogAsyncEventListener.java:93)
>>   at com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher.dispatchBatch(GatewaySenderEventCallbackDispatcher.java:164)
>>   ... 3 more
>> Caused by: com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
>>   at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5235)
>>   at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5223)
>>   at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5215)
>>   at com.gemstone.gemfire.internal.cache.Oplog.basicClose(Oplog.java:3312)
>>   at com.gemstone.gemfire.internal.cache.Oplog.destroy(Oplog.java:3375)
>>   at com.gemstone.gemfire.internal.cache.Oplog.handleEmptyAndOldest(Oplog.java:5877)
>>   at com.gemstone.gemfire.internal.cache.Oplog.handleNoLiveValues(Oplog.java:5924)
>>   at com.gemstone.gemfire.internal.cache.Oplog.basicModify(Oplog.java:4751)
>>   at com.gemstone.gemfire.internal.cache.Oplog.copyForwardModifyForCompact(Oplog.java:4594)
>>   at com.gemstone.gemfire.internal.cache.Oplog.compact(Oplog.java:6087)
>>   at com.gemstone.gemfire.internal.cache.DiskStoreImpl$OplogCompactor.compact(DiskStoreImpl.java:2970)
>>   at com.gemstone.gemfire.internal.cache.DiskStoreImpl$OplogCompactor.run(DiskStoreImpl.java:3034)
>>   at com.gemstone.gemfire.internal.cache.DiskStoreImpl$6.run(DiskStoreImpl.java:4701)
>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>   at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.io.IOException: Stream Closed
>>   at java.io.RandomAccessFile.getFilePointer(Native Method)
>>   at com.gemstone.gemfire.internal.cache.persistence.UninterruptibleRandomAccessFile.getFilePointer(UninterruptibleRandomAccessFile.java:59)
>>   at com.gemstone.gemfire.internal.cache.persistence.UninterruptibleRandomAccessFile$UninterruptibleFileChannelImpl.doUninterruptibly(UninterruptibleRandomAccessFile.java:102)
>>   at com.gemstone.gemfire.internal.cache.persistence.UninterruptibleRandomAccessFile$UninterruptibleFileChannelImpl.force(UninterruptibleRandomAccessFile.java:226)
>>   at com.gemstone.gemfire.internal.cache.Oplog.flush(Oplog.java:5202)
>>   at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5232)
>>   ... 17 more
>>
>> Thanks
>> Kapil
>>
>> -----Original Message-----
>> From: Dan Smith [mailto:dsm...@pivotal.io]
>> Sent: Tuesday, October 04, 2016 10:08 AM
>> To: user@geode.incubator.apache.org
>> Subject: Re: OplogCompactor race issue?
>>
>> Hi Kapil,
>>
>> I haven't seen this particular issue before. Do you have any stack traces
>> of the exception from the OplogCompactor? What sort of operations are you
>> doing? puts?
>>
>> -Dan
>>
>> On Tue, Oct 4, 2016 at 1:16 AM, Kapil Goyal <goy...@vmware.com> wrote:
>> > Hi All,
>> >
>> > We have been testing a single cache node with a lot of data recently
>> > and frequently run into this error:
>> >
>> > [info 2016/09/29 06:16:06.823 UTC <OplogCompactor nsxDiskStore for oplog oplog#6> tid=0x19] OplogCompactor for nsxDiskStore compaction oplog id(s): oplog#6
>> > [info 2016/09/29 06:16:08.232 UTC <OplogCompactor nsxDiskStore for oplog oplog#6> tid=0x19] compaction did 6,310 creates and updates in 1,408 ms
>> > [info 2016/09/29 06:16:08.248 UTC <Oplog Delete Task> tid=0x19] Deleted oplog#6 crf for disk store nsxDiskStore.
>> > [info 2016/09/29 06:16:08.256 UTC <Oplog Delete Task> tid=0x19] Deleted oplog#6 krf for disk store nsxDiskStore.
>> > [info 2016/09/29 06:16:08.256 UTC <Oplog Delete Task> tid=0x19] Deleted oplog#6 drf for disk store nsxDiskStore.
>> > [info 2016/09/29 06:17:03.887 UTC <Event Processor for GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] Created oplog#8 drf for disk store nsxDiskStore.
>> > [info 2016/09/29 06:17:03.911 UTC <Event Processor for GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] Created oplog#8 crf for disk store nsxDiskStore.
>> > [info 2016/09/29 06:17:04.031 UTC <Idle OplogCompactor> tid=0x19] Created oplog#7 krf for disk store nsxDiskStore.
>> > [info 2016/09/29 06:17:04.314 UTC <OplogCompactor nsxDiskStore for oplog oplog#7> tid=0x19] OplogCompactor for nsxDiskStore compaction oplog id(s): oplog#7
>> > [error 2016/09/29 06:17:16.075 UTC <OplogCompactor nsxDiskStore for oplog oplog#7> tid=0x19] A DiskAccessException has occurred while writing to the disk for disk store nsxDiskStore. The cache will be closed.
>> > com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
>> >   at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5235)
>> >
>> > From the logs it appears there may be a race between the threads "Idle
>> > OplogCompactor" and "OplogCompactor nsxDiskStore for oplog oplog#7". I
>> > see that both are doing operations related to oplog#7. The former logs
>> > the creation of a KRF file, while the latter is trying to access either
>> > the DRF or the CRF file. Is it possible that "Idle OplogCompactor"
>> > closed the DRF/CRF files for oplog#7 as part of creating the KRF for the
>> > same oplog? This is what the GemFire docs say about it:
>> >
>> > "After the oplog is closed, GemFire also attempts to create a krf
>> > file, which contains the key names as well as the offset for the value
>> > within the crf file."
>> >
>> > Based on the above, it's possible that oplog#7 was already closed and
>> > its KRF already created when the compactor tried to access the files.
>> >
>> > Have any of you run into this error before? Any suggestions?
>> >
>> > Thanks
>> > Kapil
>> >
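A footnote on the "Stream Closed" message itself: the trace bottoms out in RandomAccessFile.getFilePointer, which UninterruptibleRandomAccessFile routes force() through, and a RandomAccessFile throws java.io.IOException: Stream Closed from such calls once it has been closed. The toy program below (the file and class names are made up for the demonstration) reproduces that message without any Geode involvement, which is consistent with the theory that oplog#7's file had already been closed by another thread when the compactor's flush reached it.

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Standalone demonstration of the failure mode at the bottom of the trace.
    public class StreamClosedDemo {
        public static void main(String[] args) throws Exception {
            File f = File.createTempFile("oplog-demo", ".crf");
            f.deleteOnExit();

            RandomAccessFile raf = new RandomAccessFile(f, "rw");
            raf.write(new byte[] {1, 2, 3});

            // Stands in for the thread that closes/deletes the compacted oplog.
            raf.close();

            try {
                // Stands in for the flush/force path touching the already-closed file.
                raf.getFilePointer();
            } catch (IOException e) {
                // Typically prints: java.io.IOException: Stream Closed
                System.out.println(e);
            }
        }
    }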