The ticket is GEODE-1969.

On Thu, Oct 6, 2016 at 4:01 PM, Darrel Schneider <dschnei...@pivotal.io> wrote:
> The call stack shows that Oplog.flush is calling "force." It only does
> this when the system property "gemfire.syncWrites" is set to "true." I
> think that is why you found this problem. I think all/most of our testing
> runs with this sys prop at the default of "false."
>
> Currently the code in Oplog.flush that calls "force" does it outside of
> the sync on "lock." If you move the "force" code up into this sync block,
> it should fix your problem.
>
> For example, old code:
>
>     synchronized (this.lock/* olf */) {
>       if (olf.RAFClosed) {
>         return;
>       }
>       ByteBuffer bb = olf.writeBuf;
>       if (bb != null && bb.position() != 0) {
>         bb.flip();
>         int flushed = 0;
>         do {
>           flushed += olf.channel.write(bb);
>         } while (bb.hasRemaining());
>         // update bytesFlushed after entire writeBuffer is flushed to fix bug
>         // 41201
>         olf.bytesFlushed += flushed;
>         bb.clear();
>       }
>     }
>     if (doSync) {
>       if (SYNC_WRITES) {
>         // Synch Meta Data as well as content
>         olf.channel.force(true);
>       }
>     }
>
> New code with the fix (the "force" block is moved inside the synchronized
> block):
>
>     synchronized (this.lock/* olf */) {
>       if (olf.RAFClosed) {
>         return;
>       }
>       ByteBuffer bb = olf.writeBuf;
>       if (bb != null && bb.position() != 0) {
>         bb.flip();
>         int flushed = 0;
>         do {
>           flushed += olf.channel.write(bb);
>         } while (bb.hasRemaining());
>         // update bytesFlushed after entire writeBuffer is flushed to fix bug
>         // 41201
>         olf.bytesFlushed += flushed;
>         bb.clear();
>       }
>       if (doSync) {
>         if (SYNC_WRITES) {
>           // Synch Meta Data as well as content
>           olf.channel.force(true);
>         }
>       }
>     }
>
> I will file a ticket on this. Let me know if this fix works for you. Also
> be aware that setting "gemfire.syncWrites" to "true" can really slow down
> the rate at which you can write to disk.
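The locking pattern Darrel describes is general: the flush-and-force path and the close path must hold the same monitor, so that a close cannot land between the buffered channel write and the force. The sketch below only illustrates that pattern; it is not Geode code, and GuardedWriter, its fields, and its methods are invented stand-ins for Oplog's lock, olf.RAFClosed, writeBuf, and channel.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // Minimal sketch of the fixed flush/close interaction (illustration only).
    class GuardedWriter {
        private final Object lock = new Object();      // stands in for Oplog's "lock"
        private final RandomAccessFile raf;
        private final FileChannel channel;
        private final ByteBuffer writeBuf = ByteBuffer.allocate(8192);
        private final boolean syncWrites;               // stands in for SYNC_WRITES
        private boolean closed;                         // stands in for olf.RAFClosed

        GuardedWriter(RandomAccessFile raf, boolean syncWrites) {
            this.raf = raf;
            this.channel = raf.getChannel();
            this.syncWrites = syncWrites;
        }

        // Buffer some bytes; they reach the OS only when flush() runs.
        void put(byte[] data) {
            synchronized (lock) {
                if (!closed) {
                    writeBuf.put(data);
                }
            }
        }

        void flush(boolean doSync) throws IOException {
            synchronized (lock) {
                if (closed) {
                    return;                             // file already closed; nothing to flush
                }
                writeBuf.flip();
                while (writeBuf.hasRemaining()) {
                    channel.write(writeBuf);
                }
                writeBuf.clear();
                if (doSync && syncWrites) {
                    // force() now runs under the same lock that close() takes,
                    // so the file cannot be closed between the write and the sync.
                    channel.force(true);
                }
            }
        }

        void close() throws IOException {
            synchronized (lock) {                       // same lock as flush()
                closed = true;
                raf.close();
            }
        }
    }

As Darrel notes, force() is only reached when gemfire.syncWrites is "true"; with the default of "false" the unsynchronized force was never executed, which is presumably why the default test runs never hit this window.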
>
> On Thu, Oct 6, 2016 at 2:47 PM, Kapil Goyal <goy...@vmware.com> wrote:
>
>> Hi Dan,
>>
>> We are doing put operations within transactions.
>>
>> Here is the stack trace:
>>
>> [warn 2016/09/29 06:17:16.297 UTC <Event Processor for GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher@55bded67: Exception during processing batch 448
>> com.gemstone.gemfire.internal.cache.wan.GatewaySenderException: com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher@55bded67: Exception during processing batch 448, caused by com.gemstone.gemfire.cache.CacheClosedException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
>>   at com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher.dispatchBatch(GatewaySenderEventCallbackDispatcher.java:173)
>>   at com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher.dispatchBatch(GatewaySenderEventCallbackDispatcher.java:83)
>>   at com.gemstone.gemfire.internal.cache.wan.AbstractGatewaySenderEventProcessor.processQueue(AbstractGatewaySenderEventProcessor.java:579)
>>   at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.run(SerialGatewaySenderEventProcessor.java:219)
>> Caused by: com.gemstone.gemfire.cache.CacheClosedException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
>>   at com.gemstone.gemfire.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:1299)
>>   at com.gemstone.gemfire.CancelCriterion.checkCancelInProgress(CancelCriterion.java:82)
>>   at com.gemstone.gemfire.internal.cache.TXManagerImpl.checkClosed(TXManagerImpl.java:606)
>>   at com.gemstone.gemfire.internal.cache.TXManagerImpl.begin(TXManagerImpl.java:279)
>>   at com.vmware.nsx.management.container.dao.gemfire.GemFireTxLogDao.processTxLog(GemFireTxLogDao.java:119)
>>   at com.vmware.nsx.management.container.dao.gemfire.TxLogAsyncEventListener.processEvents(TxLogAsyncEventListener.java:93)
>>   at com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher.dispatchBatch(GatewaySenderEventCallbackDispatcher.java:164)
>>   ... 3 more
>> Caused by: com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
>>   at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5235)
>>   at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5223)
>>   at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5215)
>>   at com.gemstone.gemfire.internal.cache.Oplog.basicClose(Oplog.java:3312)
>>   at com.gemstone.gemfire.internal.cache.Oplog.destroy(Oplog.java:3375)
>>   at com.gemstone.gemfire.internal.cache.Oplog.handleEmptyAndOldest(Oplog.java:5877)
>>   at com.gemstone.gemfire.internal.cache.Oplog.handleNoLiveValues(Oplog.java:5924)
>>   at com.gemstone.gemfire.internal.cache.Oplog.basicModify(Oplog.java:4751)
>>   at com.gemstone.gemfire.internal.cache.Oplog.copyForwardModifyForCompact(Oplog.java:4594)
>>   at com.gemstone.gemfire.internal.cache.Oplog.compact(Oplog.java:6087)
>>   at com.gemstone.gemfire.internal.cache.DiskStoreImpl$OplogCompactor.compact(DiskStoreImpl.java:2970)
>>   at com.gemstone.gemfire.internal.cache.DiskStoreImpl$OplogCompactor.run(DiskStoreImpl.java:3034)
>>   at com.gemstone.gemfire.internal.cache.DiskStoreImpl$6.run(DiskStoreImpl.java:4701)
>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>   at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.io.IOException: Stream Closed
>>   at java.io.RandomAccessFile.getFilePointer(Native Method)
>>   at com.gemstone.gemfire.internal.cache.persistence.UninterruptibleRandomAccessFile.getFilePointer(UninterruptibleRandomAccessFile.java:59)
>>   at com.gemstone.gemfire.internal.cache.persistence.UninterruptibleRandomAccessFile$UninterruptibleFileChannelImpl.doUninterruptibly(UninterruptibleRandomAccessFile.java:102)
>>   at com.gemstone.gemfire.internal.cache.persistence.UninterruptibleRandomAccessFile$UninterruptibleFileChannelImpl.force(UninterruptibleRandomAccessFile.java:226)
>>   at com.gemstone.gemfire.internal.cache.Oplog.flush(Oplog.java:5202)
>>   at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5232)
>>   ... 17 more
>>
>> Thanks
>> Kapil
>>
>> -----Original Message-----
>> From: Dan Smith [mailto:dsm...@pivotal.io]
>> Sent: Tuesday, October 04, 2016 10:08 AM
>> To: user@geode.incubator.apache.org
>> Subject: Re: OplogCompactor race issue?
>>
>> Hi Kapil,
>>
>> I haven't seen this particular issue before. Do you have any stack traces
>> of the exception from the OplogCompactor? What sort of operations are you
>> doing? puts?
>>
>> -Dan
>>
>> On Tue, Oct 4, 2016 at 1:16 AM, Kapil Goyal <goy...@vmware.com> wrote:
>> > Hi All,
>> >
>> > We have been testing a single cache node with a lot of data recently
>> > and frequently run into this error:
>> >
>> > [info 2016/09/29 06:16:06.823 UTC <OplogCompactor nsxDiskStore for oplog oplog#6> tid=0x19] OplogCompactor for nsxDiskStore compaction oplog id(s): oplog#6
>> > [info 2016/09/29 06:16:08.232 UTC <OplogCompactor nsxDiskStore for oplog oplog#6> tid=0x19] compaction did 6,310 creates and updates in 1,408 ms
>> > [info 2016/09/29 06:16:08.248 UTC <Oplog Delete Task> tid=0x19] Deleted oplog#6 crf for disk store nsxDiskStore.
>> > [info 2016/09/29 06:16:08.256 UTC <Oplog Delete Task> tid=0x19] Deleted oplog#6 krf for disk store nsxDiskStore.
>> > [info 2016/09/29 06:16:08.256 UTC <Oplog Delete Task> tid=0x19] Deleted oplog#6 drf for disk store nsxDiskStore.
>> > [info 2016/09/29 06:17:03.887 UTC <Event Processor for GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] Created oplog#8 drf for disk store nsxDiskStore.
>> > [info 2016/09/29 06:17:03.911 UTC <Event Processor for GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] Created oplog#8 crf for disk store nsxDiskStore.
>> > [info 2016/09/29 06:17:04.031 UTC <Idle OplogCompactor> tid=0x19] Created oplog#7 krf for disk store nsxDiskStore.
>> > [info 2016/09/29 06:17:04.314 UTC <OplogCompactor nsxDiskStore for oplog oplog#7> tid=0x19] OplogCompactor for nsxDiskStore compaction oplog id(s): oplog#7
>> > [error 2016/09/29 06:17:16.075 UTC <OplogCompactor nsxDiskStore for oplog oplog#7> tid=0x19] A DiskAccessException has occurred while writing to the disk for disk store nsxDiskStore. The cache will be closed.
>> > com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
>> >   at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5235)
>> >
>> > From the logs it appears there may be a race between the threads "Idle
>> > OplogCompactor" and "OplogCompactor nsxDiskStore for oplog oplog#7". I
>> > see that both are doing operations related to oplog#7. The former logs
>> > the creation of a KRF file, while the latter is trying to access either
>> > the DRF or the CRF file. Is it possible that "Idle OplogCompactor"
>> > closed the DRF/CRF files for oplog#7 as part of creating the KRF for the
>> > same oplog? This is what the GemFire docs say about it:
>> >
>> > "After the oplog is closed, GemFire also attempts to create a krf
>> > file, which contains the key names as well as the offset for the value
>> > within the crf file."
>> >
>> > Based on the above, it's possible that oplog#7 was already closed and
>> > its KRF already created when the compactor tried to access the files.
>> >
>> > Have any of you run into this error before? Any suggestions?
>> >
>> > Thanks
>> > Kapil
>> >
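A footnote on the "Stream Closed" message itself: the trace bottoms out in RandomAccessFile.getFilePointer, which UninterruptibleRandomAccessFile routes force() through, and a RandomAccessFile throws java.io.IOException: Stream Closed from such calls once it has been closed. The toy program below (the file and class names are made up for the demonstration) reproduces that message without any Geode involvement, which is consistent with the theory that oplog#7's file had already been closed by another thread when the compactor's flush reached it.

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Standalone demonstration of the failure mode at the bottom of the trace.
    public class StreamClosedDemo {
        public static void main(String[] args) throws Exception {
            File f = File.createTempFile("oplog-demo", ".crf");
            f.deleteOnExit();

            RandomAccessFile raf = new RandomAccessFile(f, "rw");
            raf.write(new byte[] {1, 2, 3});

            // Stands in for the thread that closes/deletes the compacted oplog.
            raf.close();

            try {
                // Stands in for the flush/force path touching the already-closed file.
                raf.getFilePointer();
            } catch (IOException e) {
                // Typically prints: java.io.IOException: Stream Closed
                System.out.println(e);
            }
        }
    }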