Hi Sergey! Thank you for providing details. Have I understood correctly that you run newer version of Ignite on older persistent files? Is there any possibility that some data in your caches survived that 5 days of cluster work? I'm just trying to exclude any compatibility problems, like https://issues.apache.org/jira/browse/IGNITE-14252
Denis Chudov Software Engineer, Moscow +7 905 5775239 https://www.gridgain.com Powered by Apache® Ignite™ On Tue, Nov 23, 2021 at 8:32 AM Sergey Korotkov <serge.korot...@gmail.com> wrote: > Hello Denis, > > Yes, as I said in the original message we do use the expiration on > persistent caches. > > The corruptedPages_2021-11-08_11-03-21_999.txt and > corruptedPages_2021-11-09_12-43-12_449.txt files were generated by Ignite > on crash. They show that two different caches were affected. The first one > during the expiration and the second (next day) during rebalance eviction. > Both caches are persistent and use the expiration. > > I also run the diagnostic utility (IgniteWalConverter) the way it is > recommended in the error message (output attached as diag-* files). > > Is there any usefull information in these diag-* files which can help to > understand what and how was corruped in particular? > > *** > > Generally this was a test run of new 2.11.0 version in test environment. A > goal was to check if new version works fine with out application and also > can be safely stopped/started for maintenance. We do that since run into > the similar problem with 'B+Tree is corrupted' on production during > eviction rebalance (with 2.8.1). We see two similar issues fixed in > 2.11.0: https://issues.apache.org/jira/browse/IGNITE-12489 and > https://issues.apache.org/jira/browse/IGNITE-14093 and consider upgrade > if it would help. By the way fix of the IGNITE-12489 ( > https://github.com/apache/ignite/pull/8358/commits) contains a lot > changes with several attempts. May be it just fixes not all situations? > > Before the deactivation cluster works fine under our usual load about 5 > days. Load is about 300 requests per second each consists of several reads > and single write to caches with the expiration turned on. After that we > stop / start the cluster to emulate the situation we had on production with > 2.8.1 (load from our application was stopped as well before the > deactivation request). > > *** > > Caches configurations. The first one has an affinity key and interceptor > > public CacheConfiguration<ContactHistoryKey, ContactHistory> > getEdenContactHistoryCacheConfiguration() { > CacheConfiguration<ContactHistoryKey, ContactHistory> > cacheConfiguration = new CacheConfiguration<>(); > cacheConfiguration.setCacheMode(CacheMode.PARTITIONED); > cacheConfiguration.setAffinity(new RendezvousAffinityFunction(false, > 1024)); > cacheConfiguration.setBackups(1); > cacheConfiguration.setAtomicityMode(CacheAtomicityMode.ATOMIC); > int expirationDays = appConfig.getContactHistoryEdenExpirationDays(); > cacheConfiguration > .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new > Duration(TimeUnit.DAYS, expirationDays))); > cacheConfiguration.setInterceptor(new ContactHistoryInterceptor()); > return cacheConfiguration; > } > > public class ContactHistoryKey { > String sOfferId; > > @AffinityKeyMapped > String subsIdAffinityKey; > } > > CacheConfiguration<String, String> getChannelOfferIdCache() { > CacheConfiguration<String, String> cacheConfiguration = new > CacheConfiguration<>(); > cacheConfiguration.setCacheMode(CacheMode.PARTITIONED); > cacheConfiguration.setAffinity(new RendezvousAffinityFunction(false, > 1024)); > cacheConfiguration.setBackups(1); > cacheConfiguration.setAtomicityMode(CacheAtomicityMode.ATOMIC); > int expirationDays = appConfig.getChannelOfferIdCacheExpirationDays(); > cacheConfiguration > .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new > Duration(TimeUnit.DAYS, expirationDays))); > return cacheConfiguration; > } > > *** > > As for the other details. Not sure is it relevant or not. Deactivation > was relatevily long and log contains a lot of warnings between 2021-11-08 > 10:54:44 and 2021-11-08 10:59:33. Also there was a page locks dump at > 10:56:47,567. A lot of locks were logged for cache in the > Batch_Campaigns_Region (region without the persistence). > > [2021-11-08 > 10:56:47,560][WARN][page-lock-tracker-timeout][CacheDiagnosticManager]{} > Threads hanged: [(sys-#99087-105322, WAITING)] > [2021-11-08 > 10:56:47,567][WARN][page-lock-tracker-timeout][CacheDiagnosticManager]{} > Page locks dump: > > ................ > > Thread=[name=sys-#99087, id=105322], state=WAITING > Log overflow, size:512, headIdx=512 [structureId=7, > pageIdpageId=281474976790149 [pageIdHex=0001000000013685, partId=0, > pageIdx=79493, flags=00000001]] > Locked pages = > [844420635168174[0002ffff00000dae](r=0|w=1),844420635168175[0002ffff00000daf](r=0|w=1)] > Locked pages log: name=sys-#99087 time=(1636358127247, 2021-11-08 > 07:55:27.247) > L=1 -> Write lock pageId=844420635168174, > structureId=batch_campaign_results-p-0##CacheData > [pageIdHex=0002ffff00000dae, partId=65535, pageIdx=3502, flags=00000002] > L=2 -> Write lock pageId=844420635168175, > structureId=batch_campaign_results-p-0##CacheData > [pageIdHex=0002ffff00000daf, partId=65535, pageIdx=3503, flags=00000002] > L=3 -> Write lock pageId=281474976716639, > structureId=Batch_Campaigns_Region##FreeList [pageIdHex=000100000000175f, > partId=0, pageIdx=5983, flags=00000001] > L=4 -> Write lock pageId=844420635243234, > structureId=Batch_Campaigns_Region##FreeList [pageIdHex=0002ffff000132e2, > partId=65535, pageIdx=78562, flags=00000002] > > a lot more ... > > ...................... > > Can it cause that something say not was checkpointed to the disk correctly? > > Thanks, > > -- > > Sergey > > > 22.11.2021 23:18, Denis Chudov пишет: > > Hi Sergey! > > Seems that the root cause of your problem is possible corruption of data > page > > Caused by: java.lang.IllegalStateException: Item not found: 3 > at > org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476) > > and it happened twice, not during expiration cleanup, but also during > rebalance eviction. But your scenario looks quite usual and couldn't cause > such problems. > > I've found similar existing ticket > https://issues.apache.org/jira/browse/IGNITE-11400 > but assignee couldn't reproduce the error. > > Could you please provide more information about what had happened on your > cluster before and about cache configuration? Do you use expiration on > persistent caches? Any details can be useful for the problem localization. > > Denis Chudov > Software Engineer, Moscow > +7 905 5775239 > https://www.gridgain.com > Powered by Apache® Ignite™ > > > On Thu, Nov 18, 2021 at 2:25 PM Sergey Korotkov <serge.korot...@gmail.com> > wrote: > >> Hello, >> >> We have troubles with the CorruptedTreeException: B+Tree is corrupted >> during the node start after cluster restart. Looks like the caches with >> the Expiry Policy configured are source of the problems. >> >> I have attached the log from the problem node. The exact steps with >> timestamps are as folows. Before the deactivation cluster works fine >> about 5 days >> >> 2021-11-08 10:54:44 cluster deactivate request >> >> 2021-11-08 10:59:33 cluster deactivated >> >> 2021-11-08 11:02:30 stop all nodes >> >> 2021-11-08 11:02:39 start all nodes >> >> 2021-11-08 11:03:14 auto-activation start >> >> 2021-11-08 11:03:16 cluster activated >> >> 2021-11-08 11:03:21 'B+Tree is corrupted' exception in >> GridCacheTtlManager.expire() on one of the nodes (see the >> 10.12.86.29-ignite-2021-11-08.0.log): >> >> [2021-11-08 11:03:21,820][ERROR][ttl-cleanup-worker-#215][ROOT]{} >> Critical system error detected. Will be handled accordingly to >> configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, >> timeout=0, super=AbstractFailureHandler >> [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, >> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext >> [type=CRITICAL_ERROR, err=class >> o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree >> is corrupted [pages(groupId, pageId)=[], msg=Runtime failure on bounds: >> [lower=null, upper=PendingRow []]]]] >> >> org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: >> B+Tree is corrupted [pages(groupId, pageId)=[], msg=Runtime failure on >> bounds: [lower=null, upper=PendingRow []]] >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6139) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1133) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1100) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1095) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:3076) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:3023) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:1255) >> at >> >> org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:246) >> at >> >> org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.lambda$body$0(GridCacheSharedTtlCleanupManager.java:193) >> at >> >> java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1769) >> at >> >> org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:192) >> at >> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) >> at java.lang.Thread.run(Thread.java:748) >> Caused by: >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException: >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException: >> java.lang.IllegalStateException: Item not found: 3 >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1079) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1118) >> ... 11 common frames omitted >> Caused by: >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException: >> java.lang.IllegalStateException: Item not found: 3 >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:345) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:165) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:136) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:123) >> at >> >> org.apache.ignite.internal.processors.cache.tree.PendingRow.initKey(PendingRow.java:73) >> at >> >> org.apache.ignite.internal.processors.cache.tree.PendingEntriesTree.getRow(PendingEntriesTree.java:127) >> at >> >> org.apache.ignite.internal.processors.cache.tree.PendingEntriesTree.getRow(PendingEntriesTree.java:32) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$ForwardCursor.fillFromBuffer0(BPlusTree.java:5820) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.fillFromBuffer(BPlusTree.java:5586) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.init(BPlusTree.java:5512) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1068) >> ... 12 common frames omitted >> Caused by: java.lang.IllegalStateException: Item not found: 3 >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.getDataOffset(AbstractDataPageIO.java:584) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.readPayload(AbstractDataPageIO.java:626) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.readIncomplete(CacheDataRowAdapter.java:380) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:316) >> ... 22 common frames omitted >> >> >> Error message suggests to run the >> org.apache.ignite.development.utis.IgniteWalConverter to diagnose the >> problem. I attached the output of this utility: >> >> - corruptedPages_2021-11-08_11-03-21_999.txt - file created by ignite >> on crash >> >> - diag-2021-11-08.txt - output of the diagnostic utility. >> >> >> Next day we try to start this node again and it still fail with 'B+Tree >> is corrupted' but in different place: >> >> 2021-11-09 12:37:34 'B+Tree is corrupted' exception in >> PartitionsEvictManager$PartitionEvictionTask.run() (see the >> 10.12.86.29-ignite-2021-11-09.0.log) >> >> >> [2021-11-09 12:43:10,857][ERROR][rebalance-#344][ROOT]{} Critical system >> error detected. Will be handled accordingly to configured handler >> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, >> super=AbstractFailu >> reHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, >> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext >> [type=CRITICAL_ERROR, err=class >> o.a.i.i.processors.cache.persistence.tree.CorruptedT >> reeException: B+Tree is corrupted [pages(groupId, pageId)=[], >> msg=Runtime failure on bounds: [lower=null, upper=null]]]] >> >> org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: >> B+Tree is corrupted [pages(groupId, pageId)=[], msg=Runtime failure on >> bounds: [lower=null, upper=null]] >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6139) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1133) >> at >> >> org.apache.ignite.internal.processors.cache.tree.CacheDataTree.find(CacheDataTree.java:167) >> at >> >> org.apache.ignite.internal.processors.cache.tree.CacheDataTree.find(CacheDataTree.java:63) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1100) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1095) >> at >> >> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.cursor(IgniteCacheOffheapManagerImpl.java:2914) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.cursor(GridCacheOffheapManager.java:2856) >> at >> >> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$3.onHasNext(IgniteCacheOffheapManagerImpl.java:938) >> at >> >> org.apache.ignite.internal.util.GridCloseableIteratorAdapter.hasNextX(GridCloseableIteratorAdapter.java:53) >> at >> >> org.apache.ignite.internal.util.lang.GridIteratorAdapter.hasNext(GridIteratorAdapter.java:45) >> at >> >> org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtLocalPartition.clearAll(GridDhtLocalPartition.java:987) >> at >> >> org.apache.ignite.internal.processors.cache.distributed.dht.topology.PartitionsEvictManager$PartitionEvictionTask.run(PartitionsEvictManager.java:409) >> at >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) >> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >> at >> >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) >> at >> >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) >> at java.lang.Thread.run(Thread.java:748) >> Caused by: >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException: >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException: >> java.lang.IllegalStateException: Item not found: 16 >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1079) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1118) >> ... 16 common frames omitted >> Caused by: >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException: >> java.lang.IllegalStateException: Item not found: 16 >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:345) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:165) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:136) >> at >> >> org.apache.ignite.internal.processors.cache.tree.DataRow.<init>(DataRow.java:55) >> at >> >> org.apache.ignite.internal.processors.cache.tree.CacheDataRowStore.dataRow(CacheDataRowStore.java:129) >> at >> >> org.apache.ignite.internal.processors.cache.tree.CacheDataTree.getRow(CacheDataTree.java:422) >> at >> >> org.apache.ignite.internal.processors.cache.tree.CacheDataTree.getRow(CacheDataTree.java:63) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$ForwardCursor.fillFromBuffer0(BPlusTree.java:5820) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.fillFromBuffer(BPlusTree.java:5586) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.init(BPlusTree.java:5512) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1068) >> ... 17 common frames omitted >> Caused by: java.lang.IllegalStateException: Item not found: 16 >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.getDataOffset(AbstractDataPageIO.java:584) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.readPayload(AbstractDataPageIO.java:626) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.readIncomplete(CacheDataRowAdapter.java:380) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:316) >> >> >> I also attached the corrupted pages file and diagnostic output for this >> case as well >> >> - corruptedPages_2021-11-09_12-43-12_449.txt >> >> - diag-2021-11-09.txt >> >> >> In both cases pages are for the caches with the Expire Policy configured. >> >> >> **** >> >> What can be done about that? Is there any recommended way to stop/start >> ignite cluster to prevent such data loss problems? >> >> **** >> >> I see some similar fixed issues in Jira like >> https://issues.apache.org/jira/browse/IGNITE-12489 or >> https://issues.apache.org/jira/browse/IGNITE-14093 but looks like >> something still not working in 2.11.0. >> >> >> Thanks, >> >> -- >> >> Sergey >> >> >> >>