Hello, Zhenya 26.11.2021 12:32, Zhenya Stanilovsky пишет: > probably this is the > case: https://issues.apache.org/jira/browse/IGNITE-15990 > <https://issues.apache.org/jira/browse/IGNITE-15990> >
Thanks for pointing! I also noticed this one. I investigate more and suspect that it can be the same issue (concurrent inserts and deletes on ttl) but on the another level. Issue above is about the corrupted B+ Tree structure as such. But in our case the problem is in the data page as such. The call stack and error message (Item not found: 3) shows that the data page doesn't contain the "indirect" item with id = 3 in the items array. May be it is really missed or may be the invariant that the inderect items are stored in sorted order is broken so binary search in AbstractDataPageIO.findIndirectItemIndex() fails. Thanks, - - Sergey > > > > > Hello Denis, > > Yes, as I said in the original message we do use the expiration on > persistent caches. > > The corruptedPages_2021-11-08_11-03-21_999.txt and > corruptedPages_2021-11-09_12-43-12_449.txt files were generated by > Ignite on crash. They show that two different caches were > affected. The first one during the expiration and the second (next > day) during rebalance eviction. Both caches are persistent and > use the expiration. > > I also run the diagnostic utility (IgniteWalConverter) the way it > is recommended in the error message (output attached as diag-* > files). > > Is there any usefull information in these diag-* files which can > help to understand what and how was corruped in particular? > > *** > > Generally this was a test run of new 2.11.0 version in test > environment. A goal was to check if new version works fine with > out application and also can be safely stopped/started for > maintenance. We do that since run into the similar problem with > 'B+Tree is corrupted' on production during eviction rebalance > (with 2.8.1). We see two similar issues fixed in 2.11.0: > https://issues.apache.org/jira/browse/IGNITE-12489 > <https://issues.apache.org/jira/browse/IGNITE-12489> and > https://issues.apache.org/jira/browse/IGNITE-14093 > <https://issues.apache.org/jira/browse/IGNITE-14093> and consider > upgrade if it would help. By the way fix of the IGNITE-12489 > (https://github.com/apache/ignite/pull/8358/commits) contains a > lot changes with several attempts. May be it just fixes not all > situations? > > Before the deactivation cluster works fine under our usual load > about 5 days. Load is about 300 requests per second each consists > of several reads and single write to caches with the expiration > turned on. After that we stop / start the cluster to emulate the > situation we had on production with 2.8.1 (load from our > application was stopped as well before the deactivation request). > > *** > > Caches configurations. The first one has an affinity key and > interceptor > > public CacheConfiguration<ContactHistoryKey, ContactHistory> > getEdenContactHistoryCacheConfiguration() { > CacheConfiguration<ContactHistoryKey, ContactHistory> > cacheConfiguration = new CacheConfiguration<>(); > cacheConfiguration.setCacheMode(CacheMode.PARTITIONED); > cacheConfiguration.setAffinity(new > RendezvousAffinityFunction(false, 1024)); > cacheConfiguration.setBackups(1); > cacheConfiguration.setAtomicityMode(CacheAtomicityMode.ATOMIC); > int expirationDays = > appConfig.getContactHistoryEdenExpirationDays(); > cacheConfiguration > .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new > Duration(TimeUnit.DAYS, expirationDays))); > cacheConfiguration.setInterceptor(new > ContactHistoryInterceptor()); > return cacheConfiguration; > } > > public class ContactHistoryKey { > String sOfferId; > > @AffinityKeyMapped > String subsIdAffinityKey; > } > > CacheConfiguration<String, String> getChannelOfferIdCache() { > CacheConfiguration<String, String> cacheConfiguration = new > CacheConfiguration<>(); > cacheConfiguration.setCacheMode(CacheMode.PARTITIONED); > cacheConfiguration.setAffinity(new > RendezvousAffinityFunction(false, 1024)); > cacheConfiguration.setBackups(1); > cacheConfiguration.setAtomicityMode(CacheAtomicityMode.ATOMIC); > int expirationDays = > appConfig.getChannelOfferIdCacheExpirationDays(); > cacheConfiguration > .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new > Duration(TimeUnit.DAYS, expirationDays))); > return cacheConfiguration; > } > > *** > > As for the other details. Not sure is it relevant or not. > Deactivation was relatevily long and log contains a lot of > warnings between 2021-11-08 10:54:44 and 2021-11-08 10:59:33. > Also there was a page locks dump at 10:56:47,567. A lot of locks > were logged for cache in the Batch_Campaigns_Region (region > without the persistence). > > [2021-11-08 > 10:56:47,560][WARN][page-lock-tracker-timeout][CacheDiagnosticManager]{} > Threads hanged: [(sys-#99087-105322, WAITING)] > [2021-11-08 > 10:56:47,567][WARN][page-lock-tracker-timeout][CacheDiagnosticManager]{} > Page locks dump: > > ................ > > Thread=[name=sys-#99087, id=105322], state=WAITING > Log overflow, size:512, headIdx=512 [structureId=7, > pageIdpageId=281474976790149 [pageIdHex=0001000000013685, > partId=0, pageIdx=79493, flags=00000001]] > Locked pages = > > [844420635168174[0002ffff00000dae](r=0|w=1),844420635168175[0002ffff00000daf](r=0|w=1)] > Locked pages log: name=sys-#99087 time=(1636358127247, 2021-11-08 > 07:55:27.247) > L=1 -> Write lock pageId=844420635168174, > structureId=batch_campaign_results-p-0##CacheData > [pageIdHex=0002ffff00000dae, partId=65535, pageIdx=3502, > flags=00000002] > L=2 -> Write lock pageId=844420635168175, > structureId=batch_campaign_results-p-0##CacheData > [pageIdHex=0002ffff00000daf, partId=65535, pageIdx=3503, > flags=00000002] > L=3 -> Write lock pageId=281474976716639, > structureId=Batch_Campaigns_Region##FreeList > [pageIdHex=000100000000175f, partId=0, pageIdx=5983, flags=00000001] > L=4 -> Write lock pageId=844420635243234, > structureId=Batch_Campaigns_Region##FreeList > [pageIdHex=0002ffff000132e2, partId=65535, pageIdx=78562, > flags=00000002] > > a lot more ... > > ...................... > > Can it cause that something say not was checkpointed to the disk > correctly? > > Thanks, > > -- > > Sergey > > > > 22.11.2021 23:18, Denis Chudov пишет: >> Hi Sergey! >> >> Seems that the root cause of your problem is possible corruption >> of data page >> >> Caused by: java.lang.IllegalStateException: Item not found: 3 >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476) >> >> and it happened twice, not during expiration cleanup, but also >> during rebalance eviction. But your scenario looks quite usual >> and couldn't cause such problems. >> >> I've found similar existing >> ticket https://issues.apache.org/jira/browse/IGNITE-11400 >> <https://issues.apache.org/jira/browse/IGNITE-11400> >> but assignee couldn't reproduce the error. >> >> Could you please provide more information about what had happened >> on your cluster before and about cache configuration? Do you use >> expiration on persistent caches? Any details can be useful for >> the problem localization. >> >> Denis Chudov >> Software Engineer, Moscow >> +7 905 5775239 >> https://www.gridgain.com <https://www.gridgain.com/> >> Powered by Apache® Ignite™ >> >> On Thu, Nov 18, 2021 at 2:25 PM Sergey Korotkov >> <serge.korot...@gmail.com >> <//e.mail.ru/compose/?mailto=mailto%3aserge.korot...@gmail.com>> >> wrote: >> >> Hello, >> >> We have troubles with the CorruptedTreeException: B+Tree is >> corrupted >> during the node start after cluster restart. Looks like the >> caches with >> the Expiry Policy configured are source of the problems. >> >> I have attached the log from the problem node. The exact >> steps with >> timestamps are as folows. Before the deactivation cluster >> works fine >> about 5 days >> >> 2021-11-08 10:54:44 cluster deactivate request >> >> 2021-11-08 10:59:33 cluster deactivated >> >> 2021-11-08 11:02:30 stop all nodes >> >> 2021-11-08 11:02:39 start all nodes >> >> 2021-11-08 11:03:14 auto-activation start >> >> 2021-11-08 11:03:16 cluster activated >> >> 2021-11-08 11:03:21 'B+Tree is corrupted' exception in >> GridCacheTtlManager.expire() on one of the nodes (see the >> 10.12.86.29-ignite-2021-11-08.0.log): >> >> [2021-11-08 11:03:21,820][ERROR][ttl-cleanup-worker-#215][ROOT]{} >> Critical system error detected. Will be handled accordingly to >> configured handler [hnd=StopNodeOrHaltFailureHandler >> [tryStop=false, >> timeout=0, super=AbstractFailureHandler >> [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, >> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext >> [type=CRITICAL_ERROR, err=class >> o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: >> B+Tree >> is corrupted [pages(groupId, pageId)=[], msg=Runtime failure >> on bounds: >> [lower=null, upper=PendingRow []]]]] >> >> org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: >> B+Tree is corrupted [pages(groupId, pageId)=[], msg=Runtime >> failure on >> bounds: [lower=null, upper=PendingRow []]] >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6139) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1133) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1100) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1095) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:3076) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:3023) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:1255) >> at >> >> org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:246) >> at >> >> org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.lambda$body$0(GridCacheSharedTtlCleanupManager.java:193) >> at >> >> java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1769) >> at >> >> org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:192) >> at >> >> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) >> at java.lang.Thread.run(Thread.java:748) >> Caused by: >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException: >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException: >> java.lang.IllegalStateException: Item not found: 3 >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1079) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1118) >> ... 11 common frames omitted >> Caused by: >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException: >> java.lang.IllegalStateException: Item not found: 3 >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:345) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:165) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:136) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:123) >> at >> >> org.apache.ignite.internal.processors.cache.tree.PendingRow.initKey(PendingRow.java:73) >> at >> >> org.apache.ignite.internal.processors.cache.tree.PendingEntriesTree.getRow(PendingEntriesTree.java:127) >> at >> >> org.apache.ignite.internal.processors.cache.tree.PendingEntriesTree.getRow(PendingEntriesTree.java:32) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$ForwardCursor.fillFromBuffer0(BPlusTree.java:5820) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.fillFromBuffer(BPlusTree.java:5586) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.init(BPlusTree.java:5512) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1068) >> ... 12 common frames omitted >> Caused by: java.lang.IllegalStateException: Item not found: 3 >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.getDataOffset(AbstractDataPageIO.java:584) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.readPayload(AbstractDataPageIO.java:626) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.readIncomplete(CacheDataRowAdapter.java:380) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:316) >> ... 22 common frames omitted >> >> >> Error message suggests to run the >> org.apache.ignite.development.utis.IgniteWalConverter to >> diagnose the >> problem. I attached the output of this utility: >> >> - corruptedPages_2021-11-08_11-03-21_999.txt - file created >> by ignite >> on crash >> >> - diag-2021-11-08.txt - output of the diagnostic utility. >> >> >> Next day we try to start this node again and it still fail >> with 'B+Tree >> is corrupted' but in different place: >> >> 2021-11-09 12:37:34 'B+Tree is corrupted' exception in >> PartitionsEvictManager$PartitionEvictionTask.run() (see the >> 10.12.86.29-ignite-2021-11-09.0.log) >> >> >> [2021-11-09 12:43:10,857][ERROR][rebalance-#344][ROOT]{} >> Critical system >> error detected. Will be handled accordingly to configured handler >> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, >> super=AbstractFailu >> reHandler [ignoredFailureTypes=UnmodifiableSet >> [SYSTEM_WORKER_BLOCKED, >> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext >> [type=CRITICAL_ERROR, err=class >> o.a.i.i.processors.cache.persistence.tree.CorruptedT >> reeException: B+Tree is corrupted [pages(groupId, pageId)=[], >> msg=Runtime failure on bounds: [lower=null, upper=null]]]] >> >> org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: >> B+Tree is corrupted [pages(groupId, pageId)=[], msg=Runtime >> failure on >> bounds: [lower=null, upper=null]] >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6139) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1133) >> at >> >> org.apache.ignite.internal.processors.cache.tree.CacheDataTree.find(CacheDataTree.java:167) >> at >> >> org.apache.ignite.internal.processors.cache.tree.CacheDataTree.find(CacheDataTree.java:63) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1100) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1095) >> at >> >> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.cursor(IgniteCacheOffheapManagerImpl.java:2914) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.cursor(GridCacheOffheapManager.java:2856) >> at >> >> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$3.onHasNext(IgniteCacheOffheapManagerImpl.java:938) >> at >> >> org.apache.ignite.internal.util.GridCloseableIteratorAdapter.hasNextX(GridCloseableIteratorAdapter.java:53) >> at >> >> org.apache.ignite.internal.util.lang.GridIteratorAdapter.hasNext(GridIteratorAdapter.java:45) >> at >> >> org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtLocalPartition.clearAll(GridDhtLocalPartition.java:987) >> at >> >> org.apache.ignite.internal.processors.cache.distributed.dht.topology.PartitionsEvictManager$PartitionEvictionTask.run(PartitionsEvictManager.java:409) >> at >> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) >> at >> java.util.concurrent.FutureTask.run(FutureTask.java:266) >> at >> >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) >> at >> >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) >> at java.lang.Thread.run(Thread.java:748) >> Caused by: >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException: >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException: >> java.lang.IllegalStateException: Item not found: 16 >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1079) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1118) >> ... 16 common frames omitted >> Caused by: >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException: >> java.lang.IllegalStateException: Item not found: 16 >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:345) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:165) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:136) >> at >> >> org.apache.ignite.internal.processors.cache.tree.DataRow.<init>(DataRow.java:55) >> at >> >> org.apache.ignite.internal.processors.cache.tree.CacheDataRowStore.dataRow(CacheDataRowStore.java:129) >> at >> >> org.apache.ignite.internal.processors.cache.tree.CacheDataTree.getRow(CacheDataTree.java:422) >> at >> >> org.apache.ignite.internal.processors.cache.tree.CacheDataTree.getRow(CacheDataTree.java:63) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$ForwardCursor.fillFromBuffer0(BPlusTree.java:5820) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.fillFromBuffer(BPlusTree.java:5586) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.init(BPlusTree.java:5512) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1068) >> ... 17 common frames omitted >> Caused by: java.lang.IllegalStateException: Item not found: 16 >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.getDataOffset(AbstractDataPageIO.java:584) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.readPayload(AbstractDataPageIO.java:626) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.readIncomplete(CacheDataRowAdapter.java:380) >> at >> >> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:316) >> >> >> I also attached the corrupted pages file and diagnostic >> output for this >> case as well >> >> - corruptedPages_2021-11-09_12-43-12_449.txt >> >> - diag-2021-11-09.txt >> >> >> In both cases pages are for the caches with the Expire Policy >> configured. >> >> >> **** >> >> What can be done about that? Is there any recommended way to >> stop/start >> ignite cluster to prevent such data loss problems? >> >> **** >> >> I see some similar fixed issues in Jira like >> https://issues.apache.org/jira/browse/IGNITE-12489 >> <https://issues.apache.org/jira/browse/IGNITE-12489> or >> https://issues.apache.org/jira/browse/IGNITE-14093 >> <https://issues.apache.org/jira/browse/IGNITE-14093> but >> looks like >> something still not working in 2.11.0. >> >> >> Thanks, >> >> -- >> >> Sergey >> >> >> >> > > > > > > > >