probably this is the case: https://issues.apache.org/jira/browse/IGNITE-15990
>
>Hello Denis,
>Yes, as I said in the original message we do use the expiration on persistent
>caches.
>The corruptedPages_2021-11-08_11-03-21_999.txt and
>corruptedPages_2021-11-09_12-43-12_449.txt files were generated by Ignite on
>crash. They show that two different caches were affected. The first one during
>the expiration and the second (next day) during rebalance eviction. Both
>caches are persistent and use the expiration.
>I also run the diagnostic utility (IgniteWalConverter) the way it is
>recommended in the error message (output attached as diag-* files).
>Is there any usefull information in these diag-* files which can help to
>understand what and how was corruped in particular?
>***
>Generally this was a test run of new 2.11.0 version in test environment. A
>goal was to check if new version works fine with out application and also can
>be safely stopped/started for maintenance. We do that since run into the
>similar problem with 'B+Tree is corrupted' on production during eviction
>rebalance (with 2.8.1). We see two similar issues fixed in 2.11.0:
>https://issues.apache.org/jira/browse/IGNITE-12489 and
>https://issues.apache.org/jira/browse/IGNITE-14093 and consider upgrade if it
>would help. By the way fix of the IGNITE-12489 (
>https://github.com/apache/ignite/pull/8358/commits ) contains a lot changes
>with several attempts. May be it just fixes not all situations?
>Before the deactivation cluster works fine under our usual load about 5 days.
>Load is about 300 requests per second each consists of several reads and
>single write to caches with the expiration turned on. After that we stop /
>start the cluster to emulate the situation we had on production with 2.8.1
>(load from our application was stopped as well before the deactivation
>request).
>***
>Caches configurations. The first one has an affinity key and interceptor
> public CacheConfiguration<ContactHistoryKey, ContactHistory>
>getEdenContactHistoryCacheConfiguration() {
> CacheConfiguration<ContactHistoryKey, ContactHistory> cacheConfiguration =
>new CacheConfiguration<>();
> cacheConfiguration.setCacheMode(CacheMode.PARTITIONED);
> cacheConfiguration.setAffinity(new RendezvousAffinityFunction(false,
>1024));
> cacheConfiguration.setBackups(1);
> cacheConfiguration.setAtomicityMode(CacheAtomicityMode.ATOMIC);
> int expirationDays = appConfig.getContactHistoryEdenExpirationDays();
> cacheConfiguration
> .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new
>Duration(TimeUnit.DAYS, expirationDays)));
> cacheConfiguration.setInterceptor(new ContactHistoryInterceptor());
> return cacheConfiguration;
> }
>public class ContactHistoryKey {
> String sOfferId;
>
> @AffinityKeyMapped
> String subsIdAffinityKey;
>}
> CacheConfiguration<String, String> getChannelOfferIdCache() {
> CacheConfiguration<String, String> cacheConfiguration = new
>CacheConfiguration<>();
> cacheConfiguration.setCacheMode(CacheMode.PARTITIONED);
> cacheConfiguration.setAffinity(new RendezvousAffinityFunction(false,
>1024));
> cacheConfiguration.setBackups(1);
> cacheConfiguration.setAtomicityMode(CacheAtomicityMode.ATOMIC);
> int expirationDays = appConfig.getChannelOfferIdCacheExpirationDays();
> cacheConfiguration
> .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new
>Duration(TimeUnit.DAYS, expirationDays)));
> return cacheConfiguration;
> }
>***
>As for the other details. Not sure is it relevant or not. Deactivation was
>relatevily long and log contains a lot of warnings between 2021-11-08
>10:54:44 and 2021-11-08 10:59:33. Also there was a page locks dump at
>10:56:47,567. A lot of locks were logged for cache in the
>Batch_Campaigns_Region (region without the persistence).
>[2021-11-08
>10:56:47,560][WARN][page-lock-tracker-timeout][CacheDiagnosticManager]{}
>Threads hanged: [(sys-#99087-105322, WAITING)]
>[2021-11-08
>10:56:47,567][WARN][page-lock-tracker-timeout][CacheDiagnosticManager]{} Page
>locks dump:
>................
>Thread=[name=sys-#99087, id=105322], state=WAITING
>Log overflow, size:512, headIdx=512 [structureId=7,
>pageIdpageId=281474976790149 [pageIdHex=0001000000013685, partId=0,
>pageIdx=79493, flags=00000001]]
>Locked pages =
>[844420635168174[0002ffff00000dae](r=0|w=1),844420635168175[0002ffff00000daf](r=0|w=1)]
>Locked pages log: name=sys-#99087 time=(1636358127247, 2021-11-08 07:55:27.247)
>L=1 -> Write lock pageId=844420635168174,
>structureId=batch_campaign_results-p-0##CacheData [pageIdHex=0002ffff00000dae,
>partId=65535, pageIdx=3502, flags=00000002]
>L=2 -> Write lock pageId=844420635168175,
>structureId=batch_campaign_results-p-0##CacheData [pageIdHex=0002ffff00000daf,
>partId=65535, pageIdx=3503, flags=00000002]
>L=3 -> Write lock pageId=281474976716639,
>structureId=Batch_Campaigns_Region##FreeList [pageIdHex=000100000000175f,
>partId=0, pageIdx=5983, flags=00000001]
>L=4 -> Write lock pageId=844420635243234,
>structureId=Batch_Campaigns_Region##FreeList [pageIdHex=0002ffff000132e2,
>partId=65535, pageIdx=78562, flags=00000002]
>a lot more ...
>......................
>Can it cause that something say not was checkpointed to the disk correctly?
>Thanks,
>--
> Sergey
>
>22.11.2021 23:18, Denis Chudov пишет:
>>Hi Sergey!
>>
>>Seems that the root cause of your problem is possible corruption of data page
>>
>>Caused by: java.lang.IllegalStateException: Item not found: 3
>>at
>>org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476)
>>
>>and it happened twice, not during expiration cleanup, but also during
>>rebalance eviction. But your scenario looks quite usual and couldn't cause
>>such problems.
>>
>>I've found similar existing ticket
>>https://issues.apache.org/jira/browse/IGNITE-11400
>>but assignee couldn't reproduce the error.
>>
>>Could you please provide more information about what had happened on your
>>cluster before and about cache configuration? Do you use expiration on
>>persistent caches? Any details can be useful for the problem localization.
>>
>>Denis Chudov
>>Software Engineer, Moscow
>>+7 905 5775239
>>https://www.gridgain.com
>>Powered by Apache® Ignite™
>>On Thu, Nov 18, 2021 at 2:25 PM Sergey Korotkov < serge.korot...@gmail.com >
>>wrote:
>>>Hello,
>>>
>>>We have troubles with the CorruptedTreeException: B+Tree is corrupted
>>>during the node start after cluster restart. Looks like the caches with
>>>the Expiry Policy configured are source of the problems.
>>>
>>>I have attached the log from the problem node. The exact steps with
>>>timestamps are as folows. Before the deactivation cluster works fine
>>>about 5 days
>>>
>>>2021-11-08 10:54:44 cluster deactivate request
>>>
>>>2021-11-08 10:59:33 cluster deactivated
>>>
>>>2021-11-08 11:02:30 stop all nodes
>>>
>>>2021-11-08 11:02:39 start all nodes
>>>
>>>2021-11-08 11:03:14 auto-activation start
>>>
>>>2021-11-08 11:03:16 cluster activated
>>>
>>>2021-11-08 11:03:21 'B+Tree is corrupted' exception in
>>>GridCacheTtlManager.expire() on one of the nodes (see the
>>>10.12.86.29-ignite-2021-11-08.0.log):
>>>
>>>[2021-11-08 11:03:21,820][ERROR][ttl-cleanup-worker-#215][ROOT]{}
>>>Critical system error detected. Will be handled accordingly to
>>>configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false,
>>>timeout=0, super=AbstractFailureHandler
>>>[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
>>>SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
>>>[type=CRITICAL_ERROR, err=class
>>>o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree
>>>is corrupted [pages(groupId, pageId)=[], msg=Runtime failure on bounds:
>>>[lower=null, upper=PendingRow []]]]]
>>>org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
>>>B+Tree is corrupted [pages(groupId, pageId)=[], msg=Runtime failure on
>>>bounds: [lower=null, upper=PendingRow []]]
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6139)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1133)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1100)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1095)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:3076)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:3023)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:1255)
>>> at
>>>org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:246)
>>> at
>>>org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.lambda$body$0(GridCacheSharedTtlCleanupManager.java:193)
>>> at
>>>java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1769)
>>> at
>>>org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:192)
>>> at
>>>org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>>> at java.lang.Thread.run(Thread.java:748)
>>>Caused by:
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>>>java.lang.IllegalStateException: Item not found: 3
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1079)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1118)
>>> ... 11 common frames omitted
>>>Caused by:
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>>>java.lang.IllegalStateException: Item not found: 3
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:345)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:165)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:136)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:123)
>>> at
>>>org.apache.ignite.internal.processors.cache.tree.PendingRow.initKey(PendingRow.java:73)
>>> at
>>>org.apache.ignite.internal.processors.cache.tree.PendingEntriesTree.getRow(PendingEntriesTree.java:127)
>>> at
>>>org.apache.ignite.internal.processors.cache.tree.PendingEntriesTree.getRow(PendingEntriesTree.java:32)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$ForwardCursor.fillFromBuffer0(BPlusTree.java:5820)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.fillFromBuffer(BPlusTree.java:5586)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.init(BPlusTree.java:5512)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1068)
>>> ... 12 common frames omitted
>>>Caused by: java.lang.IllegalStateException: Item not found: 3
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.getDataOffset(AbstractDataPageIO.java:584)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.readPayload(AbstractDataPageIO.java:626)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.readIncomplete(CacheDataRowAdapter.java:380)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:316)
>>> ... 22 common frames omitted
>>>
>>>
>>>Error message suggests to run the
>>>org.apache.ignite.development.utis.IgniteWalConverter to diagnose the
>>>problem. I attached the output of this utility:
>>>
>>> - corruptedPages_2021-11-08_11-03-21_999.txt - file created by ignite
>>>on crash
>>>
>>> - diag-2021-11-08.txt - output of the diagnostic utility.
>>>
>>>
>>>Next day we try to start this node again and it still fail with 'B+Tree
>>>is corrupted' but in different place:
>>>
>>>2021-11-09 12:37:34 'B+Tree is corrupted' exception in
>>>PartitionsEvictManager$PartitionEvictionTask.run() (see the
>>>10.12.86.29-ignite-2021-11-09.0.log)
>>>
>>>
>>>[2021-11-09 12:43:10,857][ERROR][rebalance-#344][ROOT]{} Critical system
>>>error detected. Will be handled accordingly to configured handler
>>>[hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>super=AbstractFailu
>>>reHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
>>>SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
>>>[type=CRITICAL_ERROR, err=class
>>>o.a.i.i.processors.cache.persistence.tree.CorruptedT
>>>reeException: B+Tree is corrupted [pages(groupId, pageId)=[],
>>>msg=Runtime failure on bounds: [lower=null, upper=null]]]]
>>>org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
>>>B+Tree is corrupted [pages(groupId, pageId)=[], msg=Runtime failure on
>>>bounds: [lower=null, upper=null]]
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6139)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1133)
>>> at
>>>org.apache.ignite.internal.processors.cache.tree.CacheDataTree.find(CacheDataTree.java:167)
>>> at
>>>org.apache.ignite.internal.processors.cache.tree.CacheDataTree.find(CacheDataTree.java:63)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1100)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1095)
>>> at
>>>org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.cursor(IgniteCacheOffheapManagerImpl.java:2914)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.cursor(GridCacheOffheapManager.java:2856)
>>> at
>>>org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$3.onHasNext(IgniteCacheOffheapManagerImpl.java:938)
>>> at
>>>org.apache.ignite.internal.util.GridCloseableIteratorAdapter.hasNextX(GridCloseableIteratorAdapter.java:53)
>>> at
>>>org.apache.ignite.internal.util.lang.GridIteratorAdapter.hasNext(GridIteratorAdapter.java:45)
>>> at
>>>org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtLocalPartition.clearAll(GridDhtLocalPartition.java:987)
>>> at
>>>org.apache.ignite.internal.processors.cache.distributed.dht.topology.PartitionsEvictManager$PartitionEvictionTask.run(PartitionsEvictManager.java:409)
>>> at
>>>java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> at
>>>java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> at
>>>java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> at java.lang.Thread.run(Thread.java:748)
>>>Caused by:
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>>>java.lang.IllegalStateException: Item not found: 16
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1079)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1118)
>>> ... 16 common frames omitted
>>>Caused by:
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>>>java.lang.IllegalStateException: Item not found: 16
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:345)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:165)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:136)
>>> at
>>>org.apache.ignite.internal.processors.cache.tree.DataRow.<init>(DataRow.java:55)
>>> at
>>>org.apache.ignite.internal.processors.cache.tree.CacheDataRowStore.dataRow(CacheDataRowStore.java:129)
>>> at
>>>org.apache.ignite.internal.processors.cache.tree.CacheDataTree.getRow(CacheDataTree.java:422)
>>> at
>>>org.apache.ignite.internal.processors.cache.tree.CacheDataTree.getRow(CacheDataTree.java:63)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$ForwardCursor.fillFromBuffer0(BPlusTree.java:5820)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.fillFromBuffer(BPlusTree.java:5586)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.init(BPlusTree.java:5512)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1068)
>>> ... 17 common frames omitted
>>>Caused by: java.lang.IllegalStateException: Item not found: 16
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.getDataOffset(AbstractDataPageIO.java:584)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.readPayload(AbstractDataPageIO.java:626)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.readIncomplete(CacheDataRowAdapter.java:380)
>>> at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:316)
>>>
>>>
>>>I also attached the corrupted pages file and diagnostic output for this
>>>case as well
>>>
>>>- corruptedPages_2021-11-09_12-43-12_449.txt
>>>
>>>- diag-2021-11-09.txt
>>>
>>>
>>>In both cases pages are for the caches with the Expire Policy configured.
>>>
>>>
>>>****
>>>
>>>What can be done about that? Is there any recommended way to stop/start
>>>ignite cluster to prevent such data loss problems?
>>>
>>>****
>>>
>>>I see some similar fixed issues in Jira like
>>>https://issues.apache.org/jira/browse/IGNITE-12489 or
>>>https://issues.apache.org/jira/browse/IGNITE-14093 but looks like
>>>something still not working in 2.11.0.
>>>
>>>
>>>Thanks,
>>>
>>>--
>>>
>>> Sergey
>>>
>>>
>>>
>
>