Re: [2.11.0]: 'B+Tree is corrupted' exception in GridCacheTtlManager.expire() and PartitionsEvictManager$PartitionEvictionTask.run() on node start

Denis Chudov Tue, 23 Nov 2021 11:10:55 -0800

Hi Sergey!

Thank you for providing details.
Have I understood correctly that you run newer version of Ignite on older
persistent files? Is there any possibility that some data in your caches
survived that 5 days of cluster work?
I'm just trying to exclude any compatibility problems, like
https://issues.apache.org/jira/browse/IGNITE-14252


Denis Chudov
Software Engineer, Moscow
+7 905 5775239
https://www.gridgain.com
Powered by Apache® Ignite™


On Tue, Nov 23, 2021 at 8:32 AM Sergey Korotkov <serge.korot...@gmail.com>
wrote:

> Hello Denis,
>
> Yes, as I said in the original message we do use the expiration on
> persistent caches.
>
> The corruptedPages_2021-11-08_11-03-21_999.txt and
> corruptedPages_2021-11-09_12-43-12_449.txt files were generated by Ignite
> on crash. They show that two different caches were affected. The first one
> during the expiration and the second (next day) during rebalance eviction.
> Both caches are persistent and use the expiration.
>
> I also run the diagnostic utility (IgniteWalConverter) the way it is
> recommended in the error message (output attached as diag-* files).
>
> Is there any usefull information in these diag-* files which can help to
> understand what and how was corruped in particular?
>
> ***
>
> Generally this was a test run of new 2.11.0 version in test environment. A
> goal was to check if new version works fine with out application and also
> can be safely stopped/started for maintenance. We do that since run into
> the similar problem with 'B+Tree is corrupted' on production during
> eviction rebalance (with 2.8.1).  We see two similar issues fixed in
> 2.11.0:  https://issues.apache.org/jira/browse/IGNITE-12489 and
> https://issues.apache.org/jira/browse/IGNITE-14093 and consider upgrade
> if it would help.   By the way fix of the IGNITE-12489 (
> https://github.com/apache/ignite/pull/8358/commits) contains a lot
> changes with several attempts. May be it just fixes not all situations?
>
> Before the deactivation cluster works fine under our usual load about 5
> days. Load is about 300 requests per second each consists of several reads
> and single write to caches with the expiration turned on.  After that we
> stop / start the cluster to emulate the situation we had on production with
> 2.8.1 (load from our application was stopped as well before the
> deactivation request).
>
> ***
>
> Caches configurations. The first one has an affinity key and interceptor
>
>   public CacheConfiguration<ContactHistoryKey, ContactHistory>
> getEdenContactHistoryCacheConfiguration() {
>     CacheConfiguration<ContactHistoryKey, ContactHistory>
> cacheConfiguration = new CacheConfiguration<>();
>     cacheConfiguration.setCacheMode(CacheMode.PARTITIONED);
>     cacheConfiguration.setAffinity(new RendezvousAffinityFunction(false,
> 1024));
>     cacheConfiguration.setBackups(1);
>     cacheConfiguration.setAtomicityMode(CacheAtomicityMode.ATOMIC);
>     int expirationDays = appConfig.getContactHistoryEdenExpirationDays();
>     cacheConfiguration
>         .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new
> Duration(TimeUnit.DAYS, expirationDays)));
>     cacheConfiguration.setInterceptor(new ContactHistoryInterceptor());
>     return cacheConfiguration;
>   }
>
> public class ContactHistoryKey {
>   String sOfferId;
>
>   @AffinityKeyMapped
>   String subsIdAffinityKey;
> }
>
>   CacheConfiguration<String, String> getChannelOfferIdCache() {
>     CacheConfiguration<String, String> cacheConfiguration = new
> CacheConfiguration<>();
>     cacheConfiguration.setCacheMode(CacheMode.PARTITIONED);
>     cacheConfiguration.setAffinity(new RendezvousAffinityFunction(false,
> 1024));
>     cacheConfiguration.setBackups(1);
>     cacheConfiguration.setAtomicityMode(CacheAtomicityMode.ATOMIC);
>     int expirationDays = appConfig.getChannelOfferIdCacheExpirationDays();
>     cacheConfiguration
>         .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new
> Duration(TimeUnit.DAYS, expirationDays)));
>     return cacheConfiguration;
>   }
>
> ***
>
> As for the other details.  Not sure is it relevant or not. Deactivation
> was relatevily long and log contains a lot of warnings between  2021-11-08
> 10:54:44 and 2021-11-08 10:59:33.  Also there was a page locks dump at
> 10:56:47,567.  A lot of locks were logged for cache in the
> Batch_Campaigns_Region (region without the persistence).
>
> [2021-11-08
> 10:56:47,560][WARN][page-lock-tracker-timeout][CacheDiagnosticManager]{}
> Threads hanged: [(sys-#99087-105322, WAITING)]
> [2021-11-08
> 10:56:47,567][WARN][page-lock-tracker-timeout][CacheDiagnosticManager]{}
> Page locks dump:
>
> ................
>
> Thread=[name=sys-#99087, id=105322], state=WAITING
> Log overflow, size:512, headIdx=512 [structureId=7,
> pageIdpageId=281474976790149 [pageIdHex=0001000000013685, partId=0,
> pageIdx=79493, flags=00000001]]
> Locked pages =
> [844420635168174[0002ffff00000dae](r=0|w=1),844420635168175[0002ffff00000daf](r=0|w=1)]
> Locked pages log: name=sys-#99087 time=(1636358127247, 2021-11-08
> 07:55:27.247)
> L=1 -> Write lock pageId=844420635168174,
> structureId=batch_campaign_results-p-0##CacheData
> [pageIdHex=0002ffff00000dae, partId=65535, pageIdx=3502, flags=00000002]
> L=2 -> Write lock pageId=844420635168175,
> structureId=batch_campaign_results-p-0##CacheData
> [pageIdHex=0002ffff00000daf, partId=65535, pageIdx=3503, flags=00000002]
> L=3 -> Write lock pageId=281474976716639,
> structureId=Batch_Campaigns_Region##FreeList [pageIdHex=000100000000175f,
> partId=0, pageIdx=5983, flags=00000001]
> L=4 -> Write lock pageId=844420635243234,
> structureId=Batch_Campaigns_Region##FreeList [pageIdHex=0002ffff000132e2,
> partId=65535, pageIdx=78562, flags=00000002]
>
> a lot more ...
>
> ......................
>
> Can it cause that something say not was checkpointed to the disk correctly?
>
> Thanks,
>
> --
>
>   Sergey
>
>
> 22.11.2021 23:18, Denis Chudov пишет:
>
> Hi Sergey!
>
> Seems that the root cause of your problem is possible corruption of data
> page
>
> Caused by: java.lang.IllegalStateException: Item not found: 3
> at
> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476)
>
> and it happened twice, not during expiration cleanup, but also during
> rebalance eviction. But your scenario looks quite usual and couldn't cause
> such problems.
>
> I've found similar existing ticket
> https://issues.apache.org/jira/browse/IGNITE-11400
> but assignee couldn't reproduce the error.
>
> Could you please provide more information about what had happened on your
> cluster before and about cache configuration? Do you use expiration on
> persistent caches? Any details can be useful for the problem localization.
>
> Denis Chudov
> Software Engineer, Moscow
> +7 905 5775239
> https://www.gridgain.com
> Powered by Apache® Ignite™
>
>
> On Thu, Nov 18, 2021 at 2:25 PM Sergey Korotkov <serge.korot...@gmail.com>
> wrote:
>
>> Hello,
>>
>> We have troubles with the CorruptedTreeException: B+Tree is corrupted
>> during the node start after cluster restart. Looks like the caches with
>> the Expiry Policy configured are source of the problems.
>>
>> I have attached the log from the problem node.  The exact steps with
>> timestamps are as folows.  Before the deactivation cluster works fine
>> about 5 days
>>
>> 2021-11-08 10:54:44         cluster deactivate request
>>
>> 2021-11-08 10:59:33         cluster deactivated
>>
>> 2021-11-08 11:02:30         stop all nodes
>>
>> 2021-11-08 11:02:39         start all nodes
>>
>> 2021-11-08 11:03:14         auto-activation start
>>
>> 2021-11-08 11:03:16         cluster activated
>>
>> 2021-11-08 11:03:21         'B+Tree is corrupted' exception in
>> GridCacheTtlManager.expire() on one of the nodes (see the
>> 10.12.86.29-ignite-2021-11-08.0.log):
>>
>> [2021-11-08 11:03:21,820][ERROR][ttl-cleanup-worker-#215][ROOT]{}
>> Critical system error detected. Will be handled accordingly to
>> configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false,
>> timeout=0, super=AbstractFailureHandler
>> [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
>> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
>> [type=CRITICAL_ERROR, err=class
>> o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree
>> is corrupted [pages(groupId, pageId)=[], msg=Runtime failure on bounds:
>> [lower=null, upper=PendingRow []]]]]
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
>> B+Tree is corrupted [pages(groupId, pageId)=[], msg=Runtime failure on
>> bounds: [lower=null, upper=PendingRow []]]
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6139)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1133)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1100)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1095)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:3076)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:3023)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:1255)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:246)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.lambda$body$0(GridCacheSharedTtlCleanupManager.java:193)
>>         at
>>
>> java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1769)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:192)
>>         at
>> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>>         at java.lang.Thread.run(Thread.java:748)
>> Caused by:
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>> java.lang.IllegalStateException: Item not found: 3
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1079)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1118)
>>         ... 11 common frames omitted
>> Caused by:
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>> java.lang.IllegalStateException: Item not found: 3
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:345)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:165)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:136)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:123)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.tree.PendingRow.initKey(PendingRow.java:73)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.tree.PendingEntriesTree.getRow(PendingEntriesTree.java:127)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.tree.PendingEntriesTree.getRow(PendingEntriesTree.java:32)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$ForwardCursor.fillFromBuffer0(BPlusTree.java:5820)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.fillFromBuffer(BPlusTree.java:5586)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.init(BPlusTree.java:5512)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1068)
>>         ... 12 common frames omitted
>> Caused by: java.lang.IllegalStateException: Item not found: 3
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.getDataOffset(AbstractDataPageIO.java:584)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.readPayload(AbstractDataPageIO.java:626)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.readIncomplete(CacheDataRowAdapter.java:380)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:316)
>>         ... 22 common frames omitted
>>
>>
>> Error message suggests to run the
>> org.apache.ignite.development.utis.IgniteWalConverter to diagnose the
>> problem.  I attached the output of this utility:
>>
>>  - corruptedPages_2021-11-08_11-03-21_999.txt  - file created by ignite
>> on crash
>>
>>  - diag-2021-11-08.txt - output of the diagnostic utility.
>>
>>
>> Next day we try to start this node again and it still fail with  'B+Tree
>> is corrupted'  but in different place:
>>
>> 2021-11-09 12:37:34     'B+Tree is corrupted' exception in
>> PartitionsEvictManager$PartitionEvictionTask.run()  (see the
>> 10.12.86.29-ignite-2021-11-09.0.log)
>>
>>
>> [2021-11-09 12:43:10,857][ERROR][rebalance-#344][ROOT]{} Critical system
>> error detected. Will be handled accordingly to configured handler
>> [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>> super=AbstractFailu
>> reHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
>> SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
>> [type=CRITICAL_ERROR, err=class
>> o.a.i.i.processors.cache.persistence.tree.CorruptedT
>> reeException: B+Tree is corrupted [pages(groupId, pageId)=[],
>> msg=Runtime failure on bounds: [lower=null, upper=null]]]]
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
>> B+Tree is corrupted [pages(groupId, pageId)=[], msg=Runtime failure on
>> bounds: [lower=null, upper=null]]
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6139)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1133)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.tree.CacheDataTree.find(CacheDataTree.java:167)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.tree.CacheDataTree.find(CacheDataTree.java:63)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1100)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1095)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.cursor(IgniteCacheOffheapManagerImpl.java:2914)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.cursor(GridCacheOffheapManager.java:2856)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$3.onHasNext(IgniteCacheOffheapManagerImpl.java:938)
>>         at
>>
>> org.apache.ignite.internal.util.GridCloseableIteratorAdapter.hasNextX(GridCloseableIteratorAdapter.java:53)
>>         at
>>
>> org.apache.ignite.internal.util.lang.GridIteratorAdapter.hasNext(GridIteratorAdapter.java:45)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtLocalPartition.clearAll(GridDhtLocalPartition.java:987)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.distributed.dht.topology.PartitionsEvictManager$PartitionEvictionTask.run(PartitionsEvictManager.java:409)
>>         at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>         at
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>         at
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>         at java.lang.Thread.run(Thread.java:748)
>> Caused by:
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>> java.lang.IllegalStateException: Item not found: 16
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1079)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1118)
>>         ... 16 common frames omitted
>> Caused by:
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>> java.lang.IllegalStateException: Item not found: 16
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:345)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:165)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:136)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.tree.DataRow.<init>(DataRow.java:55)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.tree.CacheDataRowStore.dataRow(CacheDataRowStore.java:129)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.tree.CacheDataTree.getRow(CacheDataTree.java:422)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.tree.CacheDataTree.getRow(CacheDataTree.java:63)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$ForwardCursor.fillFromBuffer0(BPlusTree.java:5820)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.fillFromBuffer(BPlusTree.java:5586)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.init(BPlusTree.java:5512)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1068)
>>         ... 17 common frames omitted
>> Caused by: java.lang.IllegalStateException: Item not found: 16
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.getDataOffset(AbstractDataPageIO.java:584)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.readPayload(AbstractDataPageIO.java:626)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.readIncomplete(CacheDataRowAdapter.java:380)
>>         at
>>
>> org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:316)
>>
>>
>> I also attached the corrupted pages file and diagnostic output for this
>> case as well
>>
>> - corruptedPages_2021-11-09_12-43-12_449.txt
>>
>> - diag-2021-11-09.txt
>>
>>
>> In both cases pages are for the caches with the Expire Policy configured.
>>
>>
>> ****
>>
>> What can be done about that?  Is there any recommended way to stop/start
>> ignite cluster to prevent such data loss problems?
>>
>> ****
>>
>> I see some similar fixed issues in Jira like
>> https://issues.apache.org/jira/browse/IGNITE-12489 or
>> https://issues.apache.org/jira/browse/IGNITE-14093  but looks like
>> something still not working in 2.11.0.
>>
>>
>> Thanks,
>>
>> --
>>
>>   Sergey
>>
>>
>>
>>

Re: [2.11.0]: 'B+Tree is corrupted' exception in GridCacheTtlManager.expire() and PartitionsEvictManager$PartitionEvictionTask.run() on node start

Reply via email to