probably this is the case:  https://issues.apache.org/jira/browse/IGNITE-15990
 
   
>
>Hello Denis,
>Yes, as I said in the original message we do use the expiration on persistent 
>caches.
>The corruptedPages_2021-11-08_11-03-21_999.txt and 
>corruptedPages_2021-11-09_12-43-12_449.txt files were generated by Ignite on 
>crash. They show that two different caches were affected. The first one during 
>the expiration and the second (next day) during rebalance eviction.  Both 
>caches are persistent and use the expiration.
>I also run the diagnostic utility (IgniteWalConverter) the way it is 
>recommended in the error message (output attached as diag-* files). 
>Is there any usefull information in these diag-* files which can help to 
>understand what and how was corruped in particular?
>***
>Generally this was a test run of new 2.11.0 version in test environment. A 
>goal was to check if new version works fine with out application and also can 
>be safely stopped/started for maintenance. We do that since run into the 
>similar problem with 'B+Tree is corrupted' on production during eviction 
>rebalance (with 2.8.1).  We see two similar issues fixed in 2.11.0:   
>https://issues.apache.org/jira/browse/IGNITE-12489 and  
>https://issues.apache.org/jira/browse/IGNITE-14093 and consider upgrade if it 
>would help.   By the way fix of the IGNITE-12489 ( 
>https://github.com/apache/ignite/pull/8358/commits ) contains a lot changes 
>with several attempts. May be it just fixes not all situations?
>Before the deactivation cluster works fine under our usual load about 5 days. 
>Load is about 300 requests per second each consists of several reads and 
>single write to caches with the expiration turned on.  After that we stop / 
>start the cluster to emulate the situation we had on production with 2.8.1 
>(load from our application was stopped as well before the deactivation 
>request).
>***
>Caches configurations. The first one has an affinity key and interceptor
>  public CacheConfiguration<ContactHistoryKey, ContactHistory> 
>getEdenContactHistoryCacheConfiguration() {
>    CacheConfiguration<ContactHistoryKey, ContactHistory> cacheConfiguration = 
>new CacheConfiguration<>();
>    cacheConfiguration.setCacheMode(CacheMode.PARTITIONED);
>    cacheConfiguration.setAffinity(new RendezvousAffinityFunction(false, 
>1024));
>    cacheConfiguration.setBackups(1);
>    cacheConfiguration.setAtomicityMode(CacheAtomicityMode.ATOMIC);
>    int expirationDays = appConfig.getContactHistoryEdenExpirationDays();
>    cacheConfiguration
>        .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new 
>Duration(TimeUnit.DAYS, expirationDays)));
>    cacheConfiguration.setInterceptor(new ContactHistoryInterceptor());
>    return cacheConfiguration;
>  }
>public class ContactHistoryKey {
>  String sOfferId;
>
>  @AffinityKeyMapped
>  String subsIdAffinityKey;
>}
>  CacheConfiguration<String, String> getChannelOfferIdCache() {
>    CacheConfiguration<String, String> cacheConfiguration = new 
>CacheConfiguration<>();
>    cacheConfiguration.setCacheMode(CacheMode.PARTITIONED);
>    cacheConfiguration.setAffinity(new RendezvousAffinityFunction(false, 
>1024));
>    cacheConfiguration.setBackups(1);
>    cacheConfiguration.setAtomicityMode(CacheAtomicityMode.ATOMIC);
>    int expirationDays = appConfig.getChannelOfferIdCacheExpirationDays();
>    cacheConfiguration
>        .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(new 
>Duration(TimeUnit.DAYS, expirationDays)));
>    return cacheConfiguration;
>  }
>***
>As for the other details.  Not sure is it relevant or not. Deactivation was 
>relatevily long and log contains a lot of warnings between  2021-11-08 
>10:54:44 and 2021-11-08 10:59:33.  Also there was a page locks dump at 
>10:56:47,567.  A lot of locks were logged for cache in the  
>Batch_Campaigns_Region (region without the persistence).
>[2021-11-08 
>10:56:47,560][WARN][page-lock-tracker-timeout][CacheDiagnosticManager]{} 
>Threads hanged: [(sys-#99087-105322, WAITING)]
>[2021-11-08 
>10:56:47,567][WARN][page-lock-tracker-timeout][CacheDiagnosticManager]{} Page 
>locks dump:
>................
>Thread=[name=sys-#99087, id=105322], state=WAITING
>Log overflow, size:512, headIdx=512 [structureId=7, 
>pageIdpageId=281474976790149 [pageIdHex=0001000000013685, partId=0, 
>pageIdx=79493, flags=00000001]]
>Locked pages = 
>[844420635168174[0002ffff00000dae](r=0|w=1),844420635168175[0002ffff00000daf](r=0|w=1)]
>Locked pages log: name=sys-#99087 time=(1636358127247, 2021-11-08 07:55:27.247)
>L=1 -> Write lock pageId=844420635168174, 
>structureId=batch_campaign_results-p-0##CacheData [pageIdHex=0002ffff00000dae, 
>partId=65535, pageIdx=3502, flags=00000002]
>L=2 -> Write lock pageId=844420635168175, 
>structureId=batch_campaign_results-p-0##CacheData [pageIdHex=0002ffff00000daf, 
>partId=65535, pageIdx=3503, flags=00000002]
>L=3 -> Write lock pageId=281474976716639, 
>structureId=Batch_Campaigns_Region##FreeList [pageIdHex=000100000000175f, 
>partId=0, pageIdx=5983, flags=00000001]
>L=4 -> Write lock pageId=844420635243234, 
>structureId=Batch_Campaigns_Region##FreeList [pageIdHex=0002ffff000132e2, 
>partId=65535, pageIdx=78562, flags=00000002]
>a lot more ...
>......................
>Can it cause that something say not was checkpointed to the disk correctly?
>Thanks,
>--
>  Sergey
> 
>22.11.2021 23:18, Denis Chudov пишет:
>>Hi Sergey!
>> 
>>Seems that the root cause of your problem is possible corruption of data page
>> 
>>Caused by: java.lang.IllegalStateException: Item not found: 3
>>at 
>>org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476)
>> 
>>and it happened twice, not during expiration cleanup, but also during 
>>rebalance eviction. But your scenario looks quite usual and couldn't cause 
>>such problems.
>> 
>>I've found similar existing ticket  
>>https://issues.apache.org/jira/browse/IGNITE-11400
>>but assignee couldn't reproduce the error.
>> 
>>Could you please provide more information about what had happened on your 
>>cluster before and about cache configuration? Do you use expiration on 
>>persistent caches? Any details can be useful for the problem localization.
>> 
>>Denis Chudov
>>Software Engineer, Moscow
>>+7 905 5775239
>>https://www.gridgain.com
>>Powered by Apache® Ignite™  
>>On Thu, Nov 18, 2021 at 2:25 PM Sergey Korotkov < serge.korot...@gmail.com > 
>>wrote:
>>>Hello,
>>>
>>>We have troubles with the CorruptedTreeException: B+Tree is corrupted
>>>during the node start after cluster restart. Looks like the caches with
>>>the Expiry Policy configured are source of the problems.
>>>
>>>I have attached the log from the problem node.  The exact steps with
>>>timestamps are as folows.  Before the deactivation cluster works fine
>>>about 5 days
>>>
>>>2021-11-08 10:54:44         cluster deactivate request  
>>>
>>>2021-11-08 10:59:33         cluster deactivated
>>>
>>>2021-11-08 11:02:30         stop all nodes
>>>
>>>2021-11-08 11:02:39         start all nodes
>>>
>>>2021-11-08 11:03:14         auto-activation start
>>>
>>>2021-11-08 11:03:16         cluster activated
>>>
>>>2021-11-08 11:03:21         'B+Tree is corrupted' exception in
>>>GridCacheTtlManager.expire() on one of the nodes (see the
>>>10.12.86.29-ignite-2021-11-08.0.log):
>>>
>>>[2021-11-08 11:03:21,820][ERROR][ttl-cleanup-worker-#215][ROOT]{}
>>>Critical system error detected. Will be handled accordingly to
>>>configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false,
>>>timeout=0, super=AbstractFailureHandler
>>>[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
>>>SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
>>>[type=CRITICAL_ERROR, err=class
>>>o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree
>>>is corrupted [pages(groupId, pageId)=[], msg=Runtime failure on bounds:
>>>[lower=null, upper=PendingRow []]]]]
>>>org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
>>>B+Tree is corrupted [pages(groupId, pageId)=[], msg=Runtime failure on
>>>bounds: [lower=null, upper=PendingRow []]]
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6139)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1133)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1100)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1095)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:3076)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:3023)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:1255)
>>>        at
>>>org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:246)
>>>        at
>>>org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.lambda$body$0(GridCacheSharedTtlCleanupManager.java:193)
>>>        at
>>>java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1769)
>>>        at
>>>org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:192)
>>>        at
>>>org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
>>>        at java.lang.Thread.run(Thread.java:748)
>>>Caused by:
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>>>java.lang.IllegalStateException: Item not found: 3
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1079)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1118)
>>>        ... 11 common frames omitted
>>>Caused by:
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>>>java.lang.IllegalStateException: Item not found: 3
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:345)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:165)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:136)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:123)
>>>        at
>>>org.apache.ignite.internal.processors.cache.tree.PendingRow.initKey(PendingRow.java:73)
>>>        at
>>>org.apache.ignite.internal.processors.cache.tree.PendingEntriesTree.getRow(PendingEntriesTree.java:127)
>>>        at
>>>org.apache.ignite.internal.processors.cache.tree.PendingEntriesTree.getRow(PendingEntriesTree.java:32)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$ForwardCursor.fillFromBuffer0(BPlusTree.java:5820)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.fillFromBuffer(BPlusTree.java:5586)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.init(BPlusTree.java:5512)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1068)
>>>        ... 12 common frames omitted
>>>Caused by: java.lang.IllegalStateException: Item not found: 3
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.getDataOffset(AbstractDataPageIO.java:584)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.readPayload(AbstractDataPageIO.java:626)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.readIncomplete(CacheDataRowAdapter.java:380)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:316)
>>>        ... 22 common frames omitted
>>>
>>>
>>>Error message suggests to run the
>>>org.apache.ignite.development.utis.IgniteWalConverter to diagnose the
>>>problem.  I attached the output of this utility:
>>>
>>> - corruptedPages_2021-11-08_11-03-21_999.txt  - file created by ignite
>>>on crash
>>>
>>> - diag-2021-11-08.txt - output of the diagnostic utility.
>>>
>>>
>>>Next day we try to start this node again and it still fail with  'B+Tree
>>>is corrupted'  but in different place: 
>>>
>>>2021-11-09 12:37:34     'B+Tree is corrupted' exception in
>>>PartitionsEvictManager$PartitionEvictionTask.run()  (see the 
>>>10.12.86.29-ignite-2021-11-09.0.log)
>>>
>>>
>>>[2021-11-09 12:43:10,857][ERROR][rebalance-#344][ROOT]{} Critical system
>>>error detected. Will be handled accordingly to configured handler
>>>[hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
>>>super=AbstractFailu
>>>reHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED,
>>>SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
>>>[type=CRITICAL_ERROR, err=class
>>>o.a.i.i.processors.cache.persistence.tree.CorruptedT
>>>reeException: B+Tree is corrupted [pages(groupId, pageId)=[],
>>>msg=Runtime failure on bounds: [lower=null, upper=null]]]]
>>>org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
>>>B+Tree is corrupted [pages(groupId, pageId)=[], msg=Runtime failure on
>>>bounds: [lower=null, upper=null]]
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6139)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1133)
>>>        at
>>>org.apache.ignite.internal.processors.cache.tree.CacheDataTree.find(CacheDataTree.java:167)
>>>        at
>>>org.apache.ignite.internal.processors.cache.tree.CacheDataTree.find(CacheDataTree.java:63)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1100)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1095)
>>>        at
>>>org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.cursor(IgniteCacheOffheapManagerImpl.java:2914)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.cursor(GridCacheOffheapManager.java:2856)
>>>        at
>>>org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$3.onHasNext(IgniteCacheOffheapManagerImpl.java:938)
>>>        at
>>>org.apache.ignite.internal.util.GridCloseableIteratorAdapter.hasNextX(GridCloseableIteratorAdapter.java:53)
>>>        at
>>>org.apache.ignite.internal.util.lang.GridIteratorAdapter.hasNext(GridIteratorAdapter.java:45)
>>>        at
>>>org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtLocalPartition.clearAll(GridDhtLocalPartition.java:987)
>>>        at
>>>org.apache.ignite.internal.processors.cache.distributed.dht.topology.PartitionsEvictManager$PartitionEvictionTask.run(PartitionsEvictManager.java:409)
>>>        at
>>>java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>        at
>>>java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>        at
>>>java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>        at java.lang.Thread.run(Thread.java:748)
>>>Caused by:
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>>>java.lang.IllegalStateException: Item not found: 16
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1079)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1118)
>>>        ... 16 common frames omitted
>>>Caused by:
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTreeRuntimeException:
>>>java.lang.IllegalStateException: Item not found: 16
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:345)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:165)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:136)
>>>        at
>>>org.apache.ignite.internal.processors.cache.tree.DataRow.<init>(DataRow.java:55)
>>>        at
>>>org.apache.ignite.internal.processors.cache.tree.CacheDataRowStore.dataRow(CacheDataRowStore.java:129)
>>>        at
>>>org.apache.ignite.internal.processors.cache.tree.CacheDataTree.getRow(CacheDataTree.java:422)
>>>        at
>>>org.apache.ignite.internal.processors.cache.tree.CacheDataTree.getRow(CacheDataTree.java:63)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$ForwardCursor.fillFromBuffer0(BPlusTree.java:5820)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.fillFromBuffer(BPlusTree.java:5586)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.init(BPlusTree.java:5512)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findLowerUnbounded(BPlusTree.java:1068)
>>>        ... 17 common frames omitted
>>>Caused by: java.lang.IllegalStateException: Item not found: 16
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.findIndirectItemIndex(AbstractDataPageIO.java:476)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.getDataOffset(AbstractDataPageIO.java:584)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.tree.io.AbstractDataPageIO.readPayload(AbstractDataPageIO.java:626)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.readIncomplete(CacheDataRowAdapter.java:380)
>>>        at
>>>org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.doInitFromLink(CacheDataRowAdapter.java:316)
>>>
>>>
>>>I also attached the corrupted pages file and diagnostic output for this
>>>case as well
>>>
>>>- corruptedPages_2021-11-09_12-43-12_449.txt
>>>
>>>- diag-2021-11-09.txt
>>>
>>>
>>>In both cases pages are for the caches with the Expire Policy configured.
>>>
>>>
>>>****
>>>
>>>What can be done about that?  Is there any recommended way to stop/start
>>>ignite cluster to prevent such data loss problems?
>>>
>>>****
>>>
>>>I see some similar fixed issues in Jira like
>>>https://issues.apache.org/jira/browse/IGNITE-12489 or 
>>>https://issues.apache.org/jira/browse/IGNITE-14093   but looks like
>>>something still not working in 2.11.0.
>>>
>>>
>>>Thanks,
>>>
>>>--
>>>
>>>  Sergey
>>>
>>>
>>> 
>
>  
 
 
 
 

Reply via email to