Sergey Korotkov created IGNITE-24992:
----------------------------------------

Summary: Hang in put() and starvation in sys-striped pool if RANDOM_2_LRU eviction policy is used
Key: IGNITE-24992
URL: https://issues.apache.org/jira/browse/IGNITE-24992
Project: Ignite
Issue Type: Bug
Reporter: Sergey Korotkov
Attachments: Random2LruPageEvictionPutLargeObjectsTest.java

In-memory cluster. The RANDOM_2_LRU eviction policy is applied.

A put of a large object that occupies several pages can hang, looping in IgniteCacheDatabaseSharedManager.ensureFreeSpace(), because Random2LruPageEvictionTracker.evictDataPage() keeps failing to find a page to evict.

The immediate reason is that the RANDOM_2_LRU approach can only evict pages "with at least one touch". For large (fragmented) objects only the last page is touched (see the PageEvictionTracker.touchPage() call in the AbstractFreeList#WriteRowHandler.addRow() method). So if only large objects exist, the data region has a very small fraction of "touched" pages eligible for eviction. It appears that 5000 random attempts are not enough to get 5 candidate pages to evict, so Random2LruPageEvictionTracker.evictDataPage() fails.

Reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java]. It hangs after the 12th put.

***

The system striped pool can starve for a long time (up to 14 hours once in a real production environment, until the nodes were manually restarted), with the following errors logged:

{noformat}
[2025-04-02T16:34:23,108][WARN ][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager] Page-based evictions started. Consider increasing 'maxSize' on Data Region configuration: default
[2025-04-02T16:34:23,110][WARN ][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker] Too many attempts to choose data page: 5000
....
[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b 127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-7, threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%, blockedFor=25s]
[2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b 127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources] Possible failure suppressed accordingly to a configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=sys-stripe-7, igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, finished=false, heartbeatTs=1743586463106]]]
org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7, igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, finished=false, heartbeatTs=1743586463106]
    at org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152) ~[classes/:?]
    at org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251) ~[classes/:?]
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812) ~[classes/:?]
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732) ~[classes/:?]
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225) ~[classes/:?]
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271) ~[classes/:?]
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266) ~[classes/:?]
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096) ~[classes/:?]
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597) ~[classes/:?]
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398) ~[classes/:?]
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316) ~[classes/:?]
    at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306) ~[classes/:?]
    at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1877) ~[classes/:?]
    at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1498) ~[classes/:?]
    at org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1402) ~[classes/:?]
    at org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55) ~[classes/:?]
    at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:637) ~[classes/:?]
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125) ~[classes/:?]
    at java.base/java.lang.Thread.run(Thread.java:829) ~[?:?]
[2025-04-02T16:34:48,279][WARN ][tcp-disco-msg-worker-[15e5f20b 127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][CacheDiagnosticManager] Page locks dump:
[2025-04-02T16:34:48,661][WARN ][grid-timeout-worker-#76%paged.Random2LruPageEvictionPutLargeObjectsTest1%][PoolProcessor] >>> Possible starvation in striped pool.
    Thread name: sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%
    Queue: []
    Deadlock: false
    Completed: 2
Thread [name="sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%", id=80, state=RUNNABLE, blockCnt=0, waitCnt=3]
    at app//o.a.i.i.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152)
    at app//o.a.i.i.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251)
    at app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812)
    at app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732)
    at app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225)
    at app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271)
    at app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266)
    at app//o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096)
    at app//o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597)
    at app//o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398)
    at app//o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316)
    at app//o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
    at app//o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1877)
    at app//o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1498)
    at app//o.a.i.i.managers.communication.GridIoManager$9.execute(GridIoManager.java:1402)
    at app//o.a.i.i.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
    at app//o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:637)
    at app//o.a.i.i.util.worker.GridWorker.run(GridWorker.java:125)
    at java.base@11.0.26/java.lang.Thread.run(Thread.java:829)
{noformat}

--
This message was sent by Atlassian Jira (v8.20.10#820010)
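
Illustrative note on the sampling argument in the description. The following is a rough sketch, not Ignite's implementation. It assumes a simplified model in which each random attempt independently lands on a "touched" page with probability equal to the touched fraction f of the data region. The class name TouchedPageSampling and the method probAtLeast are hypothetical; only the figures 5000 (attempts) and 5 (candidate pages) come from the description above. Under that assumption the number of hits is binomial, and the sketch computes the chance of collecting 5 candidates within 5000 attempts for a few values of f.

```java
/** Hypothetical, simplified model of random page sampling; not Ignite code. */
public class TouchedPageSampling {
    /**
     * Probability that at least k of n independent attempts hit a touched page,
     * assuming each attempt hits one with probability p (binomial model).
     * Intended for small k: if (1 - p)^n underflows to 0, the expected number
     * of hits is so large that P(X < k) is negligible for small k anyway.
     */
    static double probAtLeast(int k, int n, double p) {
        if (k <= 0) return 1.0;
        if (k > n || p <= 0.0) return 0.0;
        if (p >= 1.0) return 1.0;

        double term = Math.pow(1.0 - p, n); // P(X = 0)
        double below = 0.0;                 // accumulates P(X < k)
        for (int i = 0; i < k; i++) {
            below += term;
            // Recurrence: P(X = i + 1) = P(X = i) * (n - i) / (i + 1) * p / (1 - p)
            term *= (double) (n - i) / (i + 1) * (p / (1.0 - p));
        }
        return Math.max(0.0, 1.0 - below);
    }

    public static void main(String[] args) {
        int attempts = 5000;  // attempt limit quoted in the issue
        int candidates = 5;   // candidate pages quoted in the issue
        double[] fractions = {0.5, 0.01, 0.001, 0.0001}; // assumed touched fractions
        for (double f : fractions) {
            System.out.printf("touched fraction %.4f -> P(at least %d hits) = %.6f%n",
                f, candidates, probAtLeast(candidates, attempts, f));
        }
    }
}
```

In this model the expected number of hits is 5000 * f, so finding 5 candidates becomes unlikely once f falls well below 5/5000 = 0.001. Whether the real tracker behaves this way depends on details of Random2LruPageEvictionTracker that this sketch does not model.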