Sergey Korotkov created IGNITE-24992:
----------------------------------------

             Summary: Hang in put() and starvation in sys-striped pool if 
RANDOM_2_LRU eviction policy is used
                 Key: IGNITE-24992
                 URL: https://issues.apache.org/jira/browse/IGNITE-24992
             Project: Ignite
          Issue Type: Bug
            Reporter: Sergey Korotkov
         Attachments: Random2LruPageEvictionPutLargeObjectsTest.java

In-memory cluster.

RANDOM_2_LRU eviction policy is applied.

A put of large objects that span several pages can loop indefinitely in 
IgniteCacheDatabaseSharedManager.ensureFreeSpace(), because 
Random2LruPageEvictionTracker.evictDataPage() keeps failing to find a page to 
evict.

The immediate reason is that the RANDOM_2_LRU approach can only evict pages 
"with at least one touch". For large (fragmented) objects only the last page is 
touched (see the PageEvictionTracker.touchPage() call in the 
AbstractFreeList#WriteRowHandler.addRow() method). So if only large objects 
exist, the data region contains only a very small fraction of "touched" pages 
eligible for eviction. It appears that 5000 random attempts are not enough 
to find 5 candidate pages to evict, so 
Random2LruPageEvictionTracker.evictDataPage() fails.
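
The effect described above can be illustrated with a simplified model. This is 
only a sketch and not Ignite's actual code: the class EvictionSamplingSketch, 
its methods and the contiguous page layout are hypothetical. Only the limits 
(5 candidates, 5000 attempts) are taken from the description and the log 
warning. The real tracker samples pages differently, so the model shows only 
the direction of the effect: the fewer pages are touched, the fewer candidates 
a bounded random search finds.

```java
import java.util.Random;

/** Simplified model of bounded random page sampling; not Ignite's implementation. */
final class EvictionSamplingSketch {
    /** Candidates wanted per eviction (the "5 candidate pages" in the description). */
    static final int SAMPLE_SIZE = 5;

    /** Random attempts allowed (the "5000" from the log warning). */
    static final int ATTEMPT_LIMIT = 5000;

    /**
     * Picks random page indexes until SAMPLE_SIZE touched pages were hit or
     * ATTEMPT_LIMIT is exhausted. The same page may be hit more than once,
     * which is acceptable for this model. Returns the number of hits.
     */
    static int collectCandidates(boolean[] touched, Random rnd) {
        int found = 0;

        for (int attempt = 0; attempt < ATTEMPT_LIMIT && found < SAMPLE_SIZE; attempt++) {
            if (touched[rnd.nextInt(touched.length)])
                found++;
        }

        return found;
    }

    /** Hypothetical layout: objects stored back to back, only the last page of each is touched. */
    static boolean[] layout(int totalPages, int pagesPerObject) {
        boolean[] touched = new boolean[totalPages];

        for (int last = pagesPerObject - 1; last < totalPages; last += pagesPerObject)
            touched[last] = true;

        return touched;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);

        // The touched fraction is 1/pagesPerObject, so it shrinks as objects grow.
        for (int pagesPerObject : new int[] {1, 100, 10_000}) {
            boolean[] touched = layout(1_000_000, pagesPerObject);

            System.out.println(pagesPerObject + " pages/object -> candidates found: "
                + collectCandidates(touched, rnd));
        }
    }
}
```

In this model a caller that simply retries after a failed search, as 
ensureFreeSpace() appears to do, would keep spinning for as long as the touched 
fraction stays that small.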

Reproducer is attached [^Random2LruPageEvictionPutLargeObjectsTest.java]. 

It hangs after the 12th put.

 

***

The system striped pool can starve for a long time (up to 14 hours in one real 
production environment, until the nodes were manually restarted), with the 
following errors logged:
{noformat}
[2025-04-02T16:34:23,108][WARN 
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteCacheDatabaseSharedManager]
 Page-based evictions started. Consider increasing 'maxSize' on Data Region 
configuration: default
[2025-04-02T16:34:23,110][WARN 
][sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%][Random2LruPageEvictionTracker]
 Too many attempts to choose data page: 5000

....

[2025-04-02T16:34:48,277][ERROR][tcp-disco-msg-worker-[15e5f20b 
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][G]
 Blocked system-critical thread has been detected. This can lead to 
cluster-wide undefined behaviour [workerName=sys-stripe-7, 
threadName=sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%, 
blockedFor=25s]
[2025-04-02T16:34:48,278][WARN ][tcp-disco-msg-worker-[15e5f20b 
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][IgniteTestResources]
 Possible failure suppressed accordingly to a configured handler 
[hnd=NoOpFailureHandler [super=AbstractFailureHandler 
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext 
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker 
[name=sys-stripe-7, 
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, 
finished=false, heartbeatTs=1743586463106]]]
 org.apache.ignite.IgniteException: GridWorker [name=sys-stripe-7, 
igniteInstanceName=paged.Random2LruPageEvictionPutLargeObjectsTest1, 
finished=false, heartbeatTs=1743586463106]
        at 
org.apache.ignite.internal.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152)
 ~[classes/:?]
        at 
org.apache.ignite.internal.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251)
 ~[classes/:?]
        at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812)
 ~[classes/:?]
        at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732)
 ~[classes/:?]
        at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225)
 ~[classes/:?]
        at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271)
 ~[classes/:?]
        at 
org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266)
 ~[classes/:?]
        at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096)
 ~[classes/:?]
        at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597)
 ~[classes/:?]
        at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398)
 ~[classes/:?]
        at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316)
 ~[classes/:?]
        at 
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
 ~[classes/:?]
        at 
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1877)
 ~[classes/:?]
        at 
org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1498)
 ~[classes/:?]
        at 
org.apache.ignite.internal.managers.communication.GridIoManager$9.execute(GridIoManager.java:1402)
 ~[classes/:?]
        at 
org.apache.ignite.internal.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
 ~[classes/:?]
        at 
org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:637)
 ~[classes/:?]
        at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125) 
~[classes/:?]
        at java.base/java.lang.Thread.run(Thread.java:829) ~[?:?]
[2025-04-02T16:34:48,279][WARN ][tcp-disco-msg-worker-[15e5f20b 
127.0.0.1:47500]-#8%paged.Random2LruPageEvictionPutLargeObjectsTest1%-#92%paged.Random2LruPageEvictionPutLargeObjectsTest1%][CacheDiagnosticManager]
 Page locks dump:

[2025-04-02T16:34:48,661][WARN 
][grid-timeout-worker-#76%paged.Random2LruPageEvictionPutLargeObjectsTest1%][PoolProcessor]
 >>> Possible starvation in striped pool.
    Thread name: 
sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%
    Queue: []
    Deadlock: false
    Completed: 2
Thread 
[name="sys-stripe-7-#62%paged.Random2LruPageEvictionPutLargeObjectsTest1%", 
id=80, state=RUNNABLE, blockCnt=0, waitCnt=3]
        at 
app//o.a.i.i.processors.cache.persistence.evict.Random2LruPageEvictionTracker.evictDataPage(Random2LruPageEvictionTracker.java:152)
        at 
app//o.a.i.i.processors.cache.persistence.IgniteCacheDatabaseSharedManager.ensureFreeSpace(IgniteCacheDatabaseSharedManager.java:1251)
        at 
app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1812)
        at 
app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1732)
        at 
app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3225)
        at 
app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:271)
        at 
app//o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$3.apply(GridDhtAtomicCache.java:266)
        at 
app//o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1096)
        at 
app//o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:597)
        at 
app//o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:398)
        at 
app//o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:316)
        at 
app//o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:306)
        at 
app//o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1877)
        at 
app//o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1498)
        at 
app//o.a.i.i.managers.communication.GridIoManager$9.execute(GridIoManager.java:1402)
        at 
app//o.a.i.i.managers.communication.TraceRunnable.run(TraceRunnable.java:55)
        at 
app//o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:637)
        at app//o.a.i.i.util.worker.GridWorker.run(GridWorker.java:125)
        at java.base@11.0.26/java.lang.Thread.run(Thread.java:829)
{noformat}
 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
