Vladimir Pligin created IGNITE-14248:
----------------------------------------

             Summary: Handle exceptions in 
PartitionReservationManager.onDoneAfterTopologyUnlock properly
                 Key: IGNITE-14248
                 URL: https://issues.apache.org/jira/browse/IGNITE-14248
             Project: Ignite
          Issue Type: Improvement
          Components: cache
    Affects Versions: 2.9.1
            Reporter: Vladimir Pligin


If an exception (or even an Error) is thrown inside the method, the node ends up in an unrecoverable state. Here's an example:
 # an exchange is about to finish, and it's time to invalidate partition reservations
 # the exchange thread delegates the work to a thread in the management pool
 # the management pool tries to allocate a new thread (perhaps the pool is idle and therefore empty)
 # thread allocation fails, e.g. because the ulimit is reached, with java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
 # the error is logged, but no further action is taken
 # the partitions stay reserved forever
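The failure mode above can be reproduced in isolation with a plain JDK executor. Below is a hypothetical sketch (not Ignite code; all names are made up for illustration) in which execute() throws synchronously, the log-and-continue catch swallows it, and the cleanup task silently never runs:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the dispatch in onDoneAfterTopologyUnlock.
// A shut-down pool makes execute() throw synchronously, which simulates
// the OutOfMemoryError thrown while allocating a native thread.
public class SwallowedDispatch {
    static final AtomicBoolean CLEANUP_RAN = new AtomicBoolean(false);

    public static void dispatchCleanup(ExecutorService pool) {
        try {
            // Corresponds to ctx.closure().runLocal(..., MANAGEMENT_POOL).
            pool.execute(() -> CLEANUP_RAN.set(true));
        }
        catch (Throwable e) {
            // Same pattern as the reported code: log and carry on.
            System.err.println("Unexpected exception on start reservations cleanup: " + e);
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.shutdown(); // any subsequent execute() is rejected

        dispatchCleanup(pool);

        // The error was swallowed and the cleanup never happened; in the
        // real system this is the point where partitions stay reserved
        // forever.
        System.out.println("cleanup ran: " + CLEANUP_RAN.get()); // prints "cleanup ran: false"
    }
}
```

The key observation is that the caller has no way to learn that the task was dropped: the method returns normally and the flag (in Ignite, the reservation state) is never updated.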

Message:
2021-02-25 05:52:03.242 [exchange-worker-#182] ERROR 
o.a.i.i.p.q.h.t.PartitionReservationManager - Unexpected exception on start 
reservations cleanup
java.lang.OutOfMemoryError: unable to create native thread: possibly out of 
memory or process/resource limits reached
        at java.base/java.lang.Thread.start0(Native Method)
        at java.base/java.lang.Thread.start(Thread.java:803)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
        at 
org.apache.ignite.internal.processors.closure.GridClosureProcessor.runLocal(GridClosureProcessor.java:847)
        at 
org.apache.ignite.internal.processors.query.h2.twostep.PartitionReservationManager.onDoneAfterTopologyUnlock(PartitionReservationManager.java:323)
        at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onDone(GridDhtPartitionsExchangeFuture.java:2617)
        at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onDone(GridDhtPartitionsExchangeFuture.java:159)
        at 
org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:475)
        at 
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:1064)
        at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3375)
        at 
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3194)
        at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
        at java.base/java.lang.Thread.run(Thread.java:834)
 

Code of PartitionReservationManager.onDoneAfterTopologyUnlock:

 
{code:java}
@Override public void onDoneAfterTopologyUnlock(final GridDhtPartitionsExchangeFuture fut) {
    try {
        // Must not do anything at the exchange thread. Dispatch to the management thread pool.
        ctx.closure().runLocal(() -> {
                AffinityTopologyVersion topVer = ctx.cache().context().exchange()
                    .lastAffinityChangedTopologyVersion(fut.topologyVersion());

                reservations.forEach((key, r) -> {
                    if (r != REPLICATED_RESERVABLE && !F.eq(key.topologyVersion(), topVer)) {
                        assert r instanceof GridDhtPartitionsReservation;

                        ((GridDhtPartitionsReservation)r).invalidate();
                    }
                });
            },
            GridIoPolicy.MANAGEMENT_POOL);
    }
    catch (Throwable e) {
        log.error("Unexpected exception on start reservations cleanup", e);
    }
}
{code}
 

 

I see two basic approaches:
 * kill the node (it's already non-functional at this point)
 * try to recover somehow (to be honest, it's not clear how exactly)

This particular OOM situation actually seems unrecoverable: it's an environment misconfiguration. It would be worth investigating whether potentially recoverable exceptions can be raised inside this block.
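For illustration, here is a hypothetical sketch (plain JDK, not Ignite internals; the names are assumptions, not the actual fix) of what combining the two branches could look like: recover from a plain rejection by running the cleanup inline, and let truly fatal Errors propagate so a failure handler can stop the node:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the "recover" branch: if the pool cannot take the
// cleanup task, fall back to running it on the caller thread rather than
// leaking the reservations. OutOfMemoryError and other Errors deliberately
// propagate, taking the "kill the node" branch (in Ignite, presumably via
// the node's failure handling machinery).
public class CleanupDispatch {
    /** Returns true if the task was handed to the pool, false if it ran inline. */
    public static boolean dispatchOrRunInline(ExecutorService pool, Runnable cleanup) {
        try {
            pool.execute(cleanup);
            return true;
        }
        catch (RejectedExecutionException e) {
            // Pool is saturated or shut down: blocking the caller is bad,
            // but strictly better than never invalidating the reservations.
            cleanup.run();
            return false;
        }
    }

    public static void main(String[] args) {
        AtomicBoolean ran = new AtomicBoolean(false);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.shutdown(); // force the dispatch to fail

        boolean pooled = dispatchOrRunInline(pool, () -> ran.set(true));
        System.out.println("pooled=" + pooled + ", cleanup ran=" + ran.get());
        // prints "pooled=false, cleanup ran=true"
    }
}
```

Note the caveat from the original comment: the exchange thread must not do this work itself, so the inline fallback may be unacceptable in the real code path, in which case only the fail-fast branch remains.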



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
