Re: Cluster went down after "Unable to await partitions release latch within timeout" WARN

Ilya Kasnacheev Fri, 01 May 2020 02:33:23 -0700

Hello!

This description sounds like a typical hanging Partition Map Exchange (PME),
but you should be able to see that in the logs.
If you don't, you can collect thread dumps from all nodes with jstack and
check them for any stalled operations (or share them with us).
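
Where attaching jstack to a node is not convenient, a similar dump can be taken from inside the JVM with the standard java.lang.management API. The sketch below is only an illustration under that assumption: the class name ThreadDumpSketch is made up for this example, it sees only the JVM it runs in (so it would have to run inside each node's process), and it is not part of Ignite.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not an Ignite API: prints a jstack-like dump of the current JVM.
final class ThreadDumpSketch {

    /** Builds a text dump of every live thread: name, state, awaited lock and full stack. */
    static String dumpThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Ask for lock details only when this JVM can report them.
        ThreadInfo[] infos = mx.dumpAllThreads(
            mx.isObjectMonitorUsageSupported(),
            mx.isSynchronizerUsageSupported());

        StringBuilder sb = new StringBuilder();
        for (ThreadInfo ti : infos) {
            sb.append('"').append(ti.getThreadName()).append('"')
              .append(" state=").append(ti.getThreadState());
            // A blocked or waiting thread reports the lock it is parked on and its owner.
            if (ti.getLockName() != null)
                sb.append(" waiting on ").append(ti.getLockName());
            if (ti.getLockOwnerName() != null)
                sb.append(" owned by \"").append(ti.getLockOwnerName()).append('"');
            sb.append('\n');
            for (StackTraceElement frame : ti.getStackTrace())
                sb.append("    at ").append(frame).append('\n');
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpThreads());
    }
}
```

In such a dump, threads that stay in BLOCKED or WAITING on the same lock across several dumps taken a few seconds apart are the ones worth looking at first.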


Regards,
-- 
Ilya Kasnacheev


On Fri, 1 May 2020 at 11:53, userx <[email protected]> wrote:

> Hi Pavel,
>
> I am using 2.8 and still getting the same issue. Here is the setup:
>
> 19 Ignite servers (S1 to S19), each with a 16GB max JVM heap and running in
> persistent mode.
>
> 96 clients (C1 to C96).
>
> There are 19 machines, with one Ignite server started on each. The clients
> are evenly distributed across the machines.
>
> C19 tries to create a cache and gets a timeout exception, as I have a
> 5-minute timeout. When I looked into the coordinator logs, within that
> 5-minute span it logs messages like this:
>
>
> 2020-04-24 15:37:09,434 WARN [exchange-worker-#45%S1%] {}
>
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture
> - Unable to await partitions release latch within timeout. Some nodes have
> not sent acknowledgement for latch completion. It's possible due to
> unfinishined atomic updates, transactions or not released explicit locks on
> that nodes. Please check logs for errors on nodes with ids reported in
> latch
> `pendingAcks` collection [latch=ServerLatch [permits=4, pendingAcks=HashSet
> [84b8416c-fa06-4544-9ce0-e3dfba41038a,
> 19bd7744-0ced-4123-a35f-ddf0cf9f55c4,
> 533af8f9-c0f6-44b6-92d4-658f86ffaca0,
> 1b31cb25-abbc-4864-88a3-5a4df37a0cf4],
> super=CompletableLatch [id=CompletableLatchUid [id=exchange,
> topVer=AffinityTopologyVersion [topVer=174, minorTopVer=1]]]]]
>
> The 4 nodes which have not acknowledged latch completion are S14, S7, S18
> and S4.
>
> I looked at the logs of S4; they only record C19 joining the topology and
> then leaving it after 5 minutes. The only other thing is that in the GC log
> I consistently see "Total time for which application threads were
> stopped: 0.0006225 seconds, Stopping threads took: 0.0000887 seconds"
>
> I understand that until all the atomic updates and transactions have
> finished, clients are not able to create caches by communicating with the
> coordinator, but is there a way around this?
>
> So the question is: is this issue still present in 2.8?
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>
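
The node IDs listed in `pendingAcks` in the quoted WARN are what need to be matched against server logs. As a rough sketch only, they can be pulled out of a coordinator log with a regular expression; the class name PendingAcks and the patterns below are assumptions based on the single message quoted above, not on any documented Ignite log format.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an Ignite API: extracts node IDs from "pendingAcks=... [...]".
final class PendingAcks {

    // Matches the bracketed collection that follows "pendingAcks=<CollectionType>".
    private static final Pattern ACKS =
        Pattern.compile("pendingAcks=\\w+\\s*\\[([^\\]]*)\\]");

    // Matches a textual UUID such as 84b8416c-fa06-4544-9ce0-e3dfba41038a.
    private static final Pattern NODE_ID =
        Pattern.compile("[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}");

    /** Returns the distinct node IDs found in all pendingAcks collections, in log order. */
    static Set<String> extract(String logText) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher acks = ACKS.matcher(logText);
        while (acks.find()) {
            Matcher id = NODE_ID.matcher(acks.group(1));
            while (id.find())
                ids.add(id.group());
        }
        return ids;
    }

    public static void main(String[] args) {
        String sample = "latch=ServerLatch [permits=4, pendingAcks=HashSet "
            + "[84b8416c-fa06-4544-9ce0-e3dfba41038a, "
            + "19bd7744-0ced-4123-a35f-ddf0cf9f55c4], super=CompletableLatch []]";
        // Print one node ID per line.
        extract(sample).forEach(System.out::println);
    }
}
```

Each extracted ID can then be looked up in the topology snapshot or node logs to find the matching server (S14, S7, S18 and S4 in the message above).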
