[jira] [Commented] (IGNITE-3212) Servers get stuck with the warning "Failed to wait for initial partition map exchange" during falover test

Cong Guo (Jira) Mon, 08 Jun 2020 13:32:13 -0700


    [ 
https://issues.apache.org/jira/browse/IGNITE-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128626#comment-17128626
 ]


Cong Guo commented on IGNITE-3212:
----------------------------------

This issue still exists. Do you have any idea what causes it?

We are using Ignite 2.7.0 and see this issue intermittently. We have three 
server nodes embedded in our application nodes. For example, with three nodes 
A, B, and C, when we restart A, the whole cluster gets stuck during partition 
map exchange.

A logs:

"Failed to wait for initial partition map exchange."

and then repeats indefinitely:

"Failed to wait for partition map exchange", followed by a dump.

The coordinator is C. C logs:

"Unable to await partitions release latch within timeout: ServerLatch 
[permits=1, pendingAcks=[070d58f2-a1a8-4b5f-b986-06a63ac18fb2], 
super=CompletableLatch [id=exchange, topVer=AffinityTopologyVersion [topVer=87, 
minorTopVer=0]]]"

and "Failed to roll back transaction (cache may contain stale locks)"

The pending ACK should be from B (070d58f2-a1a8-4b5f-b986-06a63ac18fb2).
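
For anyone matching the pending ACK to a node the same way, here is a minimal 
sketch that pulls the node IDs out of the latch warning. It uses only the 
message format quoted above. The class name PendingAcks and the method parse 
are hypothetical helpers, not part of Ignite, and the handling of several 
comma-separated IDs is an assumption, since the message above contains only 
one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Ignite: extracts node IDs from the
// "pendingAcks=[...]" part of the coordinator's latch warning.
class PendingAcks {
    // Captures everything between the brackets that follow "pendingAcks=".
    private static final Pattern PENDING =
        Pattern.compile("pendingAcks=\\[([^\\]]*)\\]");

    /** Returns the IDs listed in pendingAcks, or an empty list if none are found. */
    static List<String> parse(String logMessage) {
        List<String> ids = new ArrayList<>();
        Matcher m = PENDING.matcher(logMessage);
        if (m.find()) {
            // Assumption: multiple IDs, if present, are comma-separated.
            for (String id : m.group(1).split(",")) {
                String trimmed = id.trim();
                if (!trimmed.isEmpty()) {
                    ids.add(trimmed);
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String msg = "Unable to await partitions release latch within timeout: "
            + "ServerLatch [permits=1, "
            + "pendingAcks=[070d58f2-a1a8-4b5f-b986-06a63ac18fb2], "
            + "super=CompletableLatch [id=exchange, "
            + "topVer=AffinityTopologyVersion [topVer=87, minorTopVer=0]]]";
        // Prints the single pending node ID from the message above.
        System.out.println(parse(msg));
    }
}
```

The extracted ID can then be compared with the node IDs that each server 
prints at startup or in its topology snapshot.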

B logs:

"Failed to wait for partition release future", followed by a dump of many 
pending transactions.

Our transaction timeout is 35 seconds, but many of these transactions have 
been running for much longer than that. Their state is MARKED_ROLLBACK and 
timeout=true.

> Servers get stuck with the warning "Failed to wait for initial partition map 
> exchange" during falover test
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-3212
>                 URL: https://issues.apache.org/jira/browse/IGNITE-3212
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 1.6
>            Reporter: Ksenia Rybakova
>            Priority: Critical
>             Fix For: 3.0, 2.9
>
>
> Servers being restarted during failover test get stuck after some time with 
> the warning "Failed to wait for initial partition map exchange". 
> {noformat}
> [08:44:41,303][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] 
> Added new node to topology: TcpDiscoveryNode 
> [id=db557f04-43b7-4e28-ae0d-d4dcf4139c89, addrs=
> [10.20.0.222, 127.0.0.1], sockAddrs=[fosters-222/10.20.0.222:47503, 
> /10.20.0.222:47503, /127.0.0.1:47503], discPort=47503, order=44, intOrder=32, 
> lastExchangeTime=1464363880917, loc=false, ver=1.6.0#20160525-sha1:48321a40, 
> isClient=false]
> [08:44:41,304][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] 
> Topology snapshot [ver=44, servers=19, clients=1, CPUs=64, heap=160.0GB]
> [08:45:11,455][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] 
> Added new node to topology: TcpDiscoveryNode 
> [id=6fae61a7-c1c1-40e5-8ad0-8bf5d6c86eb7, addrs=
> [10.20.0.223, 127.0.0.1], sockAddrs=[fosters-223/10.20.0.223:47503, 
> /10.20.0.223:47503, /127.0.0.1:47503], discPort=47503, order=45, intOrder=33, 
> lastExchangeTime=1464363910999, loc=false, ver=1.6.0#20160525-sha1:48321a40, 
> isClient=false]
> [08:45:11,455][INFO ][disco-event-worker-#80%null%][GridDiscoveryManager] 
> Topology snapshot [ver=45, servers=20, clients=1, CPUs=64, heap=170.0GB]
> [08:45:19,942][INFO ][ignite-update-notifier-timer][GridUpdateNotifier] 
> Update status is not available.
> [08:46:20,370][WARN ][main][GridCachePartitionExchangeManager] Failed to wait 
> for initial partition map exchange. Possible reasons are:
>   ^-- Transactions in deadlock.
>   ^-- Long running transactions (ignore if this is the case).
>   ^-- Unreleased explicit locks.
> [08:48:30,375][WARN ][main][GridCachePartitionExchangeManager] Still waiting 
> for initial partition map exchange ...
> {noformat}
> Other nodes log "Failed to wait for partition release future" warnings.
> {noformat}
> [08:09:45,822][WARN ][exchange-worker-#82%null%][GridDhtPartitionsExchangeFuture] 
> Failed to wait for partition release future [topVer=AffinityTopologyVersion 
> [topVer=29, minorTopVer=0], node=cab5d0e0-7365-4774-8f99-d9f131c5d896]. 
> Dumping pending objects that might be the cause:
> [08:09:45,822][WARN ][exchange-worker-#82%null%][GridCachePartitionExchangeManager] 
> Ready affinity version: AffinityTopologyVersion [topVer=28, minorTopVer=1]
> [08:09:45,826][WARN ][exchange-worker-#82%null%][GridCachePartitionExchangeManager] 
> Last exchange future: GridDhtPartitionsExchangeFuture ...
> {noformat}
> Load config:
> - 1 client, 20 servers (5 servers per 1 host)
> - warmup 60
> - duration 66h
> - preload 5M
> - key range 10M
> - operations: PUT PUT_ALL GET GET_ALL INVOKE INVOKE_ALL REMOVE REMOVE_ALL 
> PUT_IF_ABSENT REPLACE
> - backups count 3
> - 3 servers restart every 15 min with a 30 sec step; the pause between stop 
> and start is 5 min



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
