[ 
https://issues.apache.org/jira/browse/IGNITE-4798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolay Tikhonov updated IGNITE-4798:
-------------------------------------
    Description: 
 I managed to reproduce the stability issue we've been having in production in 
a relatively sterile environment.

The situation is:
1. Start up a cluster of 223 nodes.
2. Wait for everything to stabilize (took about 2 minutes).
3. Shut down 112 nodes.
4. Wait for everything to stabilize.

From that point on, I can't connect client nodes to the cluster:
2017-02-15 23:13:16.396 WARN  o.a.i.i.p.c.GridCachePartitionExchangeManager main                 ctx:             actor:             - Failed to wait for initial partition map exchange. Possible reasons are:
  ^-- Transactions in deadlock.
  ^-- Long running transactions (ignore if this is the case).
  ^-- Unreleased explicit locks.

Other cache operations are also stuck.

 

  was:
   
Hi Valentin,

I managed to reproduce the stability issue we've been having in production in a 
relatively sterile environment.
The logs and stack traces are accessible here: 
https://drive.google.com/open?id=0B1YMrCiHZq1PMWJsblBYSXhaX1k

The situation is:
1. Start up a cluster of 223 nodes.
2. Wait for everything to stabilize (took about 2 minutes).
3. Shut down 112 nodes.
4. Wait for everything to stabilize.

From that point on, I can't connect client nodes to the cluster:
2017-02-15 23:13:16.396 WARN  o.a.i.i.p.c.GridCachePartitionExchangeManager main                 ctx:             actor:             - Failed to wait for initial partition map exchange. Possible reasons are:
  ^-- Transactions in deadlock.
  ^-- Long running transactions (ignore if this is the case).
  ^-- Unreleased explicit locks.

Other cache operations are also stuck.

Let me know what other information I can provide.

 


> Cluster does not finish rebalancing after nodes leaving
> -------------------------------------------------------
>
>                 Key: IGNITE-4798
>                 URL: https://issues.apache.org/jira/browse/IGNITE-4798
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Kholodov
>
>  I managed to reproduce the stability issue we've been having in production 
> in a relatively sterile environment.
> The situation is:
> 1. Start up a cluster of 223 nodes.
> 2. Wait for everything to stabilize (took about 2 minutes).
> 3. Shut down 112 nodes.
> 4. Wait for everything to stabilize.
> From that point on, I can't connect client nodes to the cluster:
> 2017-02-15 23:13:16.396 WARN  o.a.i.i.p.c.GridCachePartitionExchangeManager main                 ctx:             actor:             - Failed to wait for initial partition map exchange. Possible reasons are:
>   ^-- Transactions in deadlock.
>   ^-- Long running transactions (ignore if this is the case).
>   ^-- Unreleased explicit locks.
> Other cache operations are also stuck.
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
