Yakov,

But we already have DFLT_SEND_RETRY_CNT and DFLT_SEND_RETRY_DELAY for configuring our CommunicationSPI behavior. What if a user configures these parameters in their own way and then sees a lot of WARN messages in the log that make no sense?
Maybe we could use GridCachePartitionExchangeManager#forceRebalance (or maybe forceReassign) if rebalancing still fails after all those retries. What do you think?

Mon, Jul 16, 2018 at 21:12, Yakov Zhdanov <yzhda...@gridgain.com>:

> Maxim, I looked at the code you provided. I think we need to add some
> timeout validation and output a warning to the logs on the demander side
> in case there is no supply message within 30 secs, and then repeat the
> demanding process. This should apply to any demand message throughout the
> rebalancing process, not only the 1st one.
>
> You can use the following message:
>
> Failed to wait for supply message from node within 30 secs [cache=C,
> partId=XX]
>
> Alex Goncharuk, do you have comments here?
>
> Yakov Zhdanov
> www.gridgain.com
>
> 2018-07-14 19:45 GMT+03:00 Maxim Muzafarov <maxmu...@gmail.com>:
>
> > Yakov,
> >
> > Yes, you're right. The whole rebalancing progress will be stopped.
> >
> > Actually, the rebalancing order doesn't matter; you're right about that
> > too. The javadoc just describes the idea of how rebalance should work
> > for caches, but in fact it doesn't work as described. Personally, I'd
> > prefer to start the rebalance of each cache group asynchronously and
> > independently.
> >
> > Please look at my reproducer [1].
> >
> > Scenario:
> > Cluster with two REPLICATED caches.
> > Start a new node.
> > Rebalance of the first cache group fails to start (e.g. network
> > issues) - it's OK.
> > Rebalance of the second cache group will never be started - all further
> > progress gets stuck (I think rebalance here should be started!).
> >
> > [1]
> > https://github.com/Mmuzaf/ignite/blob/rebalance-cancel/modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/rebalancing/GridCacheRebalancingCancelSelfTest.java
> >
> > Fri, Jul 13, 2018 at 17:46, Yakov Zhdanov <yzhda...@apache.org>:
> >
> > > Maxim, I do not understand the problem. Imagine I do not have any
> > > ordering but rebalancing of some cache fails to start - so in my
> > > understanding overall rebalancing progress becomes blocked. Is that
> > > true?
> > >
> > > Can you please provide a reproducer for your problem?
> > >
> > > --Yakov
> > >
> > > 2018-07-09 16:42 GMT+03:00 Maxim Muzafarov <maxmu...@gmail.com>:
> > >
> > > > Hello Igniters,
> > > >
> > > > Each cache group has a “rebalance order” property. As the javadoc
> > > > for getRebalanceOrder() says: “Note that cache with order {@code 0}
> > > > does not participate in ordering. This means that cache with
> > > > rebalance order {@code 0} will never wait for any other caches. All
> > > > caches with order {@code 0} will be rebalanced right away
> > > > concurrently with each other and ordered rebalance processes. If
> > > > not set, cache order is 0, i.e. rebalancing is not ordered.”
> > > >
> > > > In fact, GridCachePartitionExchangeManager always builds the chain
> > > > of rebalancing cache groups to start (even for cache order ZERO):
> > > >
> > > > ignite-sys-cache -> cacheR -> cacheR3 -> cacheR2 -> cacheR5 -> cacheR1.
> > > >
> > > > If one of these groups fails to start, the further groups will
> > > > never be run.
> > > >
> > > > *Question 1*: Should we fix the javadoc description or create a bug
> > > > for fixing such rebalance behavior?
> > > >
> > > > [1]
> > > > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2630
> >
> > --
> > Maxim Muzafarov

--
Maxim Muzafarov
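
To make the proposal in this thread easier to discuss, here is a minimal, self-contained sketch of the idea: wait for a supply message with a timeout, log a warning and repeat the demand on each timeout, and fall back to a forced rebalance once the attempts are exhausted. Everything in it is an assumption made for illustration: SupplyWaitSketch, demandWithRetries, the queue standing in for incoming supply messages, and the two Runnable callbacks are hypothetical names, not Ignite APIs, and the attempt limit is a plain parameter rather than a real configuration property. The warning text follows the wording suggested above, with the timeout in milliseconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: none of these names are real Ignite APIs. */
public class SupplyWaitSketch {
    /** Outcome of the demand loop. */
    public enum Result { SUPPLIED, FORCED_REBALANCE }

    /** Warnings are collected in a list instead of a real logger to keep the sketch standalone. */
    public static final List<String> WARNINGS = new ArrayList<>();

    /**
     * Sends a demand, waits up to timeoutMs for a supply message, and repeats on timeout.
     * After maxAttempts unsuccessful attempts it invokes the forceRebalance callback
     * (a stand-in for something like forceRebalance or forceReassign).
     */
    public static Result demandWithRetries(
        BlockingQueue<String> supplyMsgs,
        Runnable sendDemand,
        Runnable forceRebalance,
        String cache,
        int partId,
        long timeoutMs,
        int maxAttempts
    ) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            sendDemand.run();

            try {
                // poll() returns null if no supply message arrives within the timeout.
                String msg = supplyMsgs.poll(timeoutMs, TimeUnit.MILLISECONDS);

                if (msg != null)
                    return Result.SUPPLIED;
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();

                throw new IllegalStateException("Interrupted while waiting for supply message", e);
            }

            WARNINGS.add("Failed to wait for supply message from node within " + timeoutMs +
                " ms [cache=" + cache + ", partId=" + partId + ", attempt=" + attempt + ']');
        }

        // All attempts timed out: give up on this demand and force a new rebalance round.
        forceRebalance.run();

        return Result.FORCED_REBALANCE;
    }

    public static void main(String[] args) {
        // An empty queue simulates a supplier that never answers.
        BlockingQueue<String> supplyMsgs = new LinkedBlockingQueue<>();

        Result res = demandWithRetries(
            supplyMsgs,
            () -> System.out.println("demand sent"),
            () -> System.out.println("forcing rebalance"),
            "C", 0, 50L, 3);

        WARNINGS.forEach(System.out::println);
        System.out.println(res);
    }
}
```

Bounding the number of attempts separately from the communication retry settings is one possible way to address the concern about noisy WARN messages: the number of warnings per demand is then capped by the attempt limit, whatever the user has configured for send retries.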