[ https://issues.apache.org/jira/browse/IGNITE-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951275#comment-14951275 ]
Anton Vinogradov commented on IGNITE-1093:
------------------------------------------

I worked on fixing the duplicate partition eviction attempts that happen at the end of rebalancing. When the cache contains 20 million entries, each node evicts 2 GB of data when a third node joins the topology. As a result, there are throughput problems at the end of rebalancing.

One solution that works is to use a single-thread pool to evict partitions. In this case eviction is less aggressive and causes shorter pauses. But this solution gives no 100% guarantee that a GC pause will be shorter than failureDetectionTimeout and that the supply node will not leave the topology.

The second solution I checked is to split each partition's eviction into small parts, for example no more than 1000 entries each. This gives other callables submitted to the system pool a chance to be executed without long delays. This scheme works sometimes, but still has a good chance of producing long GC pauses. I checked this solution and it passed 3 of 3 attempts, but when I decided to recheck it, it started to fail again.

I think I tried to treat the symptom instead of the cause. Currently partitions are evicted in bulk, for example by the checkEvictions method, and that seems to be the reason for the long GC pauses. It would be better to rent partitions step by step on the EVT_CACHE_REBALANCE_PART_LOADED event. In this case long GC pauses are less likely, and uniformity is guaranteed by the rebalancing process, which uses a limited number of threads.

Thoughts?

> Rebalancing with default parameters is very slow
> ------------------------------------------------
>
>                 Key: IGNITE-1093
>                 URL: https://issues.apache.org/jira/browse/IGNITE-1093
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>   Affects Versions: sprint-7
>            Reporter: Pavel Konstantinov
>            Assignee: Anton Vinogradov
>            Priority: Critical
>             Fix For: 1.5
>
>         Attachments: Plot_ThroughputLatencyProbe_01.png, rebalancing.zip
>
>
> # Start one node with partitioned cache with one backup.
> # Load 40 billion keys into the cache using DataStreamer
> # Start second node on the same host



--
This message was sent by Atlassian JIRA (v6.3.4#6332)
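
For illustration only, the second approach from the comment above (evicting a partition in small parts on a single-thread pool) might look roughly like the sketch below. It does not use Ignite's internal API: the class name, `evictInBatches`, `runEviction`, and the in-memory queue standing in for a partition are all hypothetical, and the batch size of 1000 is simply the example figure from the comment.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class ChunkedEvictionSketch {
    /** Hypothetical batch size, taken from the "no more than 1000 entries" example. */
    static final int BATCH_SIZE = 1000;

    /**
     * Removes up to BATCH_SIZE entries, then re-submits itself to the pool,
     * so other tasks queued on the same pool can run between batches.
     */
    static void evictInBatches(Queue<Integer> partition, ExecutorService pool,
                               AtomicInteger batches, Runnable onDone) {
        pool.submit(() -> {
            for (int i = 0; i < BATCH_SIZE && !partition.isEmpty(); i++)
                partition.poll(); // stand-in for clearing one cache entry

            batches.incrementAndGet();

            if (partition.isEmpty())
                onDone.run();
            else
                evictInBatches(partition, pool, batches, onDone);
        });
    }

    /** Evicts a fake partition of the given size and returns the number of batches used. */
    static int runEviction(int entries) {
        Queue<Integer> partition = new ArrayDeque<>();
        for (int i = 0; i < entries; i++)
            partition.add(i);

        // Single-thread pool: at most one eviction batch runs at a time.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        AtomicInteger batches = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(1);

        try {
            evictInBatches(partition, pool, batches, done::countDown);
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return batches.get();
    }

    public static void main(String[] args) {
        System.out.println("batches used: " + runEviction(2500));
    }
}
```

Because each batch is a separate task, work submitted to the same pool between batches is not blocked for the whole partition. As the comment notes, this only bounds the size of each eviction step and does not by itself guarantee short GC pauses.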