[ 
https://issues.apache.org/jira/browse/IGNITE-1093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951275#comment-14951275
 ] 

Anton Vinogradov commented on IGNITE-1093:
------------------------------------------

Worked on fixing the duplicate partition eviction attempts that happen at the 
end of rebalancing. When the cache contains 20m entries, each node evicts 2gb 
of data when a third node joins the topology. As a result, there are 
throughput problems at the end of rebalancing.




One solution that works is to use a single-threaded pool to evict partitions. 
In this case eviction is less aggressive and causes shorter pauses. But this 
solution gives no 100% guarantee that the gc pause will be shorter than 
failureDetectionTimeout and that the supply node will not leave the topology.
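
A minimal sketch of the single-threaded eviction idea, in plain Java rather 
than Ignite internals. The class SingleThreadEviction, the method 
evictSequentially and the map-based partition model are hypothetical names 
used only for illustration; the point is that a dedicated one-thread executor 
clears partitions strictly one after another.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class SingleThreadEviction {
    /**
     * Clears every partition on one dedicated thread and returns the order
     * in which partition ids were evicted. Partitions are modeled as plain
     * maps (an assumption made for this sketch).
     */
    static List<Integer> evictSequentially(Map<Integer, Map<String, String>> partitions) {
        ExecutorService evictor = Executors.newSingleThreadExecutor();
        List<Integer> order = Collections.synchronizedList(new ArrayList<>());

        for (Map.Entry<Integer, Map<String, String>> e : partitions.entrySet()) {
            // Tasks queue up and run one at a time, so only one partition
            // is being cleared at any moment.
            evictor.submit(() -> {
                e.getValue().clear();
                order.add(e.getKey());
            });
        }

        evictor.shutdown();
        try {
            if (!evictor.awaitTermination(1, TimeUnit.MINUTES))
                throw new IllegalStateException("eviction did not finish in time");
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        }
        return order;
    }
}
```

This limits how much is evicted concurrently, but it does not bound the 
duration of any single gc pause.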




The second solution I checked is to split each partition's eviction into small 
parts, for example no more than 1000 entries each. This gives other callables 
submitted to the system pool a chance to be executed without long delays.
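
A rough sketch of the batched approach, again in plain Java with hypothetical 
names (ChunkedEviction, evictBatch, scheduleEviction, evictAll) and a 
partition modeled as a ConcurrentHashMap. Each task removes at most 1000 
entries and then resubmits itself, so other tasks queued in the same pool can 
run between batches.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class ChunkedEviction {
    static final int BATCH_SIZE = 1000;

    /** Removes up to BATCH_SIZE entries; returns true if entries remain. */
    static boolean evictBatch(Map<Integer, Integer> part) {
        Iterator<Integer> it = part.keySet().iterator();
        for (int i = 0; i < BATCH_SIZE && it.hasNext(); i++) {
            it.next();
            it.remove();
        }
        return !part.isEmpty();
    }

    /** Runs one batch, then requeues itself behind other pending tasks. */
    static void scheduleEviction(Map<Integer, Integer> part, ExecutorService pool,
                                 AtomicInteger batches, CountDownLatch done) {
        pool.execute(() -> {
            batches.incrementAndGet();
            if (evictBatch(part))
                scheduleEviction(part, pool, batches, done);
            else
                done.countDown();
        });
    }

    /** Evicts the whole partition and returns how many batches it took. */
    static int evictAll(Map<Integer, Integer> part) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        AtomicInteger batches = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(1);
        scheduleEviction(part, pool, batches, done);
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return batches.get();
    }
}
```

The map passed in should be a concurrent one, since its iterator must 
support removal while batches run on different pool threads.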

This scheme works sometimes, but there is still a good chance of big gc 
pauses. 

I checked this solution and it passed 3 of 3 attempts, but when I decided to 
recheck, it began to fail again.




I think that I tried to fix the consequence instead of the cause.



Currently partitions are evicted in bulk, for example by the checkEvictions 
method, and that seems to be the reason for the long gc pauses.

It would be better to rent partitions step by step on the 
EVT_CACHE_REBALANCE_PART_LOADED event.

In this case there is less chance of long gc pauses; uniformity will be 
guaranteed by the rebalancing process, which uses a limited number of 
threads.
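
To illustrate the proposal, here is a sketch of event-driven renting in plain 
Java. It does not use the Ignite event API; IncrementalRent, 
onPartitionLoaded and the evictor callback are hypothetical names, and the 
callback stands in for whatever would handle 
EVT_CACHE_REBALANCE_PART_LOADED. Each notification rents exactly one 
partition, so eviction proceeds at the pace of rebalancing instead of in one 
bulk pass.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.IntConsumer;

final class IncrementalRent {
    /** Partitions that still have to be rented (evicted) on this node. */
    private final Set<Integer> pendingRent = new TreeSet<>();

    /** Hypothetical hook that performs the actual eviction of one partition. */
    private final IntConsumer evictor;

    IncrementalRent(Set<Integer> obsoletePartitions, IntConsumer evictor) {
        this.pendingRent.addAll(obsoletePartitions);
        this.evictor = evictor;
    }

    /**
     * Called once per "partition loaded" notification. Rents only that
     * partition, and only if it is still pending, so duplicate or unrelated
     * notifications are ignored.
     */
    synchronized void onPartitionLoaded(int partId) {
        if (pendingRent.remove(partId))
            evictor.accept(partId);
    }

    /** Number of partitions still waiting to be rented. */
    synchronized int pending() {
        return pendingRent.size();
    }
}
```

Removing the id from the pending set before evicting also guards against the 
duplicate eviction attempts mentioned at the top of this comment.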



Thoughts?

> Rebalancing with default parameters is very slow
> ------------------------------------------------
>
>                 Key: IGNITE-1093
>                 URL: https://issues.apache.org/jira/browse/IGNITE-1093
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: sprint-7
>            Reporter: Pavel Konstantinov
>            Assignee: Anton Vinogradov
>            Priority: Critical
>             Fix For: 1.5
>
>         Attachments: Plot_ThroughputLatencyProbe_01.png, rebalancing.zip
>
>
> # Start one node with partitioned cache with one backup.
> # Load 40 billion keys into the cache using DataStreamer
> # Start second node on the same host



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
