[
https://issues.apache.org/jira/browse/HBASE-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851833#comment-15851833
]
Ted Yu commented on HBASE-17565:
--------------------------------
bq. we test those functions standalone feeding them unusual values
I think this is already covered by the various tests in
TestStochasticLoadBalancer(2), where very unbalanced clusters are given to
StochasticLoadBalancer and we check that the clusters get balanced within the
test duration limit.
It is common practice to use a small EPSILON to judge whether a double can be
deemed 0.0D.
I can make EPSILON smaller if you want.
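For illustration only, a minimal sketch of the EPSILON comparison mentioned above. The class name, method name and the 1e-6 threshold are assumptions made for this example, not the values used in the patch:

```java
// Hypothetical helper; not taken from the HBase source or the attached patch.
final class EpsilonCheck {
    // Assumed threshold. The value chosen in the actual patch may differ.
    static final double EPSILON = 1e-6;

    // A double is deemed zero when its magnitude falls below EPSILON.
    static boolean isZero(double value) {
        return Math.abs(value) < EPSILON;
    }
}
```

An exact comparison such as `0.1 + 0.2 - 0.3 == 0.0` is false in Java because of rounding, while `EpsilonCheck.isZero(0.1 + 0.2 - 0.3)` is true.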
The constant, 1.0f in testNeedBalance(), was introduced by:
HBASE-15529 Override needBalance in StochasticLoadBalancer
[~zghaobac]:
Can you refresh my memory on how the 1.0f was determined?
Let me look at Guanghao's response on HBASE-17261.
> StochasticLoadBalancer may incorrectly skip balancing due to skewed
> multiplier sum
> ----------------------------------------------------------------------------------
>
> Key: HBASE-17565
> URL: https://issues.apache.org/jira/browse/HBASE-17565
> Project: HBase
> Issue Type: Bug
> Reporter: Ted Yu
> Assignee: Ted Yu
> Priority: Critical
> Fix For: 2.0.0, 1.4.0
>
> Attachments: 17565.v1.txt, 17565.v2.txt
>
>
> I was investigating why a 6 node cluster kept skipping balancing requests.
> Here were the region counts on the servers:
> 449, 448, 447, 449, 453, 0
> {code}
> 2017-01-26 22:04:47,145 INFO
> [RpcServer.deafult.FPBQ.Fifo.handler=1,queue=0,port=16000]
> balancer.StochasticLoadBalancer: Skipping load balancing because balanced
> cluster; total cost is 127.0171157050385, sum multiplier is 111087.0 min cost
> which need balance is 0.05
> {code}
> The big multiplier sum caught my eye. Here is what additional debug logging
> showed:
> {code}
> 2017-01-27 23:25:31,749 DEBUG
> [RpcServer.deafult.FPBQ.Fifo.handler=9,queue=0,port=16000]
> balancer.StochasticLoadBalancer: class
> org.apache.hadoop.hbase.master.balancer.
> StochasticLoadBalancer$RegionReplicaHostCostFunction with multiplier 100000.0
> 2017-01-27 23:25:31,749 DEBUG
> [RpcServer.deafult.FPBQ.Fifo.handler=9,queue=0,port=16000]
> balancer.StochasticLoadBalancer: class
> org.apache.hadoop.hbase.master.balancer.
> StochasticLoadBalancer$RegionReplicaRackCostFunction with multiplier 10000.0
> {code}
> Note, however, that no table in the cluster used read replicas.
> I can think of two ways of fixing this situation:
> 1. If there are no read replicas in the cluster, ignore the multipliers for the
> above two functions.
> 2. When the cost() returned by a CostFunction is 0 (or very close to 0.0),
> ignore its multiplier.
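
To make the arithmetic behind option 2 concrete, here is a rough sketch using the numbers from the log above. It assumes the skip decision compares total cost divided by the multiplier sum against the minimum cost (0.05). All class and method names are hypothetical, EPSILON is an assumed value, and the remaining cost functions are lumped into a single entry whose multiplier is 111087 - 100000 - 10000 = 1087:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; names and structure are not taken from StochasticLoadBalancer.
final class NeedBalanceSketch {
    static final double EPSILON = 1e-6; // assumed threshold for "close to 0.0"

    // One cost function reduced to its multiplier and its current cost() value.
    static final class WeightedCost {
        final String name;
        final double multiplier;
        final double cost;

        WeightedCost(String name, double multiplier, double cost) {
            this.name = name;
            this.multiplier = multiplier;
            this.cost = cost;
        }
    }

    // Returns true when the weighted average cost reaches minCostNeedBalance.
    // With skipZeroCost set, functions whose cost is ~0 contribute neither
    // cost nor multiplier (option 2 in the description).
    static boolean needsBalance(List<WeightedCost> functions,
                                double minCostNeedBalance,
                                boolean skipZeroCost) {
        double total = 0.0;
        double sumMultiplier = 0.0;
        for (WeightedCost f : functions) {
            if (skipZeroCost && Math.abs(f.cost) < EPSILON) {
                continue;
            }
            total += f.multiplier * f.cost;
            sumMultiplier += f.multiplier;
        }
        if (total <= 0.0 || sumMultiplier <= 0.0) {
            return false;
        }
        return total / sumMultiplier >= minCostNeedBalance;
    }

    // Figures from the log: total cost 127.0171157050385, multiplier sum 111087.0.
    // The "others" entry is an aggregate assumed for this example.
    static List<WeightedCost> fromLog() {
        return Arrays.asList(
            new WeightedCost("RegionReplicaHostCostFunction", 100000.0, 0.0),
            new WeightedCost("RegionReplicaRackCostFunction", 10000.0, 0.0),
            new WeightedCost("others", 1087.0, 127.0171157050385 / 1087.0));
    }
}
```

With all multipliers counted, 127.017 / 111087 is roughly 0.0011, below 0.05, so balancing is skipped. With the two zero-cost replica functions ignored, 127.017 / 1087 is roughly 0.117, above 0.05, so the cluster is balanced.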