The balancer logs you shared suggest it's deciding to move regions based on
the following factors:

   - Data locality (HDFS blocks for the regions' files)
   - Read/Write load
   - Memstore size/utilisation

So you need to look into those stats. It could be that the cluster
is in a "hotspot" situation, where a small subset of your regions handles
most of the requests.
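As a rough illustration (the region names and request counts below are
made up, not from your cluster), a quick hotspot check over per-region
read counts, e.g. scraped from the region server metrics, could look like:

```python
# Hypothetical per-region read request counts (made-up numbers for
# illustration); in practice these would come from region server
# metrics / JMX.
read_requests = {
    "region-a": 90_000, "region-b": 85_000, "region-c": 2_000,
    "region-d": 1_500, "region-e": 1_200, "region-f": 300,
}

total = sum(read_requests.values())
# Share of all reads handled by the two busiest regions.
top = sorted(read_requests.values(), reverse=True)[:2]
top_share = sum(top) / total
print(f"top 2 regions handle {top_share:.0%} of reads")
if top_share > 0.5:
    print("likely hotspot: a few regions dominate the request load")
```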


On Thu, Aug 29, 2024 at 21:22, Frens Jan Rumph
<frens....@web-iq.com.invalid> wrote:

> Dear HBase users/devs!
>
>
> *Summary*
>
> After a node outage, the HBase balancer was switched off. When it was
> turned back on later, the StochasticLoadBalancer increased the region
> count skew, which, given the mostly default configuration, is unexpected.
> Any help is much appreciated!
>
>
> *Details:*
>
> I’m fighting an issue with HBase 2.5.7 on an 11-node cluster with
> ~15,000 regions from ~1,000 tables. I’m hoping that someone has a pointer.
>
> *Incident -> turned balancer off*
>
> We’ve recently lost one of the nodes and ran into severe data imbalance
> issues at the level of the HDFS disks while the cluster was ‘only’ 80%
> full. Some nodes were filling up to over 98%, causing YARN to take these
> nodes out of rotation. We were unable to identify the cause of this
> imbalance. In an attempt to mitigate this, the HBase region balancer was
> disabled.
>
> *Manually under control -> turned balancer on again*
>
> Two region servers had a hard restart after the initial incident, so
> regions were reassigned, but not yet balanced. I didn’t dare turn the
> balancer back on right away, fearing a return to the imbalanced disk
> usage. So regions were manually re-assigned (with some scripting) to get
> back to a balanced situation of ~1,500 regions per node; in a naive way,
> similar to what the SimpleLoadBalancer does.
>
> We’ve got the disk usage fairly balanced right now. So I turned the
> balancer back on.
>
> *Region count skew increased*
>
> However, it started moving regions away from a few nodes quite
> aggressively. Every run it moved 2,000 to 4,000 regions, expecting a cost
> decrease. But at the next run, the initial computed cost was higher than
> before. I gave the balancer some rounds, but stopped it once some servers
> held only ~400 regions while others were responsible for 2,000+ regions,
> the limit above which splits are prevented.
>
> This chart shows the effect of switching the balancer on at ~09:30; I
> stopped it at ~11:30:
>
> [image: Screenshot 2024-08-29 at 22.04.58.png]
>
>
> Some (formatted) example logging from the Balancer chore:
>
> 2024-08-28 09:57:54,678 INFO  [master/m1:16000.Chore.5] 
> balancer.StochasticLoadBalancer: ...
>     Going from a computed imbalance of 1.4793890584018785 to a new imbalance 
> of 0.69336982505148. funtionCost=
>     RegionCountSkewCostFunction : (multiplier=500.0, 
> imbalance=0.004313540707257566);
>     PrimaryRegionCountSkewCostFunction : (not needed);
>     MoveCostFunction : (multiplier=7.0, imbalance=0.1888262494457465, need 
> balance);
>     ServerLocalityCostFunction : (multiplier=25.0, 
> imbalance=0.39761170318154926, need balance);
>     RackLocalityCostFunction : (multiplier=15.0, imbalance=0.0);
>     TableSkewCostFunction : (multiplier=35.0, imbalance=11.404401695266312, 
> need balance);
>     RegionReplicaHostCostFunction : (not needed);
>     RegionReplicaRackCostFunction : (not needed);
>     ReadRequestCostFunction : (multiplier=5.0, 
> imbalance=0.028254565577063396, need balance);
>     WriteRequestCostFunction : (multiplier=5.0, imbalance=0.7593874996431397, 
> need balance);
>     MemStoreSizeCostFunction : (multiplier=5.0, 
> imbalance=0.16192309175499753, need balance);
>     StoreFileCostFunction : (multiplier=5.0, imbalance=0.01758057650125178);
>
> ...
>
> 2024-08-28 10:26:34,946 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=63,queue=3,port=16000] 
> balancer.StochasticLoadBalancer: ...
>     Going from a computed imbalance of 1.5853428527425468 to a new imbalance 
> of 0.6737463520617091. funtionCost=
>     RegionCountSkewCostFunction : (multiplier=500.0, 
> imbalance=0.023543776971639504);
>     PrimaryRegionCountSkewCostFunction : (not needed);
>     MoveCostFunction : (multiplier=7.0, imbalance=0.20349610488314648, need 
> balance);
>     ServerLocalityCostFunction : (multiplier=25.0, 
> imbalance=0.41889718087643735, need balance);
>     RackLocalityCostFunction : (multiplier=15.0, imbalance=0.0);
>     TableSkewCostFunction : (multiplier=35.0, imbalance=10.849642781445127, 
> need balance);
>     RegionReplicaHostCostFunction : (not needed);
>     RegionReplicaRackCostFunction : (not needed);
>     ReadRequestCostFunction : (multiplier=5.0, imbalance=0.02832763401695891, 
> need balance);
>     WriteRequestCostFunction : (multiplier=5.0, imbalance=0.2960273848432453, 
> need balance);
>     MemStoreSizeCostFunction : (multiplier=5.0, 
> imbalance=0.08973896446650413, need balance);
>     StoreFileCostFunction : (multiplier=5.0, imbalance=0.02370918640463713);
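One way to read those numbers: the overall imbalance appears to be the
multiplier-weighted mean of the individual cost functions, i.e.
sum(multiplier × imbalance) / sum(multipliers), with the "(not needed)"
functions excluded. That weighting formula is my reading of the log, not
something I've verified against the HBase source, but plugging the values
from the second excerpt back in reproduces the logged 0.6737… exactly,
and it shows TableSkewCostFunction (raw value above 10 while every other
cost is below 1) swamping everything else despite RegionCountSkew's 500x
multiplier:

```python
# Per-function (multiplier, imbalance) pairs copied from the second
# log excerpt above; "(not needed)" functions are excluded.
costs = {
    "RegionCountSkew": (500.0, 0.023543776971639504),
    "MoveCost":        (7.0,   0.20349610488314648),
    "ServerLocality":  (25.0,  0.41889718087643735),
    "RackLocality":    (15.0,  0.0),
    "TableSkew":       (35.0,  10.849642781445127),
    "ReadRequest":     (5.0,   0.02832763401695891),
    "WriteRequest":    (5.0,   0.2960273848432453),
    "MemStoreSize":    (5.0,   0.08973896446650413),
    "StoreFile":       (5.0,   0.02370918640463713),
}

total_weight = sum(m for m, _ in costs.values())          # 602.0
weighted = {name: m * c for name, (m, c) in costs.items()}
overall = sum(weighted.values()) / total_weight
print(f"overall imbalance: {overall:.13f}")  # → 0.6737463520617, as logged

# Each function's share of the total weighted cost, largest first:
for name, w in sorted(weighted.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {w / sum(weighted.values()):6.1%}")
```

TableSkew alone accounts for roughly 94% of the weighted cost here, which
would explain why the balancer tolerates a large region count skew: moves
that reduce table skew dominate the cost function regardless of what they
do to the per-server region counts.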
>
>
> The balancer has default configuration with one exception,
> hbase.master.balancer.maxRitPercent was set to 0.001 because of the impact
> on availability.
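For reference, that one non-default setting would look like this in
hbase-site.xml (property name and value as stated above):

```xml
<!-- Limit regions-in-transition to 0.1% of all regions during balancing,
     to reduce the availability impact of balancer moves -->
<property>
  <name>hbase.master.balancer.maxRitPercent</name>
  <value>0.001</value>
</property>
```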
>
> I don’t understand why the balancer would allow such a skew in the region
> count, as (per the default configuration) this cost function has a very
> high weight.
>
> I did notice this warning:
>
> calculatedMaxSteps:126008000 for loadbalancer's stochastic walk is larger
> than maxSteps:1000000. Hence load balancing may not work well. Setting
> parameter "hbase.master.balancer.stochastic.runMaxSteps" to true can
> overcome this issue.(This config change does not require service restart)
>
> This might make the balancer perform worse than expected. But I’m under
> the impression that the balancer is eager and takes any randomly generated
> step that decreases the imbalance. With a default weight of 500, I would
> expect region count skew to initially dominate the balancing process.
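The warning's own suggestion can be tried without a restart: per the log
message, setting the named property to true lets the stochastic walk run
all calculated steps. A sketch of the hbase-site.xml change, with the
property name taken verbatim from the warning:

```xml
<!-- Let the stochastic walk run the full calculated number of steps
     instead of capping at maxSteps; per the warning, this change does
     not require a service restart -->
<property>
  <name>hbase.master.balancer.stochastic.runMaxSteps</name>
  <value>true</value>
</property>
```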
>
> At a later point in time, I tried to turn the balancer back on again; this
> time after creating an ideal distribution of regions. However, again, in
> just one round the balancer made a complete mess of the region count
> distribution:
>
> [image: Screenshot 2024-08-29 at 22.17.36.png]
>
>
>
>
> I would very much appreciate any insights or pointers into this matter.
>
> Best regards,
> Frens Jan
>
>
>
