> On Oct 20, 2021, at 1:49 PM, Josh Salomon <jsalo...@redhat.com> wrote:
> 
> but in the extreme case (some capacity on 1TB devices and some on 6TB 
> devices) the workload can't be balanced. I

It’s also super easy in such a scenario to

a) Have the larger drives not uniformly spread across failure domains, which 
can leave fractional capacity unusable because it can’t meet the replication 
policy.

b) Find the OSDs on the larger drives exceeding the configured max PG per OSD 
figure and refusing to activate, especially when maintenance, failures, or 
other topology changes precipitate recovery.  This has bitten me with a mix of 
1.x and 3.84 TB drives; I ended up raising the limit to 1000 while I juggled 
drives, nodes, and clusters so that a given cluster had uniformly sized drives. 
 At smaller scales of course that often won’t be an option.
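
For anyone who trips over the same limit: on recent releases the cap in 
question is mon_max_pg_per_osd (treat the exact option name and the 
config-database commands below as assumptions for your particular version; 
older clusters would set it in ceph.conf or via injectargs).  A minimal sketch 
of the temporary bump:

    # Raise the per-OSD PG cap (default is around 250) so the OSDs on the
    # big drives can activate and recovery can proceed:
    ceph config set global mon_max_pg_per_osd 1000
    # Once the drives are uniform again, return to the default:
    ceph config rm global mon_max_pg_per_osd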


> primary affinity can help with a single pool - with multiple pools with 
> different r/w ratio it becomes messy since pa is per device - it could help 
> more if it was per device/pool pair. Also it could be more useful if the 
> values were not 0-1 but 0-replica_count, but this is a usability issue, not 
> functional, it just makes the use more cumbersome. It was designed for a 
> different purpose though so this is not the "right" solution, the right 
> solution is primary balancer.   


Absolutely.  I had the luxury of clusters containing a single pool.  In the 
above instance, before refactoring the nodes/drives, we achieved an easy 15-20% 
increase in aggregate read performance by applying a very rough guesstimate of 
affinities based on OSD size.  The straw-draw factor does complicate deriving 
the *optimal* mapping of values, especially when topology changes.
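
To make “very rough” concrete: the idea was simply to drop the primary 
affinity of the big OSDs roughly in proportion to the capacity ratio, so they 
stop serving a disproportionate share of reads.  A sketch, with made-up OSD 
IDs and values:

    # Small (1.x TB) OSDs keep the default primary affinity of 1.0;
    # the 3.84 TB OSDs are made roughly half as likely to be chosen primary.
    ceph osd primary-affinity osd.7  1.0
    ceph osd primary-affinity osd.23 0.5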

I’ve seen someone set the CRUSH weight of larger/outlier OSDs artificially low 
to balance workload.  It all depends on the topology, future plans, and local 
priorities.
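
If anyone goes that route, the knob is the CRUSH weight itself (not the 
reweight override), and unlike primary affinity it moves data and strands some 
raw capacity.  Illustrative only, with a made-up OSD ID:

    # A 3.84 TB drive normally gets a CRUSH weight of ~3.49 (TiB); pinning it
    # lower sheds PGs -- and therefore workload -- onto the smaller drives.
    ceph osd crush reweight osd.23 2.5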

> I don't quite understand your "huge server" scenario, other than a basic 
> understanding that the balancer cannot do magic in some impossible cases.

I read it as describing a cluster where nodes / failure domains have 
significantly non-uniform CRUSH weights.  That’s suboptimal, but sometimes 
folks don’t have a choice, or they’re mid-migration between chassis 
generations.  Back around … Firefly I think it was, there were a couple of 
bugs that resulted in undesirable behavior in those scenarios.
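
A quick way to see how lopsided a tree is before the balancer even gets 
involved:

    # Per-host / per-rack WEIGHT totals show the imbalance; %USE and VAR hint
    # at where it will hurt.
    ceph osd df tree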

— aad

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
