Jonas, would you be interested in joining one of our performance meetings and presenting some of your work there? Seems like we can have a good discussion about further improvements to the balancer.
Thanks,
Neha

On Mon, Oct 25, 2021 at 11:39 AM Josh Salomon <jsalo...@redhat.com> wrote:
>
> Hi Jonas,
>
> I have some comments -
> IMHO you should swap 3 & 4 - if the PGs are not split optimally between
> the OSDs per pool, the primary balancing will not help, so I believe 4 is
> more important than 3.
> There is also a practical reason for this: after constraints 1, 2 and 3 are
> fulfilled, we can implement 4 just by changing the order of OSDs inside the
> PGs (at least for replica), which is a cheap operation since it is only an
> upmap operation and does not require any data movement.
>
> Regards,
> Josh
>
> On Mon, Oct 25, 2021 at 9:01 PM Jonas Jelten <jel...@in.tum.de> wrote:
>>
>> Hi Josh,
>>
>> Yes, there are many factors to optimize... which makes it quite hard to
>> achieve an optimal solution.
>>
>> I think we have to consider all these things, in ascending priority:
>>
>> * 1: Minimize distance to CRUSH (prefer the fewest upmaps, and remove
>>      upmap items if the balance is better without them)
>> * 2: Relocation of PGs in remapped state (since they are not fully moved
>>      yet, hence 'easier' to relocate)
>> * 3: Per-pool PG distribution, respecting OSD device size ->
>>      ideal_pg_count = osd_size * (pg_num / sum(possible_osd_sizes))
>> * 4: Primary/EC-N distribution (all OSDs have equal primary/EC-N counts,
>>      for workload balancing, not respecting device size (for HDD at
>>      least?), otherwise this is just 3)
>> * 5: Capacity balancing (all OSDs equally full)
>> * 6: And of course CRUSH constraints
>>
>> Beautiful optimization problem, which could be fed into a solver :)
>> My approach currently optimizes for 3, 5, 6, iteratively...
>>
>> > My only comment about what you did is that it should somehow work pool
>> > by pool and manage the +-1 globally.
>>
>> I think this is already implemented!
>> Since in each iteration I pick the "fullest" device first, it has to have
>> more pools (or data) than the other OSDs (e.g. through a +1), and we try
>> to migrate a PG off it.
>> And we only migrate a particular PG of a pool from such a source OSD if it
>> has > ideal_amount_for_pool (a float, hence we allow moving +1s or worse).
>> Same for a destination OSD: it is only selected if it has fewer PGs of
>> that pool than ideal_amount_for_pool (a float, hence allowing it to become
>> a +1 but not more).
>> So we eliminate global imbalance and respect equal PG distribution per
>> pool.
>>
>> I can try to hack in (optional) constraints so it also supports
>> optimization 4, but this works very much against the CRUSH placement
>> (because we'd have to ignore OSD size).
>> But since this is basically bypassing CRUSH weights, it could also be done
>> by placing all desired devices in a custom CRUSH hierarchy with
>> identically weighted buckets (even though that "wastes" storage).
>> Then we don't have to fight CRUSH and it's a 'simple' optimization 3
>> again.
>>
>> To achieve 2 and 1 it's just a re-ordering of candidate PGs, so in theory
>> it should be doable™.
>>
>> -- Jonas
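[Editorial illustration, not the actual balancer code: a minimal Python
sketch of the per-pool greedy step Jonas describes above. All input
structures (osd_sizes, pool_pg_num, pgs_per_osd) are made up for the
example, and CRUSH constraints and upmap bookkeeping are ignored.]

    # Hypothetical sketch of one iteration: pick the "fullest" OSD, then move
    # a PG of a pool only if the source is above its per-pool ideal share and
    # the destination is below its own ideal share.
    def ideal_pg_count(osd_size, pg_num, total_size):
        # priority 3: size-weighted ideal share of a pool's PGs for one OSD
        return osd_size * pg_num / total_size

    def balance_step(osd_sizes, pool_pg_num, pgs_per_osd):
        """osd_sizes: {osd: size}, pool_pg_num: {pool: pg_num},
        pgs_per_osd: {osd: {pool: count}}.  Returns one proposed move
        (pool, source_osd, dest_osd) or None."""
        total = sum(osd_sizes.values())
        # crude "fullness" proxy: PG count weighted by inverse device size
        fullness = lambda o: sum(pgs_per_osd.get(o, {}).values()) / osd_sizes[o]
        source = max(osd_sizes, key=fullness)
        for pool, count in sorted(pgs_per_osd.get(source, {}).items(),
                                  key=lambda kv: -kv[1]):
            ideal_src = ideal_pg_count(osd_sizes[source], pool_pg_num[pool], total)
            if count <= ideal_src:
                continue  # source is not over its ideal share for this pool
            for dest in sorted(osd_sizes, key=fullness):
                if dest == source:
                    continue
                ideal_dst = ideal_pg_count(osd_sizes[dest], pool_pg_num[pool], total)
                if pgs_per_osd.get(dest, {}).get(pool, 0) < ideal_dst:
                    return (pool, source, dest)  # move one PG of this pool
        return None  # nothing left to improve

[A real implementation would additionally prefer removing existing upmap
items (priority 1), prefer PGs already in a remapped state (priority 2), and
verify CRUSH constraints (priority 6), which this sketch leaves out.]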
>> On 25/10/2021 11.12, Josh Salomon wrote:
>> > Hi Jonas,
>> >
>> > I want to clarify my thoughts a bit (it may be long) regarding balancing
>> > in general.
>> >
>> > 1 - Balancing the capacity correctly is the top priority. This is
>> > because we all know that the system is as full as its fullest device,
>> > and as a storage system we can't allow large amounts of capacity to be
>> > wasted and unusable. This is a top functional requirement.
>> >
>> > 2 - Workload balancing is a performance requirement, and an important
>> > one, but we should not optimize workload at the expense of capacity, so
>> > the challenge is how to do both simultaneously. (Hint: it is not always
>> > possible, but when it is not possible the system performs below the
>> > aggregated performance of its devices.)
>> >
>> > Assumption 1: Per pool, the workload on a PG is linear with its
>> > capacity, which means either all PGs have the same workload (#PGs is a
>> > power of 2) or some PGs have exactly twice the load of the others. From
>> > now on I will assume the number of PGs is a power of 2, since the
>> > adjustments for the other case are pretty simple.
>> >
>> > Conclusion 1: Balancing capacity based on all the PGs in the system may
>> > cause workload imbalance - balancing capacity should be done on a pool
>> > by pool basis. (Assume 2 pools, H(ot) and C(old), with exactly the same
>> > settings (#PGs, capacity and protection scheme). If you balance per-PG
>> > capacity only, you can end up with one device holding all the PGs from
>> > the C pool and another device holding all the PGs from the H pool -
>> > this will cause the second device to be fully loaded while the first
>> > device is idle.)
>> >
>> > On the other hand, your point about the +-1 PGs when working on a pool
>> > by pool basis is correct and should be fixed (when working on a pool by
>> > pool basis).
>> >
>> > When all the devices are identical, the other thing we need to do for
>> > balancing the workload is balancing the primaries (on a pool by pool
>> > basis) - this means that when the capacity is balanced (every OSD has
>> > the same number of PGs per pool), every OSD also has the same number of
>> > primaries (+-1) per pool. This is mainly important for replicated
>> > pools; for EC pools it is important (but less critical) when working
>> > without "fast read" mode, and has no effect on EC pools with "fast
>> > read" mode enabled. (For EC pools we need to balance the N OSDs out of
>> > N+K and not only the primaries - think of replica-3 as a special case
>> > of EC with 1+2.)
>> >
>> > Now what happens when the devices are not identical?
>> > In the case of mixing technologies (SSD and HDD) - this is not
>> > recommended, but you can see some use cases for it in my SDC
>> > presentation
>> > <https://www.youtube.com/watch?v=dz53aH2XggE&feature=emb_imp_woyt> -
>> > without going into deep details, the easiest solution is to make all
>> > the faster devices (I mean much faster, such as HDD/SSD or SSD/PM)
>> > always primaries and all the slow devices never primaries (assuming you
>> > always keep at least one copy on a fast device). More on this in the
>> > presentation.
>> >
>> > The last case is when there are relatively minor performance
>> > differences between the devices (HDDs with different RPM rates, or
>> > devices of the same technology but different sizes, though not a huge
>> > difference - I believe that when one device has X times the capacity of
>> > the others, with X > replica-count, we can't balance any more, but I
>> > need to complete my calculations). In these cases, assuming we know
>> > something about the workload (R/W ratio), we can balance workload by
>> > giving more primaries to the faster or smaller devices relative to the
>> > slower or larger devices. This may not be optimal but can improve the
>> > performance; obviously it will not work for write-only workloads, but
>> > it improves performance more as the ratio of reads grows higher.
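[Editorial illustration of that last point. The formula below is not from
the thread; it is one simple model consistent with Josh's description,
assuming reads are served only by the primary, writes hit every replica,
and PG counts are already balanced so the write load is equal on all
devices. "speeds" and the example numbers are hypothetical.]

    # Toy model: what fraction of primaries each device should get so that
    # (write load + its share of read load) / speed is equal everywhere.
    def primary_shares(speeds, read_ratio):
        """speeds: relative device throughput; read_ratio r in (0..1].
        Each of the n devices carries write load (1-r)/n regardless of
        primary placement; device i additionally carries read load p_i * r.
        Equalizing ((1-r)/n + p_i*r) / speed_i gives
            p_i = (speed_i/sum(speeds) - (1-r)/n) / r
        A negative p_i means that device is too slow to receive any
        primaries at all - the case where balancing breaks down."""
        n = len(speeds)
        total = sum(speeds)
        raw = [(s / total - (1 - read_ratio) / n) / read_ratio for s in speeds]
        clamped = [max(p, 0.0) for p in raw]
        norm = sum(clamped)
        return [p / norm for p in clamped]

    # Example: two equal devices plus one 50% faster, workload is 75% reads:
    print(primary_shares([1.0, 1.0, 1.5], 0.75))
    # -> roughly [0.27, 0.27, 0.46]: the faster device gets more primaries.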
>> >
>> > So to summarize - we first need to balance capacity as perfectly as
>> > possible, but if we care about performance we should make sure that the
>> > capacity of each pool is balanced almost perfectly. Then we change the
>> > primaries based on the devices we have and on the workloads per pool,
>> > in order to split the workload evenly among the devices. When we have
>> > large variance among the devices in the same pool, perfect workload
>> > balancing may not be achievable, but we can try to find an optimal one
>> > for the configuration and workload we have.
>> >
>> > Having said all that - I really appreciate your work, and I went over
>> > it briefly. My only comment about what you did is that it should
>> > somehow work pool by pool and manage the +-1 globally.
>> >
>> > Regards,
>> > Josh
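[Editorial illustration of the identical-devices case Josh summarizes
(equal primaries per OSD per pool once PG counts are balanced). This is not
Ceph code and it ignores how the choice would be applied to the cluster;
Josh's point was that for replicated pools it only requires reordering the
OSDs inside each PG, with no data movement. The acting_sets structure and
the example PGs are made up.]

    # Greedy primary assignment for one pool: for each PG, pick as primary
    # the acting-set member that currently holds the fewest primaries, so
    # per-OSD primary counts end up within +-1 of each other.
    from collections import defaultdict

    def assign_primaries(acting_sets):
        """acting_sets: {pg_id: [osd, osd, ...]} for a single pool.
        Returns {pg_id: chosen_primary_osd}."""
        primaries_per_osd = defaultdict(int)
        choice = {}
        for pg, osds in sorted(acting_sets.items()):
            best = min(osds, key=lambda o: primaries_per_osd[o])
            choice[pg] = best
            primaries_per_osd[best] += 1
        return choice

    # Example: 4 PGs of a replica-3 pool spread over 4 OSDs.
    pgs = {"1.0": [0, 1, 2], "1.1": [1, 2, 3],
           "1.2": [2, 3, 0], "1.3": [3, 0, 1]}
    print(assign_primaries(pgs))  # each OSD ends up primary for exactly one PG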