Jonas, would you be interested in joining one of our performance
meetings and presenting some of your work there? Seems like we can
have a good discussion about further improvements to the balancer.

Thanks,
Neha

On Mon, Oct 25, 2021 at 11:39 AM Josh Salomon <jsalo...@redhat.com> wrote:
>
> Hi Jonas,
>
> I have some comments -
> IMHO you should swap 3 & 4: if the PGs are not split optimally between
> the OSDs per pool, the primary balancing will not help, so I believe 4 is
> more important than 3.
> There is also a practical reason for this: once constraints 1, 2 and 3 are
> fulfilled, we can implement 4 just by changing the order of OSDs inside the
> PGs (at least for replicated pools), which is a cheap operation since it is
> only an upmap change and does not require any data movement.
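>
> A minimal sketch of what that reordering could look like, driving the
> existing "ceph osd pg-upmap" command from Python (the pgid and OSD ids are
> made-up examples; this only illustrates the idea and is not tested code):
>
>     import subprocess
>
>     def make_primary(pgid, up_set, new_primary):
>         """Re-emit the same OSD set with new_primary listed first.
>         For a replicated pool the first OSD of the up set acts as primary,
>         and since every OSD in the set already holds the data, no backfill
>         should be triggered."""
>         assert new_primary in up_set
>         reordered = [new_primary] + [o for o in up_set if o != new_primary]
>         subprocess.run(["ceph", "osd", "pg-upmap", pgid]
>                        + [str(o) for o in reordered], check=True)
>
>     # e.g. PG 3.1f is up on [4, 7, 2] with osd.4 primary; make osd.7 primary:
>     # make_primary("3.1f", [4, 7, 2], 7)
>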
> Regards,
>
> Josh
>
>
> On Mon, Oct 25, 2021 at 9:01 PM Jonas Jelten <jel...@in.tum.de> wrote:
>>
>> Hi Josh,
>>
>> yes, there are many factors to optimize... which makes it kinda hard to
>> achieve an optimal solution.
>>
>> I think we have to consider all these things, in ascending priority:
>>
>> * 1: Minimize distance to CRUSH (prefer fewest upmaps, and remove upmap 
>> items if balance is better)
>> * 2: Relocation of PGs in remapped state (since they are not fully moved 
>> yet, hence 'easier' to relocate)
>> * 3: Per-pool PG distribution, respecting OSD device size ->
>> ideal_pg_count = osd_size * (pg_num / sum(possible_osd_sizes))
>> (see the sketch after this list)
>> * 4: Primary/EC-N distribution (all osds have equal primary/EC-N counts, for 
>> workload balancing, not respecting device size (for hdd at least?), else 
>> this is just 3)
>> * 5: Capacity balancing (all osds equally full)
>> * 6: And of course CRUSH constraints
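>>
>> As a rough sketch of the per-pool target in 3 (plain Python; the names are
>> just illustrative, not the balancer's actual API):
>>
>>     def ideal_pg_count(osd_size, pg_num, possible_osd_sizes):
>>         """Target share of this pool's PGs on one OSD, proportional to size.
>>         possible_osd_sizes: sizes of all OSDs the pool can map to per CRUSH.
>>         Returns a float; actual placements then land at +-1 around it."""
>>         return osd_size * (pg_num / sum(possible_osd_sizes))
>>
>>     # e.g. an 8 TiB OSD in a 100 TiB root, pool with pg_num=1024:
>>     # ideal_pg_count(8, 1024, [8]*10 + [2]*10)  ->  ~81.9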
>>
>> Beautiful optimization problem, which could be fed into a solver :)
>> My approach currently optimizes for 3, 5, 6, iteratively...
>>
>> > My only comment about what you did is that it should somehow work pool by 
>> > pool and manage the +-1 globally.
>>
>> I think this is already implemented!
>> Since in each iteration I pick the "fullest" device first, it has to have
>> more PGs (or data) than other OSDs (e.g. through a +1), and we try to
>> migrate a PG off it.
>> And we only migrate a particular PG of a pool from such a source OSD if that
>> OSD has >ideal_amount_for_pool PGs of the pool (a float, hence we allow
>> moving +1s or worse).
>> Same for a destination OSD: it's only selected if it has fewer than
>> ideal_amount_for_pool PGs of that pool (a float, hence allowing it to become
>> a +1 but not more).
>> So we eliminate global imbalance, and respect equal PG distribution per pool.
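>>
>> In rough Python-ish pseudocode (not the actual tool's code; the dicts are
>> assumed data structures), one iteration of that selection looks like:
>>
>>     def pick_move(osd_fill, pgs_on_osd, ideal, pool_of):
>>         """One greedy step: try to move a PG off the fullest OSD.
>>         osd_fill: {osd: utilization}, pgs_on_osd: {osd: set of pgids},
>>         ideal: {(osd, pool): float ideal PG count}, pool_of: {pgid: pool}."""
>>         def count(osd, pool):
>>             return sum(pool_of[p] == pool for p in pgs_on_osd[osd])
>>         src = max(osd_fill, key=osd_fill.get)           # fullest device first
>>         for pg in pgs_on_osd[src]:
>>             pool = pool_of[pg]
>>             if count(src, pool) <= ideal[(src, pool)]:  # source not above ideal
>>                 continue
>>             for dst in sorted(osd_fill, key=osd_fill.get):  # emptiest first
>>                 if dst != src and count(dst, pool) < ideal[(dst, pool)]:
>>                     return pg, src, dst                 # becomes an upmap item
>>         return None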
>>
>> I can try to hack in (optional) constraints so it also supports optimization 
>> 4, but this works very much against the CRUSH placement (because we'd have 
>> to ignore OSD size).
>> But since this is basically bypassing CRUSH weights, it could also be done
>> by placing all desired devices in a custom CRUSH hierarchy with identically
>> weighted buckets (even though this "wastes" storage).
>> Then we don't have to fight CRUSH and it's a 'simple' optimization 3 again.
>>
>> To achieve 2 and 1 it's just a re-ordering of candidate PGs.
>> So in theory it should be doableā„¢.
>>
>> -- Jonas
>>
>>
>> On 25/10/2021 11.12, Josh Salomon wrote:
>> > Hi Jonas,
>> >
>> > I want to clarify a bit my thoughts (it may be long) regarding balancing 
>> > in general.
>> >
>> > 1 - Balancing the capacity correctly is the top priority. This is because
>> > we all know that the system is as full as its fullest device, and as a
>> > storage system we can't allow a large amount of capacity to be wasted and
>> > unusable. This is a top functional requirement.
>> > 2 - Workload balancing is a performance requirement, and an important one,
>> > but we should not optimize workload at the expense of capacity, so the
>> > challenge is how to do both simultaneously. (Hint: it is not always
>> > possible, and when it is not, the system performs below the aggregated
>> > performance of its devices.)
>> >
>> > Assumption 1: Per pool, the workload on a PG is linear with its capacity,
>> > which means either all PGs have the same workload (#PGs is a power of 2)
>> > or some PGs have exactly twice the load of the others. From now on I will
>> > assume the number of PGs is a power of 2, since the adjustments for the
>> > other case are pretty simple.
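>> >
>> > As a small illustration of the "twice the load" case (assuming uniform
>> > hashing across the pre-split PGs; the helper below is only my own sketch):
>> >
>> >     def pg_relative_weights(pg_num):
>> >         """Relative capacity/load per PG when pg_num is not a power of 2.
>> >         Going e.g. from 8 to 12 PGs splits 4 parents into 8 half-size
>> >         children, so 4 PGs keep twice the objects of the other 8."""
>> >         base = 1 << (pg_num.bit_length() - 1)   # largest power of 2 <= pg_num
>> >         children = 2 * (pg_num - base)          # half-size PGs from splitting
>> >         if children == 0:                       # power of two: all PGs equal
>> >             return [1] * pg_num
>> >         return [2] * (pg_num - children) + [1] * children
>> >
>> >     # pg_relative_weights(12) -> [2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1]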
>> >
>> > Conclusion 1: Balancing capacity based on all the PGs in the system may
>> > cause workload imbalance - balancing capacity should be done on a pool by
>> > pool basis. (Assume 2 pools, H(ot) and C(old), with exactly the same
>> > settings (#PGs, capacity and protection scheme). If you balance by PG
>> > capacity only, you can end up with one device holding all the PGs from the
>> > C pool and another holding all the PGs from the H pool -
>> > this will leave the second device fully loaded while the first
>> > device is idle.)
>> >
>> > On the other hand, your point about the +-1 PGs when working on a
>> > pool-by-pool basis is correct and should be fixed.
>> >
>> > When all the devices are identical, the other thing we need to do for
>> > balancing the workload is balancing the primaries (on a pool by pool
>> > basis) - this means that when the capacity is balanced (every OSD has the
>> > same number of PGs per pool), every OSD also has the same number of
>> > primaries (+-1) per pool. This is mainly important for replicated pools;
>> > for EC pools it is important (but less critical) when working without
>> > "fast read" mode, and it has no effect on EC pools with "fast read" mode
>> > enabled. (For EC pools we need to balance the N OSDs out of N+K and not
>> > only the primaries - think of replica-3 as a special case of EC with 1+2.)
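>> >
>> > As a rough illustration of that per-pool check (the data shapes are my
>> > assumptions, not a real mgr interface, and it assumes the first N
>> > positions of the up set are the N shards to balance for EC):
>> >
>> >     from collections import Counter
>> >
>> >     def primary_counts(pg_up_sets, pool_of, ec_n=None):
>> >         """Count primaries per (pool, osd); 'balanced' means the counts
>> >         within one pool differ by at most 1.
>> >         pg_up_sets: {pgid: ordered up set}, pool_of: {pgid: pool}.
>> >         For EC, pass ec_n=N to count the first N positions per OSD
>> >         instead of only the first one."""
>> >         counts = Counter()
>> >         for pgid, up in pg_up_sets.items():
>> >             for osd in up[:ec_n or 1]:
>> >                 counts[(pool_of[pgid], osd)] += 1
>> >         return counts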
>> >
>> > Now what happens when the devices are not identical?
>> > In case of mixing technologies (SSD and HDD) - this is not recommended,
>> > but you can see some use cases for it in my SDC presentation
>> > <https://www.youtube.com/watch?v=dz53aH2XggE&feature=emb_imp_woyt> -
>> > without going into deep details, the easiest solution is to make all the
>> > faster devices (I mean much faster, such as HDD/SSD or SSD/PM) always
>> > primaries and all the slow devices never primaries
>> > (assuming you always keep at least one copy on a fast device). More on
>> > this in the presentation.
>> >
>> > The last case is when there are relatively minor performance differences
>> > between the devices (HDDs with different RPM rates, or devices with the
>> > same technology but not the same size, though not a huge difference - I
>> > believe that when one device has X times the capacity of the others, with
>> > X > replica-count, we can't balance any more, but I need to complete my
>> > calculations). In these cases, assuming we know
>> > something about the workload (R/W ratio), we can balance the workload by
>> > giving more primaries to the faster or smaller devices relative to the
>> > slower or larger devices. This may not be optimal but can improve
>> > performance; obviously it will not work for write-only workloads, but the
>> > improvement grows as the ratio of reads gets higher.
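>> >
>> > One possible back-of-the-envelope way to turn that into numbers (my own
>> > toy heuristic, not Josh's exact calculation; the per-device 'perf' values
>> > and the blending rule are assumptions):
>> >
>> >     def primary_targets(perf, read_ratio, num_pgs):
>> >         """Toy target primary counts for one replicated pool.
>> >         perf: {osd: relative performance}. Only the read share of the
>> >         work follows the primary, so with read_ratio == 0 this collapses
>> >         to an even split and the benefit grows with the read ratio."""
>> >         total = sum(perf.values())
>> >         even = num_pgs / len(perf)
>> >         return {o: (1 - read_ratio) * even
>> >                    + read_ratio * num_pgs * p / total
>> >                 for o, p in perf.items()}
>> >
>> >     # two devices, one twice as fast, 70% reads, 100 PGs in the pool:
>> >     # primary_targets({"osd.0": 2.0, "osd.1": 1.0}, 0.7, 100)
>> >     #   -> {"osd.0": ~61.7, "osd.1": ~38.3}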
>> >
>> > So to summarize - we need first to balance capacity as perfectly as
>> > possible, but if we care about performance we should make sure that the
>> > capacity of each pool is balanced almost perfectly. Then we change the
>> > primaries based on the devices we have and on the workloads per pool in
>> > order to split the workload evenly among the devices. When we have a large
>> > variance among the devices in the same pool,
>> > perfect workload balancing may not be achievable, but we can try to find
>> > an optimal one for the configuration and workload we have.
>> >
>> > Having said all that - I really appreciate your work, and I went briefly 
>> > over it. My only comment about what you did is that it should somehow work 
>> > pool by pool and manage the +-1 globally.
>> >
>> > Regards,
>> >
>> > Josh
>>