Thanks for digging this out. I thought I remembered exactly this method (I 
don't know from where), but I couldn't find it in the documentation and 
started doubting it. Yes, this would be very useful information to add to the 
documentation, and it also confirms that your simpler setup with just a 
specialized CRUSH rule will work exactly as intended and is stable long-term.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: 胡 玮文 <huw...@outlook.com>
Sent: 26 October 2020 17:19
To: Frank Schilder
Cc: Anthony D'Atri; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated 
pool

> On 26 October 2020, at 15:43, Frank Schilder <fr...@dtu.dk> wrote:
>
> 
>> I’ve never seen anything that implies that lead OSDs within an acting set 
>> are a function of CRUSH rule ordering.
>
> This is actually a good question. I believed that I had seen/heard that 
> somewhere, but I might be wrong.
>
> Looking at the definition of a PG, it states that a PG is an ordered set of 
> OSDs (IDs) and that the first up OSD will be the primary. In other words, it 
> seems that the lowest OSD ID is decisive. If the SSDs were deployed before 
> the HDDs, they have the smallest IDs and will, hence, be preferred as 
> primary OSDs.

I don’t think this is correct. From my experiments with the previously 
mentioned CRUSH rule, no matter what the IDs of the SSD OSDs are, the primary 
OSDs are always SSDs.

I also had a look at the code. If I understand it correctly (a small sketch 
illustrating this follows below the links):

* If the default primary affinity is unchanged, the primary-affinity logic is 
skipped entirely, and the primary is simply the first OSD returned by the 
CRUSH algorithm [1].

* The order of OSDs returned by CRUSH still matters once you change the 
primary affinity. The affinity is the probability that a test succeeds; the 
OSDs are tested in order, so the first OSD has the highest probability of 
becoming primary. [2]
  * If an OSD has primary affinity = 1.0, its test always succeeds, and no 
OSD after it can ever become primary.
  * Suppose CRUSH returned 3 OSDs, each with primary affinity 0.5. Then the 
2nd OSD has a probability of 0.25 of becoming primary and the 3rd a 
probability of 0.125; otherwise, the 1st will be primary.
  * If no test succeeds (say, all OSDs have affinity 0), the 1st OSD becomes 
primary as a fallback.

[1]: 
https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
[2]: 
https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561
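
To make this concrete, here is a minimal Python sketch of the selection 
procedure as I understand it. It is only a toy model, not the actual Ceph 
code: the real implementation in OSDMap.cc uses a deterministic hash of the 
PG seed and the OSD id rather than a random number generator, so the choice 
is stable for a given PG, and as noted above the whole step is skipped while 
all affinities are at the default 1.0.

import random

def choose_primary(up_osds, affinity):
    """Toy model of primary selection (not the real Ceph code).

    up_osds:  OSD ids in the order CRUSH returned them
    affinity: dict mapping osd id -> primary affinity in [0.0, 1.0]
    """
    for osd in up_osds:
        a = affinity.get(osd, 1.0)
        # Each OSD is "tested" in order; the test succeeds with
        # probability equal to its primary affinity, and the first
        # success becomes the primary.
        if a >= 1.0 or random.random() < a:
            return osd
    # If every test fails (e.g. all affinities are 0), fall back to
    # the first up OSD.
    return up_osds[0]

# Example from above: three OSDs with affinity 0.5 each. The first one
# ends up primary ~62.5% of the time, the second ~25%, the third ~12.5%.
counts = {0: 0, 1: 0, 2: 0}
for _ in range(100_000):
    counts[choose_primary([0, 1, 2], {0: 0.5, 1: 0.5, 2: 0.5})] += 1
print(counts)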

So, setting the primary affinity of all SSD OSDs to 1.0 should be sufficient 
for them to become the primaries in my case.
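
For example (osd.12 is just a placeholder for one of the SSD OSD ids; note 
that 1.0 is also the default value):

  ceph osd primary-affinity osd.12 1.0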

Do you think I should contribute this to the documentation?

> This, however, is not a sustainable situation. Any addition of OSDs will mess 
> this up and the distribution scheme will fail in the future. A way out seems 
> to be:
>
> - subdivide your HDD storage using device classes:
> * define a device class for HDDs with primary affinity = 0: for example, pick 
> 5 HDDs and change their device class to hdd_np (for "no primary")
> * set the primary affinity of these HDD OSDs to 0
> * modify your CRUSH rule to use "step take default class hdd_np"
> * this will create a pool with primaries on SSD and balanced storage 
> distribution between SSD and HDD
> * all-HDD pools are deployed as usual on class hdd
> * when increasing capacity, one needs to take care of adding disks to the 
> hdd_np class and setting their primary affinity to 0
> * somewhat increased admin effort, but a fully working solution
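
For reference, those steps could look roughly like this on the command line 
(osd.10 to osd.14 are just placeholders for the five chosen HDDs):

  # move the chosen HDDs into the new device class
  ceph osd crush rm-device-class osd.10 osd.11 osd.12 osd.13 osd.14
  ceph osd crush set-device-class hdd_np osd.10 osd.11 osd.12 osd.13 osd.14

  # make sure these OSDs are never selected as primary
  for i in 10 11 12 13 14; do ceph osd primary-affinity osd.$i 0; done

and in the CRUSH rule, the HDD step takes "default class hdd_np" instead of 
"default class hdd".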
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
