Yes. This is a limitation of the CRUSH algorithm, in my view. To guard 
against 2 host failures, I’m going to use 4 replicas: 1 on SSD and 3 on 
HDD. This should work as intended, right? Because at least I can ensure the 
3 HDD replicas are on different hosts.
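
For reference, a rough sketch of how I plan to wire this up (the pool name 
and PG count below are placeholders, not final):

       ceph osd pool create cephfs_data 128 128 replicated mixed_replicated_rule
       ceph osd pool set cephfs_data size 4      # 1 SSD replica + 3 HDD replicas
       ceph osd pool set cephfs_data min_size 2  # keep serving I/O with 2 hosts down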

> On Oct 25, 2020, at 20:04, Alexander E. Patrakov <patra...@gmail.com> wrote:
> 
> On Sun, Oct 25, 2020 at 12:11 PM huw...@outlook.com <huw...@outlook.com> 
> wrote:
>> 
>> Hi all,
>> 
>> We are planning a new pool to store our dataset using CephFS. The data are 
>> almost read-only (but not guaranteed to be) and consist of a lot of small 
>> files. Each node in our cluster has 1 × 1 TB SSD and 2 × 6 TB HDDs, and we 
>> will deploy about 10 such nodes. We aim for the highest read throughput.
>> 
>> If we just use a replicated pool of size 3 on SSD, we should get the best 
>> performance; however, that leaves us only 1/3 of the SSD capacity as usable 
>> space. And EC pools are not friendly to such a small-object read workload, 
>> I think.
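>> 
>> (For comparison, the all-SSD baseline we are weighing this against would be 
>> roughly the following; the rule and pool names are just placeholders:)
>> 
>>        ceph osd crush rule create-replicated replicated_ssd default host ssd
>>        ceph osd pool create cephfs_data 128 128 replicated replicated_ssd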
>> 
>> Now I’m evaluating a mixed SSD and HDD replication strategy. Ideally, I want 
>> 3 data replicas, each on a different host (the failure domain): 1 of them on 
>> SSD, the other 2 on HDD, with every read request normally directed to the 
>> SSD. So, as long as every SSD OSD is up, I’d expect the same read throughput 
>> as with an all-SSD deployment.
>> 
>> I’ve read the documentation and done some tests. Here is the CRUSH rule I’m 
>> testing with:
>> 
>> rule mixed_replicated_rule {
>>        id 3
>>        type replicated
>>        min_size 1
>>        max_size 10
>>        # first pass: pick 1 OSD of class ssd, using host as the failure domain
>>        step take default class ssd
>>        step chooseleaf firstn 1 type host
>>        step emit
>>        # second pass: fill the remaining (pool size - 1) replicas from hdd hosts
>>        step take default class hdd
>>        step chooseleaf firstn -1 type host
>>        step emit
>> }
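>> 
>> To sanity-check the placements before pointing a pool at this rule, I test 
>> it offline with crushtool, roughly like this (the file names are just what 
>> I use locally):
>> 
>>        ceph osd getcrushmap -o crushmap.bin
>>        crushtool -d crushmap.bin -o crushmap.txt
>>        # add the rule above to crushmap.txt, then recompile and simulate
>>        crushtool -c crushmap.txt -o crushmap.new
>>        crushtool -i crushmap.new --test --rule 3 --num-rep 4 --show-mappings
>>        ceph osd setcrushmap -i crushmap.new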
>> 
>> Now I have the following conclusions, but I’m not very sure (see also the 
>> check sketched after the list):
>> * The first OSD produced by CRUSH will be the primary OSD (at least if I 
>> don’t change the “primary affinity”). So the above rule is guaranteed to 
>> map an SSD OSD as the primary of each PG, and every read request will be 
>> served from the SSD as long as it is up.
>> * It is currently not possible to enforce that the SSD and HDD OSDs are 
>> chosen from different hosts. So, if I want to ensure data availability even 
>> if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means setting 
>> the replication size to 4, instead of the ideal value 3, on the pool using 
>> the above CRUSH rule.
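>> 
>> (To double-check which OSD actually serves reads on a live pool, I look at 
>> the acting sets; the PG and OSD ids below are just placeholders:)
>> 
>>        # UP_PRIMARY / ACTING_PRIMARY show the primary OSD of each PG
>>        ceph pg dump pgs_brief | head
>>        ceph pg map 3.1f
>>        # primary affinity can bias primaries away from an OSD if ever needed
>>        ceph osd primary-affinity osd.7 0.5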
>> 
>> Am I correct about the above statements? How would this work in your 
>> experience? Thanks.
> 
> This works (i.e. guards against host failures) only if you have
> strictly separate sets of hosts that have SSDs and hosts that have HDDs.
> That is, no host should have both; otherwise there is a chance that one HDD
> OSD and one SSD OSD from that same host will be picked, because the two
> take/chooseleaf/emit blocks are evaluated independently and cannot see each
> other's choices.
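> 
> A quick way to check is the per-class shadow hierarchy that CRUSH builds 
> (just a sketch of the check, nothing cluster-specific):
> 
>        ceph osd crush tree --show-shadow
>        # any host that appears under both default~ssd and default~hdd can
>        # lose an SSD and an HDD replica of the same PG in a single failure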
> 
> -- 
> Alexander E. Patrakov
> CV: http://pc.cd/PLz7
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
