Yes. To my mind, this is a limitation of the CRUSH algorithm. To guard against 2 host failures, I'm going to use 4 replicas: 1 on SSD and 3 on HDD. This should work as intended, right? Because at least I can ensure the 3 HDDs are on different hosts.
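For reference, here is a rough sketch of how I plan to sanity-check this before putting data on the pool: simulate the rule with crushtool and then point the pool at it with size 4. The pool name and file names are placeholders; rule id 3 refers to the mixed_replicated_rule quoted below.

    # dump and decompile the current CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # simulate the rule with 4 replicas and print the resulting OSD sets;
    # cross-check the OSD ids against "ceph osd tree" to confirm that the
    # 3 HDD OSDs land on 3 different hosts
    crushtool -i crushmap.bin --test --rule 3 --num-rep 4 --show-mappings

    # apply the rule and replica count to the pool
    ceph osd pool set <pool> crush_rule mixed_replicated_rule
    ceph osd pool set <pool> size 4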
> On Oct 25, 2020, at 20:04, Alexander E. Patrakov <patra...@gmail.com> wrote:
>
> On Sun, Oct 25, 2020 at 12:11 PM huw...@outlook.com <huw...@outlook.com> wrote:
>>
>> Hi all,
>>
>> We are planning for a new pool to store our dataset using CephFS. These data
>> are almost read-only (but not guaranteed) and consist of a lot of small
>> files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDD, and we will
>> deploy about 10 such nodes. We are aiming for the highest read throughput.
>>
>> If we just use a replicated pool of size 3 on SSD, we should get the best
>> performance; however, that leaves us with only 1/3 of the SSD space usable.
>> And EC pools are not friendly to such a small-object read workload, I think.
>>
>> Now I'm evaluating a mixed SSD and HDD replication strategy. Ideally, I want
>> 3 data replicas, each on a different host (failure domain): 1 of them on SSD,
>> the other 2 on HDD, with every read request normally directed to the SSD.
>> So, if every SSD OSD is up, I'd expect the same read throughput as with the
>> all-SSD deployment.
>>
>> I've read the documents and did some tests. Here is the crush rule I'm
>> testing with:
>>
>> rule mixed_replicated_rule {
>>     id 3
>>     type replicated
>>     min_size 1
>>     max_size 10
>>     step take default class ssd
>>     step chooseleaf firstn 1 type host
>>     step emit
>>     step take default class hdd
>>     step chooseleaf firstn -1 type host
>>     step emit
>> }
>>
>> Now I have the following conclusions, but I'm not very sure:
>> * The first OSD produced by CRUSH will be the primary OSD (at least if I
>> don't change the "primary affinity"). So, the above rule is guaranteed to
>> map an SSD OSD as the primary in each PG, and every read request will read
>> from the SSD if it is up.
>> * It is currently not possible to enforce that the SSD and HDD OSDs are
>> chosen from different hosts. So, if I want to ensure data availability even
>> if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means setting
>> the replication size to 4, instead of the ideal value 3, on the pool using
>> the above crush rule.
>>
>> Am I correct about the above statements? How would this work in your
>> experience? Thanks.
>
> This works (i.e. guards against host failures) only if you have
> strictly separate sets of hosts that have SSDs and hosts that have HDDs,
> i.e. there should be no host that has both; otherwise there is a
> chance that one HDD and one SSD from that host will be picked.
>
> --
> Alexander E. Patrakov
> CV: http://pc.cd/PLz7