> On 26 Oct 2020, at 15:43, Frank Schilder <fr...@dtu.dk> wrote:
>
>> I’ve never seen anything that implies that lead OSDs within an acting set
>> are a function of CRUSH rule ordering.
>
> This is actually a good question. I believed that I had seen/heard that
> somewhere, but I might be wrong.
>
> Looking at the definition of a PG, it states that a PG is an ordered set of
> OSD (IDs) and the first up OSD will be the primary. In other words, it seems
> that the lowest OSD ID is decisive. If the SSDs were deployed before the
> HDDs, they have the smallest IDs and, hence, will be preferred as primary
> OSDs.
I don’t think this is correct. In my experiments with the previously mentioned
CRUSH rule, no matter what the IDs of the SSD OSDs are, the primary OSDs are
always SSDs. I also had a look at the code; if I understand it correctly:

* If the default primary affinity is not changed, the primary-affinity logic
  is skipped and the primary is the first OSD returned by the CRUSH
  algorithm [1].
* The order of OSDs returned by CRUSH still matters if you change the primary
  affinity. The affinity is the probability that a test succeeds. The first
  OSD is tested first and therefore has the highest probability of becoming
  primary. [2]
* If any OSD has primary affinity = 1.0, its test always succeeds, and any
  OSD after it will never be primary.
* Suppose CRUSH returns 3 OSDs, each with primary affinity 0.5. Then the 2nd
  OSD has a probability of 0.25 of becoming primary and the 3rd a probability
  of 0.125; otherwise the 1st will be primary.
* If no test succeeds (say all OSDs have affinity 0), the 1st OSD becomes
  primary as a fallback.

[1]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
[2]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561

So, setting the primary affinity of all SSD OSDs to 1.0 should be sufficient
to make them the primaries in my case. Do you think I should contribute this
to the documentation?

> This, however, is not a sustainable situation. Any addition of OSDs will mess
> this up and the distribution scheme will fail in the future. A way out seems
> to be:
>
> - subdivide your HDD storage using device classes:
>   * define a device class for HDDs with primary affinity=0, for example, pick
>     5 HDDs and change their device class to hdd_np (for no primary)
>   * set the primary affinity of these HDD OSDs to 0
>   * modify your crush rule to use "step take default class hdd_np"
>   * this will create a pool with primaries on SSD and balanced storage
>     distribution between SSD and HDD
>   * all-HDD pools deployed as usual on class hdd
>   * when increasing capacity, one needs to take care of adding disks to the
>     hdd_np class and set their primary affinity to 0
>   * somewhat increased admin effort, but fully working solution
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Anthony D'Atri <anthony.da...@gmail.com>
> Sent: 25 October 2020 17:07:15
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
>
>> I'm not entirely sure if primary on SSD will actually make the read happen
>> on SSD.
>
> My understanding is that by default reads always happen from the lead OSD in
> the acting set. Octopus seems to (finally) have an option to spread the reads
> around, which IIRC defaults to false.
>
> I’ve never seen anything that implies that lead OSDs within an acting set are
> a function of CRUSH rule ordering. I’m not asserting that they aren’t, though,
> but I’m … skeptical.
>
> Setting primary affinity would do the job, and you’d want to have cron
> continually update it across the cluster to react to topology changes. I was
> told of this strategy back in 2014, but haven’t personally seen it
> implemented.
>
> That said, HDDs are more of a bottleneck for writes than reads and just might
> be fine for your application. Tiny reads are going to limit you to some
> degree regardless of drive type, and you do mention throughput, not IOPS.
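
(Coming back to the primary-affinity selection I described at the top of this
mail: here is a rough Python sketch of the logic as I read it from OSDMap.cc.
It is only an illustration, not Ceph code; choose_primary and the affinity
dict are made-up names.)

import random

def choose_primary(crush_order, affinity):
    """Pick the primary from the CRUSH-ordered list of OSD ids.

    affinity maps OSD id -> primary affinity in [0.0, 1.0]. Each OSD is
    tested in CRUSH order and becomes primary with probability equal to
    its affinity; if every test fails, the first OSD is the fallback.
    """
    for osd in crush_order:
        if random.random() < affinity.get(osd, 1.0):
            return osd
    return crush_order[0]  # fallback: no test succeeded

# Example: 3 OSDs, each with affinity 0.5. Over many trials the 1st is
# primary ~62.5% of the time (0.5 plus the 0.125 fallback), the 2nd ~25%
# and the 3rd ~12.5%, matching the numbers above.
counts = {1: 0, 2: 0, 3: 0}
for _ in range(100000):
    counts[choose_primary([1, 2, 3], {1: 0.5, 2: 0.5, 3: 0.5})] += 1
print(counts)

With every SSD OSD at affinity 1.0 (and the SSD emitted first by the CRUSH
rule), the first test always succeeds, so the SSD should always end up as the
primary.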
>
> I must echo Frank’s notes about capacity too. Ceph can do a lot of things,
> but that doesn’t mean something exotic is necessarily the best choice.
> You’re concerned about 3R only yielding 1/3 of raw capacity if using an
> all-SSD cluster, but the architecture you propose limits you anyway because
> of drive size. Consider also chassis, CPU, RAM, RU, switch port costs as
> well, and the cost of you fussing over an exotic solution instead of the
> hundreds of other things in your backlog.
>
> And your cluster as described is *tiny*. Honestly I’d suggest considering
> one of these alternatives:
>
> * Ditch the HDDs, use QLC flash. The emerging EDSFF drives are really
>   promising for replacing HDDs for density in this kind of application. You
>   might even consider ARM if IOPS aren’t a concern.
> * An NVMeoF solution
>
> Cache tiers are “deprecated”, but then so are custom cluster names. Neither
> appears
>
>> For EC pools there is an option "fast_read"
>> (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values),
>> which states that a read will return as soon as the first k shards have
>> arrived. The default is to wait for all k+m shards (all replicas). This
>> option is not available for replicated pools.
>>
>> Now, I am not sure whether this option is unavailable for replicated pools
>> because the read will always be served by the acting primary, or whether it
>> currently waits for all replicas. In the latter case, reads will wait for
>> the slowest device.
>>
>> I'm not sure if I interpret this correctly. I think you should test the
>> setup with HDD only and with SSD+HDD to see if read speed improves. Note
>> that write speed will always depend on the slowest device.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Frank Schilder <fr...@dtu.dk>
>> Sent: 25 October 2020 15:03:16
>> To: 胡 玮文; Alexander E. Patrakov
>> Cc: ceph-users@ceph.io
>> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
>>
>> A cache pool might be an alternative, heavily depending on how much data is
>> hot. However, then you will have much less SSD capacity available, because
>> it also requires replication.
>>
>> Looking at the setup, you have only 10*1T = 10T SSD but 20*6T = 120T HDD,
>> so you will probably run short of SSD capacity. Or, looking at it the other
>> way around, with copies on 1 SSD + 3 HDD, you will only be able to use
>> about 30T out of the 120T HDD capacity.
>>
>> With this replication, the usable storage will be 10T and the raw usage
>> will be 10T SSD and 30T HDD. If you can't do anything else with the HDD
>> space, you will need more SSDs. If your servers have more free disk slots,
>> you can add SSDs over time until you have at least 40T of SSD capacity to
>> balance SSD and HDD capacity.
>>
>> Personally, I think the 1 SSD + 3 HDD scheme is a good option compared with
>> a cache pool. You have the data security of 3-times replication and, if
>> everything is up, need only 1 copy on the SSD acting as cache, which means
>> that you have 3 times the cache capacity.
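
(Side note on the capacity arithmetic quoted above, just to make it concrete:
a quick back-of-the-envelope check in Python, assuming the 1 SSD + 3 HDD copy
split and the drive counts from the quoted setup; the variable names are mine.)

ssd_raw = 10 * 1.0             # 10 nodes x 1 TB SSD = 10 TB raw
hdd_raw = 20 * 6.0             # 20 x 6 TB HDD = 120 TB raw
ssd_copies, hdd_copies = 1, 3  # one copy on SSD, three copies on HDD

# Usable capacity is limited by whichever tier fills up first.
usable = min(ssd_raw / ssd_copies, hdd_raw / hdd_copies)
print(usable)                  # 10.0 TB usable
print(usable * ssd_copies)     # 10.0 TB of SSD raw consumed
print(usable * hdd_copies)     # 30.0 TB of HDD raw consumed

# SSD raw needed to make full use of all 120 TB of HDD:
print(hdd_raw / hdd_copies)    # 40.0 TB, matching the "at least 40T SSD" above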
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: 胡 玮文 <huw...@outlook.com>
>> Sent: 25 October 2020 13:40:55
>> To: Alexander E. Patrakov
>> Cc: ceph-users@ceph.io
>> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
>>
>> Yes. This is a limitation of the CRUSH algorithm, in my mind. In order to
>> guard against 2 host failures, I’m going to use 4 replicas, 1 on SSD and 3
>> on HDD. This will work as intended, right? Because at least I can ensure
>> the 3 HDDs are from different hosts.
>>
>> On 25 Oct 2020, at 20:04, Alexander E. Patrakov <patra...@gmail.com> wrote:
>>
>>> On Sun, Oct 25, 2020 at 12:11 PM huw...@outlook.com <huw...@outlook.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We are planning a new pool to store our dataset using CephFS. These data
>>>> are almost read-only (but not guaranteed) and consist of a lot of small
>>>> files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDD, and we
>>>> will deploy about 10 such nodes. We aim at getting the highest read
>>>> throughput.
>>>>
>>>> If we just use a replicated pool of size 3 on SSD, we should get the best
>>>> performance; however, that only leaves us 1/3 of the SSD space as usable.
>>>> And EC pools are not friendly to such a small-object read workload, I
>>>> think.
>>>>
>>>> Now I’m evaluating a mixed SSD and HDD replication strategy. Ideally, I
>>>> want 3 replicas of the data, each on a different host (failure domain),
>>>> 1 of them on SSD and the other 2 on HDD, and normally every read request
>>>> is directed to SSD. So, if every SSD OSD is up, I’d expect the same read
>>>> throughput as with an all-SSD deployment.
>>>>
>>>> I’ve read the documentation and did some tests. Here is the crush rule
>>>> I’m testing with:
>>>>
>>>> rule mixed_replicated_rule {
>>>>     id 3
>>>>     type replicated
>>>>     min_size 1
>>>>     max_size 10
>>>>     step take default class ssd
>>>>     step chooseleaf firstn 1 type host
>>>>     step emit
>>>>     step take default class hdd
>>>>     step chooseleaf firstn -1 type host
>>>>     step emit
>>>> }
>>>>
>>>> Now I have the following conclusions, but I’m not very sure:
>>>>
>>>> * The first OSD produced by CRUSH will be the primary OSD (at least if I
>>>>   don’t change the “primary affinity”). So, the above rule is guaranteed
>>>>   to map an SSD OSD as the primary in each PG, and every read request
>>>>   will read from SSD if it is up.
>>>> * It is currently not possible to enforce that the SSD and HDD OSDs are
>>>>   chosen from different hosts. So, if I want to ensure data availability
>>>>   even if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means
>>>>   setting the replication size to 4, instead of the ideal value 3, on the
>>>>   pool using the above crush rule.
>>>>
>>>> Am I correct about the above statements? How would this work in your
>>>> experience? Thanks.
>>>
>>> This works (i.e. guards against host failures) only if you have
>>> strictly separate sets of hosts that have SSDs and that have HDDs.
>>> I.e., there should be no host that has both, otherwise there is a
>>> chance that one hdd and one ssd from that host will be picked.
>>>
>>> --
>>> Alexander E. Patrakov
>>> CV: http://pc.cd/PLz7
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io