> On 26 Oct 2020, at 15:43, Frank Schilder <fr...@dtu.dk> wrote:
>
>> I’ve never seen anything that implies that lead OSDs within an acting set
>> are a function of CRUSH rule ordering.
>
> This is actually a good question. I believed that I had seen/heard that
> somewhere, but I might be wrong.
>
> Looking at the definition of a PG, it states that a PG is an ordered set of
> OSD (IDs) and the first up OSD will be the primary. In other words, it seems
> that the lowest OSD ID is decisive. If the SSDs were deployed before the
> HDDs, they have the smallest IDs and, hence, will be preferred as primary
> OSDs.
I don’t think this is correct. In my experiments with the previously mentioned
CRUSH rule, no matter what the IDs of the SSD OSDs are, the primary OSDs are
always SSDs. I also had a look at the code; if I understand it correctly:

* If the default primary affinity is not changed, the primary-affinity logic
  is skipped and the primary is the first OSD returned by the CRUSH
  algorithm [1].
* The order of OSDs returned by CRUSH still matters if you change the primary
  affinity. The affinity is the probability that a test succeeds. The first
  OSD is tested first and therefore has the highest probability of becoming
  primary. [2]
* If any OSD has primary affinity = 1.0, its test always succeeds, and any
  OSD after it will never be primary.
* Suppose CRUSH returns 3 OSDs, each with primary affinity 0.5. Then the 2nd
  OSD has a probability of 0.25 of becoming primary and the 3rd a probability
  of 0.125; otherwise the 1st will be primary.
* If no test succeeds (say all OSDs have affinity 0), the 1st OSD becomes
  primary as a fallback.

[1]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2456
[2]: https://github.com/ceph/ceph/blob/6dc03460ffa1315e91ea21b1125200d3d5a01253/src/osd/OSDMap.cc#L2561

So, setting the primary affinity of all SSD OSDs to 1.0 should be sufficient
to make them the primaries in my case. Do you think I should contribute this
to the documentation?

> This, however, is not a sustainable situation. Any addition of OSDs will mess
> this up and the distribution scheme will fail in the future. A way out seems
> to be:
>
> - subdivide your HDD storage using device classes:
>   * define a device class for HDDs with primary affinity=0, for example, pick
>     5 HDDs and change their device class to hdd_np (for no primary)
>   * set the primary affinity of these HDD OSDs to 0
>   * modify your crush rule to use "step take default class hdd_np"
>   * this will create a pool with primaries on SSD and balanced storage
>     distribution between SSD and HDD
>   * all-HDD pools deployed as usual on class hdd
>   * when increasing capacity, one needs to take care of adding disks to the
>     hdd_np class and set their primary affinity to 0
>   * somewhat increased admin effort, but fully working solution
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Anthony D'Atri <anthony.da...@gmail.com>
> Sent: 25 October 2020 17:07:15
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
>
>> I'm not entirely sure if primary on SSD will actually make the read happen
>> on SSD.
>
> My understanding is that by default reads always happen from the lead OSD in
> the acting set. Octopus seems to (finally) have an option to spread the reads
> around, which IIRC defaults to false.
>
> I’ve never seen anything that implies that lead OSDs within an acting set are
> a function of CRUSH rule ordering. I’m not asserting that they aren’t, though,
> but I’m … skeptical.
>
> Setting primary affinity would do the job, and you’d want to have cron
> continually update it across the cluster to react to topology changes. I was
> told of this strategy back in 2014, but haven’t personally seen it
> implemented.
>
> That said, HDDs are more of a bottleneck for writes than reads and just might
> be fine for your application. Tiny reads are going to limit you to some
> degree regardless of drive type, and you do mention throughput, not IOPS.
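
(Coming back to the primary-affinity selection I described at the top of this
mail: here is a rough Python sketch of the logic as I read it from OSDMap.cc.
It is only an illustration, not Ceph code; choose_primary and the affinity
dict are made-up names.)

import random

def choose_primary(crush_order, affinity):
    """Pick the primary from the CRUSH-ordered list of OSD ids.

    affinity maps OSD id -> primary affinity in [0.0, 1.0]. Each OSD is
    tested in CRUSH order and becomes primary with probability equal to
    its affinity; if every test fails, the first OSD is the fallback.
    """
    for osd in crush_order:
        if random.random() < affinity.get(osd, 1.0):
            return osd
    return crush_order[0]  # fallback: no test succeeded

# Example: 3 OSDs, each with affinity 0.5. Over many trials the 1st is
# primary ~62.5% of the time (0.5 plus the 0.125 fallback), the 2nd ~25%
# and the 3rd ~12.5%, matching the numbers above.
counts = {1: 0, 2: 0, 3: 0}
for _ in range(100000):
    counts[choose_primary([1, 2, 3], {1: 0.5, 2: 0.5, 3: 0.5})] += 1
print(counts)

With every SSD OSD at affinity 1.0 (and the SSD emitted first by the CRUSH
rule), the first test always succeeds, so the SSD should always end up as the
primary.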
>
> I must echo Frank’s notes about capacity too. Ceph can do a lot of things,
> but that doesn’t mean something exotic is necessarily the best choice.
> You’re concerned about 3R only yielding 1/3 of raw capacity if using an
> all-SSD cluster, but the architecture you propose limits you anyway because
> of drive size. Consider also chassis, CPU, RAM, RU, switch port costs as
> well, and the cost of you fussing over an exotic solution instead of the
> hundreds of other things in your backlog.
>
> And your cluster as described is *tiny*. Honestly I’d suggest considering
> one of these alternatives:
>
> * Ditch the HDDs, use QLC flash. The emerging EDSFF drives are really
>   promising for replacing HDDs for density in this kind of application. You
>   might even consider ARM if IOPS aren’t a concern.
> * An NVMeoF solution
>
> Cache tiers are “deprecated”, but then so are custom cluster names. Neither
> appears
>
>> For EC pools there is an option "fast_read"
>> (https://docs.ceph.com/en/latest/rados/operations/pools/?highlight=fast_read#set-pool-values),
>> which states that a read will return as soon as the first k shards have
>> arrived. The default is to wait for all k+m shards (all replicas). This
>> option is not available for replicated pools.
>>
>> Now, I am not sure whether this option is unavailable for replicated pools
>> because the read will always be served by the acting primary, or whether it
>> currently waits for all replicas. In the latter case, reads will wait for
>> the slowest device.
>>
>> I'm not sure if I interpret this correctly. I think you should test the
>> setup with HDD only and with SSD+HDD to see if read speed improves. Note
>> that write speed will always depend on the slowest device.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Frank Schilder <fr...@dtu.dk>
>> Sent: 25 October 2020 15:03:16
>> To: 胡 玮文; Alexander E. Patrakov
>> Cc: ceph-users@ceph.io
>> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
>>
>> A cache pool might be an alternative, heavily depending on how much data is
>> hot. However, then you will have much less SSD capacity available, because
>> it also requires replication.
>>
>> Looking at the setup, you have only 10*1T = 10T SSD but 20*6T = 120T HDD,
>> so you will probably run short of SSD capacity. Or, looking at it the other
>> way around, with copies on 1 SSD + 3 HDD, you will only be able to use
>> about 30T out of the 120T HDD capacity.
>>
>> With this replication, the usable storage will be 10T and the raw usage
>> will be 10T SSD and 30T HDD. If you can't do anything else with the HDD
>> space, you will need more SSDs. If your servers have more free disk slots,
>> you can add SSDs over time until you have at least 40T of SSD capacity to
>> balance SSD and HDD capacity.
>>
>> Personally, I think the 1 SSD + 3 HDD scheme is a good option compared with
>> a cache pool. You have the data security of 3-times replication and, if
>> everything is up, need only 1 copy on the SSD acting as cache, which means
>> that you have 3 times the cache capacity.
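
(Side note on the capacity arithmetic quoted above, just to make it concrete:
a quick back-of-the-envelope check in Python, assuming the 1 SSD + 3 HDD copy
split and the drive counts from the quoted setup; the variable names are mine.)

ssd_raw = 10 * 1.0             # 10 nodes x 1 TB SSD = 10 TB raw
hdd_raw = 20 * 6.0             # 20 x 6 TB HDD = 120 TB raw
ssd_copies, hdd_copies = 1, 3  # one copy on SSD, three copies on HDD

# Usable capacity is limited by whichever tier fills up first.
usable = min(ssd_raw / ssd_copies, hdd_raw / hdd_copies)
print(usable)                  # 10.0 TB usable
print(usable * ssd_copies)     # 10.0 TB of SSD raw consumed
print(usable * hdd_copies)     # 30.0 TB of HDD raw consumed

# SSD raw needed to make full use of all 120 TB of HDD:
print(hdd_raw / hdd_copies)    # 40.0 TB, matching the "at least 40T SSD" above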
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: 胡 玮文 <huw...@outlook.com>
>> Sent: 25 October 2020 13:40:55
>> To: Alexander E. Patrakov
>> Cc: ceph-users@ceph.io
>> Subject: [ceph-users] Re: The feasibility of mixed SSD and HDD replicated pool
>>
>> Yes. This is a limitation of the CRUSH algorithm, in my mind. In order to
>> guard against 2 host failures, I’m going to use 4 replicas, 1 on SSD and 3
>> on HDD. This will work as intended, right? Because at least I can ensure
>> the 3 HDDs are from different hosts.
>>
>> On 25 Oct 2020, at 20:04, Alexander E. Patrakov <patra...@gmail.com> wrote:
>>
>>> On Sun, Oct 25, 2020 at 12:11 PM huw...@outlook.com <huw...@outlook.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We are planning a new pool to store our dataset using CephFS. These data
>>>> are almost read-only (but not guaranteed) and consist of a lot of small
>>>> files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDD, and we
>>>> will deploy about 10 such nodes. We aim at getting the highest read
>>>> throughput.
>>>>
>>>> If we just use a replicated pool of size 3 on SSD, we should get the best
>>>> performance; however, that only leaves us 1/3 of the SSD space as usable.
>>>> And EC pools are not friendly to such a small-object read workload, I
>>>> think.
>>>>
>>>> Now I’m evaluating a mixed SSD and HDD replication strategy. Ideally, I
>>>> want 3 replicas of the data, each on a different host (failure domain),
>>>> 1 of them on SSD and the other 2 on HDD, and normally every read request
>>>> is directed to SSD. So, if every SSD OSD is up, I’d expect the same read
>>>> throughput as with an all-SSD deployment.
>>>>
>>>> I’ve read the documentation and did some tests. Here is the crush rule
>>>> I’m testing with:
>>>>
>>>> rule mixed_replicated_rule {
>>>>     id 3
>>>>     type replicated
>>>>     min_size 1
>>>>     max_size 10
>>>>     step take default class ssd
>>>>     step chooseleaf firstn 1 type host
>>>>     step emit
>>>>     step take default class hdd
>>>>     step chooseleaf firstn -1 type host
>>>>     step emit
>>>> }
>>>>
>>>> Now I have the following conclusions, but I’m not very sure:
>>>>
>>>> * The first OSD produced by CRUSH will be the primary OSD (at least if I
>>>>   don’t change the “primary affinity”). So, the above rule is guaranteed
>>>>   to map an SSD OSD as the primary in each PG, and every read request
>>>>   will read from SSD if it is up.
>>>> * It is currently not possible to enforce that the SSD and HDD OSDs are
>>>>   chosen from different hosts. So, if I want to ensure data availability
>>>>   even if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means
>>>>   setting the replication size to 4, instead of the ideal value 3, on the
>>>>   pool using the above crush rule.
>>>>
>>>> Am I correct about the above statements? How would this work in your
>>>> experience? Thanks.
>>>
>>> This works (i.e. guards against host failures) only if you have
>>> strictly separate sets of hosts that have SSDs and that have HDDs.
>>> I.e., there should be no host that has both, otherwise there is a
>>> chance that one hdd and one ssd from that host will be picked.
>>>
>>> --
>>> Alexander E. Patrakov
>>> CV: http://pc.cd/PLz7
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io