I was able to get this working with the crushmap in my last post! Together with changing primary affinity on the slow HDDs, I now have the intended behavior. Very happy; performance is excellent.
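For reference, primary affinity is set per OSD from the CLI; a minimal sketch, assuming osd.10 and osd.11 are two of the slow HDD OSDs (the IDs are placeholders, and older releases may also need "mon osd allow primary affinity = true" on the mons before this is accepted):

    # keep the slow HDD OSDs from being chosen as primary, so reads land on the NVMe copy
    ceph osd primary-affinity osd.10 0
    ceph osd primary-affinity osd.11 0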
One thing was a little weird, though: I had to manually change the weight of each hostgroup so that they are in the same ballpark. If they were too far apart, ceph couldn't properly allocate 3 buckets for each PG, and some ended up in state "remapped" or "degraded". When I changed the weights to similar values the problem went away (the crush rule selects 3 out of 3 hostgroups anyway, so weight shouldn't even be a consideration there). Perhaps that is a bug? One way of doing that reweighting is sketched at the bottom of this mail, below the quoted thread.

/Peter

On 10/8/2017 3:22 PM, David Turner wrote:
> That's correct. It doesn't matter how many copies of the data you have in each datacenter. The mons control the maps and you should be good as long as you have 1 mon per DC. You should test this to see how the recovery goes, but there shouldn't be a problem.
>
> On Sat, Oct 7, 2017, 6:10 PM Дробышевский, Владимир <v...@itgorod.ru> wrote:
>
> 2017-10-08 2:02 GMT+05:00 Peter Linder <peter.lin...@fiberdirekt.se>:
>
>> Then, I believe, the next best configuration would be to set size for this pool to 4. It would choose an NVMe as the primary OSD, and then choose an HDD from each DC for the secondary copies. This will guarantee that a copy of the data goes into each DC and you will have 2 copies in other DCs away from the primary NVMe copy. It wastes a copy of all of the data in the pool, but that's on the much cheaper HDD storage and can probably be considered an acceptable loss for the sake of having the primary OSD on NVMe drives.
>
> I have considered this, and it should of course work when it works, so to say, but what if 1 datacenter is isolated while running? We would be left with 2 running copies on each side for all PGs, with no way of knowing what gets written where. In the end, data would be destroyed due to the split brain. Even being able to enforce quorum where the SSD is would mean a single point of failure.
>
> In case you have one mon per DC, all operations in the isolated DC will be frozen, so I believe you would not lose data.
>
>> On Sat, Oct 7, 2017 at 3:36 PM Peter Linder <peter.lin...@fiberdirekt.se> wrote:
>>
>> On 10/7/2017 8:08 PM, David Turner wrote:
>>
>>> Just to make sure you understand that the reads will happen on the primary osd for the PG and not the nearest osd, meaning that reads will go between the datacenters. Also that each write will not ack until all 3 writes happen, adding latency to both the writes and the reads.
>>
>> Yes, I understand this. It is actually fine; the datacenters have been selected so that they are about 10-20km apart. This yields around a 0.1 - 0.2ms round trip time due to the speed of light being too low. Nevertheless, latency due to the network shouldn't be a problem and it's all 40G (dedicated) TRILL network for the moment.
>>
>> I just want to be able to select 1 SSD and 2 HDDs, all spread out. I can do that, but one of the HDDs ends up in the same datacenter, probably because I'm using the "take" command 2 times (resets selecting buckets?).
>>
>>> On Sat, Oct 7, 2017, 1:48 PM Peter Linder <peter.lin...@fiberdirekt.se> wrote:
>>>
>>> On 10/7/2017 7:36 PM, Дробышевский, Владимир wrote:
>>>
>>>> Hello!
>>>>
>>>> 2017-10-07 19:12 GMT+05:00 Peter Linder <peter.lin...@fiberdirekt.se>:
>>>>
>>>> The idea is to select an nvme osd, and then select the rest from hdd osds in different datacenters (see crush map below for hierarchy).
>>>>
>>>> It's a little aside from the question, but why do you want to mix SSDs and HDDs in the same pool? Do you have a read-intensive workload and are you going to use primary-affinity to get all reads from nvme?
>>>
>>> Yes, this is pretty much the idea: getting the performance from NVMe reads, while still maintaining triple redundancy and a reasonable cost.
>>>
>>>> --
>>>> Regards,
>>>> Vladimir
>
> --
> Regards,
> Vladimir
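PS. Regarding "manually change the weight of each hostgroup" above: one way to adjust bucket weights by hand is to edit the decompiled crushmap; a minimal sketch (the file names are arbitrary, and the actual weight values obviously depend on your map):

    ceph osd getcrushmap -o crushmap.bin       # export the current compiled crushmap
    crushtool -d crushmap.bin -o crushmap.txt  # decompile it to editable text
    # edit crushmap.txt so the item weights of the hostgroup buckets are in the same ballpark
    crushtool -c crushmap.txt -o crushmap.new  # recompile
    ceph osd setcrushmap -i crushmap.new       # inject the new map (this triggers rebalancing)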