>> Note: "step chose" was selected by creating the crush rule with ceph on pool >> creation. If the default should be "step choseleaf" (with OSD buckets), then >> the automatic crush rule generation in ceph ought to be fixed for EC >> profiles. > Interesting. Which exact command was used to create the pool?
I can reproduce. By default with "host" failure domain, the resulting rule
will "chooseleaf indep host". But if you create an ec profile with
crush-failure-domain=osd, then the resulting rules will "choose indep osd".
We should open a tracker for this. Either "choose indep osd" and
"chooseleaf indep osd" should give the same result, or the pool creation
should use "chooseleaf indep osd" in this case.

-- dan

On Tue, Aug 30, 2022 at 1:43 PM Dan van der Ster <dvand...@gmail.com> wrote:
>
> > Note: "step chose" was selected by creating the crush rule with ceph on
> > pool creation. If the default should be "step choseleaf" (with OSD
> > buckets), then the automatic crush rule generation in ceph ought to be
> > fixed for EC profiles.
>
> Interesting. Which exact command was used to create the pool?
>
> > These experiments indicate that there is a very weird behaviour
> > implemented, I would actually call this a serious bug.
>
> I don't think this is a bug. Each of your attempts with different
> _tries values changed the max iterations of the various loops in
> crush. Since this takes crush on different "paths" to find a valid
> OSD, the output is going to be different.
>
> > The resulting mapping should be independent of the maximum number of trials
>
> No this is wrong.. the "tunables" change the mapping. The important
> thing is that every node + client in the cluster agrees on the mapping
> -- and indeed since they all use the same tunables, including the
> values for *_tries, they will all agree on the up/acting set.
>
> Cheers, Dan
>
> On Tue, Aug 30, 2022 at 1:10 PM Frank Schilder <fr...@dtu.dk> wrote:
> >
> > Hi Dan,
> >
> > thanks a lot for looking into this. I can't entirely reproduce your
> > results. Maybe we are using different versions and there was a change?
> > I'm testing with the octopus 15.2.16 image: quay.io/ceph/ceph:v15.2.16.
> >
> > Note: "step chose" was selected by creating the crush rule with ceph on
> > pool creation. If the default should be "step choseleaf" (with OSD
> > buckets), then the automatic crush rule generation in ceph ought to be
> > fixed for EC profiles.
> >
> > My results with the same experiments as you did, I can partly confirm and
> > partly I see oddness that I would consider a bug (reported at the very end):
> >
> > rule fs-data {
> >         id 1
> >         type erasure
> >         min_size 3
> >         max_size 6
> >         step take default
> >         step choose indep 0 type osd
> >         step emit
> > }
> >
> > # osdmaptool --test-map-pg 4.1c osdmap.bin
> > osdmaptool: osdmap file 'osdmap.bin'
> > parsed '4.1c' -> 4.1c
> > 4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6) acting ([6,1,4,5,3,1], p6)
> >
> > rule fs-data {
> >         id 1
> >         type erasure
> >         min_size 3
> >         max_size 6
> >         step take default
> >         step chooseleaf indep 0 type osd
> >         step emit
> > }
> >
> > # osdmaptool --test-map-pg 4.1c osdmap.bin
> > osdmaptool: osdmap file 'osdmap.bin'
> > parsed '4.1c' -> 4.1c
> > 4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting ([6,1,4,5,3,1], p6)
> >
> > So far, so good. Now the oddness:
> >
> > rule fs-data {
> >         id 1
> >         type erasure
> >         min_size 3
> >         max_size 6
> >         step set_chooseleaf_tries 5
> >         step set_choose_tries 100
> >         step take default
> >         step chooseleaf indep 0 type osd
> >         step emit
> > }
> >
> > # osdmaptool --test-map-pg 4.1c osdmap.bin
> > osdmaptool: osdmap file 'osdmap.bin'
> > parsed '4.1c' -> 4.1c
> > 4.1c raw ([6,1,4,5,3,8], p6) up ([6,1,4,5,3,8], p6) acting ([6,1,4,5,3,1], p6)
> >
> > How can this be different??
> > I thought crush returns on the first successful mapping. This ought to be
> > identical to the previous one. It gets even more weird:
> >
> > rule fs-data {
> >         id 1
> >         type erasure
> >         min_size 3
> >         max_size 6
> >         step set_chooseleaf_tries 50
> >         step set_choose_tries 200
> >         step take default
> >         step chooseleaf indep 0 type osd
> >         step emit
> > }
> >
> > # osdmaptool --test-map-pg 4.1c osdmap.bin
> > osdmaptool: osdmap file 'osdmap.bin'
> > parsed '4.1c' -> 4.1c
> > 4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting ([6,1,4,5,3,1], p6)
> >
> > Whaaaaat???? We increase the maximum number of trials for searching and we
> > end up with an invalid mapping??
> >
> > These experiments indicate that there is a very weird behaviour
> > implemented, I would actually call this a serious bug. The resulting
> > mapping should be independent of the maximum number of trials (if I
> > understood the crush algorithm correctly). In any case, a valid mapping
> > should never be replaced in favour of an invalid one (containing a down+out
> > OSD).
> >
> > For now there is a happy end on my test cluster:
> >
> > # ceph pg dump pgs_brief | grep 4.1c
> > dumped pgs_brief
> > 4.1c    active+remapped+backfilling    [6,1,4,5,3,8]    6    [6,1,4,5,3,1]    6
> >
> > Please look into the extremely odd behaviour reported above. I'm quite
> > confident that this is unintended if not dangerous behaviour and should be
> > corrected. I'm willing to file a tracker item with the data above. I'm
> > actually wondering if this might be related to
> > https://tracker.ceph.com/issues/56995 .
> >
> > Thanks for tracking this down and best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Dan van der Ster <dvand...@gmail.com>
> > Sent: 30 August 2022 12:16:37
> > To: Frank Schilder
> > Cc: ceph-users@ceph.io
> > Subject: Re: [ceph-users] Bug in crush algorithm? 1 PG with the same OSD twice.
> >
> > BTW, the defaults for _tries seem to work too:
> >
> >
> > # diff -u crush.txt crush.txt2
> > --- crush.txt   2022-08-30 11:27:41.941836374 +0200
> > +++ crush.txt2  2022-08-30 11:55:45.601891010 +0200
> > @@ -90,10 +90,10 @@
> >          type erasure
> >          min_size 3
> >          max_size 6
> > -        step set_chooseleaf_tries 50
> > -        step set_choose_tries 200
> > +        step set_chooseleaf_tries 5
> > +        step set_choose_tries 100
> >          step take default
> > -        step choose indep 0 type osd
> > +        step chooseleaf indep 0 type osd
> >          step emit
> >  }
> >
> > # osdmaptool --test-map-pg 4.1c osdmap.bin2
> > osdmaptool: osdmap file 'osdmap.bin2'
> > parsed '4.1c' -> 4.1c
> > 4.1c raw ([6,1,4,5,3,8], p6) up ([6,1,4,5,3,8], p6) acting ([6,1,4,5,3,1], p6)
> >
> >
> > -- dan
> >
> > On Tue, Aug 30, 2022 at 11:50 AM Dan van der Ster <dvand...@gmail.com> wrote:
> > >
> > > BTW, I vaguely recalled seeing this before. Yup, found it:
> > > https://tracker.ceph.com/issues/55169
> > >
> > > On Tue, Aug 30, 2022 at 11:46 AM Dan van der Ster <dvand...@gmail.com> wrote:
> > > >
> > > > > 2. osd.7 is destroyed but still "up" in the osdmap.
> > > >
> > > > Oops, you can ignore this point -- this was an observation I had while
> > > > playing with the osdmap -- your osdmap.bin has osd.7 down correctly.
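
(Side note, in case anyone wants to verify this against their own map: the
state recorded in an exported osdmap can be checked with something like the
following -- an untested sketch, assuming the same osdmap.bin as above:

  # osdmaptool osdmap.bin --dump plain | grep 'osd\.7 '

For an OSD that is destroyed and down, the state flags at the end of that
line should not include "up".)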
> > > > > > > > In case you're curious, here was what confused me: > > > > > > > > # osdmaptool osdmap.bin2 --mark-up-in --mark-out 7 --dump plain > > > > osd.7 up out weight 0 up_from 3846 up_thru 3853 down_at 3855 > > > > last_clean_interval [0,0) > > > > [v2:10.41.24.15:6810/1915819,v1:10.41.24.15:6811/1915819] > > > > [v2:192.168.0.15:6808/1915819,v1:192.168.0.15:6809/1915819] > > > > destroyed,exists,up > > > > > > > > Just ignore this ... > > > > > > > > > > > > > > > > -- dan > > > > > > > > On Tue, Aug 30, 2022 at 11:41 AM Dan van der Ster <dvand...@gmail.com> > > > > wrote: > > > > > > > > > > Hi Frank, > > > > > > > > > > I suspect this is a combination of issues. > > > > > 1. You have "choose" instead of "chooseleaf" in rule 1. > > > > > 2. osd.7 is destroyed but still "up" in the osdmap. > > > > > 3. The _tries settings in rule 1 are not helping. > > > > > > > > > > Here are my tests: > > > > > > > > > > # osdmaptool --test-map-pg 4.1c osdmap.bin > > > > > osdmaptool: osdmap file 'osdmap.bin' > > > > > parsed '4.1c' -> 4.1c > > > > > 4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6) > > > > > acting ([6,1,4,5,3,1], p6) > > > > > > > > > > ^^ This is what you observe now. > > > > > > > > > > # diff -u crush.txt crush.txt2 > > > > > --- crush.txt 2022-08-30 11:27:41.941836374 +0200 > > > > > +++ crush.txt2 2022-08-30 11:31:29.631491424 +0200 > > > > > @@ -93,7 +93,7 @@ > > > > > step set_chooseleaf_tries 50 > > > > > step set_choose_tries 200 > > > > > step take default > > > > > - step choose indep 0 type osd > > > > > + step chooseleaf indep 0 type osd > > > > > step emit > > > > > } > > > > > # crushtool -c crush.txt2 -o crush.map2 > > > > > # cp osdmap.bin osdmap.bin2 > > > > > # osdmaptool --import-crush crush.map2 osdmap.bin2 > > > > > osdmaptool: osdmap file 'osdmap.bin2' > > > > > osdmaptool: imported 1166 byte crush map from crush.map2 > > > > > osdmaptool: writing epoch 4990 to osdmap.bin2 > > > > > # osdmaptool --test-map-pg 4.1c osdmap.bin2 > > > > > osdmaptool: osdmap file 'osdmap.bin2' > > > > > parsed '4.1c' -> 4.1c > > > > > 4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting > > > > > ([6,1,4,5,3,1], p6) > > > > > > > > > > ^^ The mapping is now "correct" in that it doesn't duplicate the > > > > > mapping to osd.1. However it tries to use osd.7 which is destroyed but > > > > > up. > > > > > > > > > > You might be able to fix that by fully marking osd.7 out. > > > > > I can also get a good mapping by removing the *_tries settings from > > > > > rule 1: > > > > > > > > > > # diff -u crush.txt crush.txt2 > > > > > --- crush.txt 2022-08-30 11:27:41.941836374 +0200 > > > > > +++ crush.txt2 2022-08-30 11:38:14.068102835 +0200 > > > > > @@ -90,10 +90,8 @@ > > > > > type erasure > > > > > min_size 3 > > > > > max_size 6 > > > > > - step set_chooseleaf_tries 50 > > > > > - step set_choose_tries 200 > > > > > step take default > > > > > - step choose indep 0 type osd > > > > > + step chooseleaf indep 0 type osd > > > > > step emit > > > > > } > > > > > ... 
> > > > > # osdmaptool --test-map-pg 4.1c osdmap.bin2 > > > > > osdmaptool: osdmap file 'osdmap.bin2' > > > > > parsed '4.1c' -> 4.1c > > > > > 4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting > > > > > ([6,1,4,5,3,1], p6) > > > > > > > > > > Note that I didn't need to adjust the reweights: > > > > > > > > > > # osdmaptool osdmap.bin2 --tree > > > > > osdmaptool: osdmap file 'osdmap.bin2' > > > > > ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF > > > > > -1 2.44798 root default > > > > > -7 0.81599 host tceph-01 > > > > > 0 hdd 0.27199 osd.0 up 0.87999 1.00000 > > > > > 3 hdd 0.27199 osd.3 up 0.98000 1.00000 > > > > > 6 hdd 0.27199 osd.6 up 0.92999 1.00000 > > > > > -3 0.81599 host tceph-02 > > > > > 2 hdd 0.27199 osd.2 up 0.95999 1.00000 > > > > > 4 hdd 0.27199 osd.4 up 0.89999 1.00000 > > > > > 8 hdd 0.27199 osd.8 up 0.89999 1.00000 > > > > > -5 0.81599 host tceph-03 > > > > > 1 hdd 0.27199 osd.1 up 0.89999 1.00000 > > > > > 5 hdd 0.27199 osd.5 up 1.00000 1.00000 > > > > > 7 hdd 0.27199 osd.7 destroyed 0 1.00000 > > > > > > > > > > > > > > > Does this work in real life? > > > > > > > > > > Cheers, Dan > > > > > > > > > > > > > > > On Mon, Aug 29, 2022 at 7:38 PM Frank Schilder <fr...@dtu.dk> wrote: > > > > > > > > > > > > Hi Dan, > > > > > > > > > > > > please find attached (only 7K, so I hope it goes through). > > > > > > md5sum=1504652f1b95802a9f2fe3725bf1336e > > > > > > > > > > > > I was playing a bit around with the crush map and found out the > > > > > > following: > > > > > > > > > > > > 1) Setting all re-weights to 1 does produce valid mappings. > > > > > > However, it will lead to large imbalances and is impractical in > > > > > > operations. > > > > > > > > > > > > 2) Doing something as simple/stupid as the following also results > > > > > > in valid mappings without having to change the weights: > > > > > > > > > > > > rule fs-data { > > > > > > id 1 > > > > > > type erasure > > > > > > min_size 3 > > > > > > max_size 6 > > > > > > step set_chooseleaf_tries 50 > > > > > > step set_choose_tries 200 > > > > > > step take default > > > > > > step chooseleaf indep 3 type host > > > > > > step emit > > > > > > step take default > > > > > > step chooseleaf indep -3 type host > > > > > > step emit > > > > > > } > > > > > > > > > > > > rule fs-data { > > > > > > id 1 > > > > > > type erasure > > > > > > min_size 3 > > > > > > max_size 6 > > > > > > step set_chooseleaf_tries 50 > > > > > > step set_choose_tries 200 > > > > > > step take default > > > > > > step choose indep 3 type osd > > > > > > step emit > > > > > > step take default > > > > > > step choose indep -3 type osd > > > > > > step emit > > > > > > } > > > > > > > > > > > > Of course, now the current weights are probably unsuitable as > > > > > > everything moves around. Its probably also a lot more total tries > > > > > > to get rid of mappings with duplicate OSDs. > > > > > > > > > > > > I probably have to read the code to understand how drawing straws > > > > > > from 8 different buckets with non-zero probabilities can lead to an > > > > > > infinite sequence of failed attempts of getting 6 different ones. > > > > > > There seems to be a hard-coded tunable that turns seemingly > > > > > > infinite into finite somehow. > > > > > > > > > > > > The first modified rule will probably lead to better distribution > > > > > > of load, but bad distribution of data if a disk goes down > > > > > > (considering the tiny host- and disk numbers). 
The second rule > > > > > > seems to be almost as good or bad as the default one (step choose > > > > > > indep 0 type osd), except that it does produce valid mappings where > > > > > > the default rule fails. > > > > > > > > > > > > I will wait with changing the rule in the hope that you find a more > > > > > > elegant solution to this riddle. > > > > > > > > > > > > Best regards, > > > > > > ================= > > > > > > Frank Schilder > > > > > > AIT Risø Campus > > > > > > Bygning 109, rum S14 > > > > > > > > > > > > ________________________________________ > > > > > > From: Dan van der Ster <dvand...@gmail.com> > > > > > > Sent: 29 August 2022 19:13 > > > > > > To: Frank Schilder > > > > > > Subject: Re: [ceph-users] Bug in crush algorithm? 1 PG with the > > > > > > same OSD twice. > > > > > > > > > > > > Hi Frank, > > > > > > > > > > > > Could you share the osdmap so I can try to solve this riddle? > > > > > > > > > > > > Cheers , Dan > > > > > > > > > > > > > > > > > > On Mon, Aug 29, 2022, 17:26 Frank Schilder > > > > > > <fr...@dtu.dk<mailto:fr...@dtu.dk>> wrote: > > > > > > Hi Dan, > > > > > > > > > > > > thanks for your answer. I'm not really convinced that we hit a > > > > > > corner case here and even if its one, it seems quite relevant for > > > > > > production clusters. The usual way to get a valid mapping is to > > > > > > increase the number of tries. I increased the following max trial > > > > > > numbers, which I would expect to produce a mapping for all PGs: > > > > > > > > > > > > # diff map-now.txt map-new.txt > > > > > > 4c4 > > > > > > < tunable choose_total_tries 50 > > > > > > --- > > > > > > > tunable choose_total_tries 250 > > > > > > 93,94c93,94 > > > > > > < step set_chooseleaf_tries 5 > > > > > > < step set_choose_tries 100 > > > > > > --- > > > > > > > step set_chooseleaf_tries 50 > > > > > > > step set_choose_tries 200 > > > > > > > > > > > > When I test the map with crushtool it does not report bad mappings. > > > > > > Am I looking at the wrong tunables to increase? It should be > > > > > > possible to get valid mappings without having to modify the > > > > > > re-weights. > > > > > > > > > > > > Thanks again for your help! > > > > > > ================= > > > > > > Frank Schilder > > > > > > AIT Risø Campus > > > > > > Bygning 109, rum S14 > > > > > > > > > > > > ________________________________________ > > > > > > From: Dan van der Ster > > > > > > <dvand...@gmail.com<mailto:dvand...@gmail.com>> > > > > > > Sent: 29 August 2022 16:52:52 > > > > > > To: Frank Schilder > > > > > > Cc: ceph-users@ceph.io<mailto:ceph-users@ceph.io> > > > > > > Subject: Re: [ceph-users] Bug in crush algorithm? 1 PG with the > > > > > > same OSD twice. > > > > > > > > > > > > Hi Frank, > > > > > > > > > > > > CRUSH can only find 5 OSDs, given your current tree, rule, and > > > > > > reweights. This is why there is a NONE in the UP set for shard 6. > > > > > > But in ACTING we see that it is refusing to remove shard 6 from > > > > > > osd.1 > > > > > > -- that is the only copy of that shard, so in this case it's helping > > > > > > you rather than deleting the shard altogether. > > > > > > ACTING == what the OSDs are serving now. > > > > > > UP == where CRUSH wants to place the shards. > > > > > > > > > > > > I suspect that this is a case of CRUSH tunables + your reweights > > > > > > putting CRUSH in a corner case of not finding 6 OSDs for that > > > > > > particular PG. > > > > > > If you set the reweights all back to 1, it probably finds 6 OSDs? 
> > > > > > > > > > > > Cheers, Dan > > > > > > > > > > > > > > > > > > On Mon, Aug 29, 2022 at 4:44 PM Frank Schilder > > > > > > <fr...@dtu.dk<mailto:fr...@dtu.dk>> wrote: > > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > I'm investigating a problem with a degenerated PG on an octopus > > > > > > > 15.2.16 test cluster. It has 3Hosts x 3OSDs and a 4+2 EC pool > > > > > > > with failure domain OSD. After simulating a disk fail by removing > > > > > > > an OSD and letting the cluster recover (all under load), I end up > > > > > > > with a PG with the same OSD allocated twice: > > > > > > > > > > > > > > PG 4.1c, UP: [6,1,4,5,3,NONE] ACTING: [6,1,4,5,3,1] > > > > > > > > > > > > > > OSD 1 is allocated twice. How is this even possible? > > > > > > > > > > > > > > Here the OSD tree: > > > > > > > > > > > > > > ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT > > > > > > > PRI-AFF > > > > > > > -1 2.44798 root default > > > > > > > -7 0.81599 host tceph-01 > > > > > > > 0 hdd 0.27199 osd.0 up 0.87999 > > > > > > > 1.00000 > > > > > > > 3 hdd 0.27199 osd.3 up 0.98000 > > > > > > > 1.00000 > > > > > > > 6 hdd 0.27199 osd.6 up 0.92999 > > > > > > > 1.00000 > > > > > > > -3 0.81599 host tceph-02 > > > > > > > 2 hdd 0.27199 osd.2 up 0.95999 > > > > > > > 1.00000 > > > > > > > 4 hdd 0.27199 osd.4 up 0.89999 > > > > > > > 1.00000 > > > > > > > 8 hdd 0.27199 osd.8 up 0.89999 > > > > > > > 1.00000 > > > > > > > -5 0.81599 host tceph-03 > > > > > > > 1 hdd 0.27199 osd.1 up 0.89999 > > > > > > > 1.00000 > > > > > > > 5 hdd 0.27199 osd.5 up 1.00000 > > > > > > > 1.00000 > > > > > > > 7 hdd 0.27199 osd.7 destroyed 0 > > > > > > > 1.00000 > > > > > > > > > > > > > > I tried already to change some tunables thinking about > > > > > > > https://docs.ceph.com/en/octopus/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon, > > > > > > > but giving up too soon is obviously not the problem. It is > > > > > > > accepting a wrong mapping. > > > > > > > > > > > > > > Is there a way out of this? Clearly this is calling for trouble > > > > > > > if not data loss and should not happen at all. > > > > > > > > > > > > > > Best regards, > > > > > > > ================= > > > > > > > Frank Schilder > > > > > > > AIT Risø Campus > > > > > > > Bygning 109, rum S14 > > > > > > > _______________________________________________ > > > > > > > ceph-users mailing list -- > > > > > > > ceph-users@ceph.io<mailto:ceph-users@ceph.io> > > > > > > > To unsubscribe send an email to > > > > > > > ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io> _______________________________________________ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io