I’m hesitant to do this, too. I think I’ll pass and just wait for the remapping. :-)
George

> On Jun 5, 2020, at 12:58 PM, Frank Schilder <fr...@dtu.dk> wrote:
>
> I never changed IDs before, I'm just extra cautious. If they do not show up explicitly anywhere else than inside the bucket definitions, then it is probably an easy edit, just swapping them.
>
> If you try this, could you please report back to the list whether it works as expected, maybe with example crush maps/items included to illustrate the edits for documentation purposes?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Kyriazis, George <george.kyria...@intel.com>
> Sent: 05 June 2020 17:46
> To: Frank Schilder
> Cc: ceph-users; Wido den Hollander
> Subject: Re: Best way to change bucket hierarchy
>
> Hmm,
>
> From what I see in the crush map, “nodes” refer to other “nodes” by name, not by ID. In fact, I don’t see anything in the crush map referred to by ID. As we said before, though, the crush algorithm figures out the hashes based on the IDs. I am not sure what else (outside the crush map) refers to them, though, which would need checking to make sure the references stay correct.
>
> Thanks,
>
> George
>
>
>> On Jun 5, 2020, at 10:32 AM, Frank Schilder <fr...@dtu.dk> wrote:
>>
>> Wido replied to you, check this thread.
>>
>> You really need to understand exactly the file you get. The IDs are used to refer to items from within other items. You need to make sure that any such cross-reference is updated as well. It is not just changing the ID tag in a bucket item; you also need to update all places that refer to a bucket by ID. The crush map defines a tree structure, and a wrong reference can get you into serious trouble.
>>
>> Before attempting anything like this, make sure you have a backup of the original crush map (in several places).
>>
>> Generally speaking, your tweaking of the crush map is maybe a bit premature. You wrote you want to add quite a number of servers. Why don't you do the crush map change together with that? All the data will be reshuffled then anyway.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Kyriazis, George <george.kyria...@intel.com>
>> Sent: 05 June 2020 17:21
>> To: Frank Schilder
>> Cc: ceph-users; Wido den Hollander
>> Subject: Re: Best way to change bucket hierarchy
>>
>> Hmm,
>>
>> Sounds quite dangerous. On the other hand, from prior experience it could take weeks/months for the cluster to rebalance, so I'll give it a try.
>>
>> From the looks of it, there is no other reference to IDs, is that correct? Just swap IDs between chassis and host and I should be OK? (Sorry, I’m not following the list closely, so I am not aware of Wido’s procedure.)
>>
>> Thanks,
>>
>> George
>>
>>
>>> On Jun 5, 2020, at 1:29 AM, Frank Schilder <fr...@dtu.dk> wrote:
>>>
>>> Hi George,
>>>
>>> Yes, I believe your interpretation is correct. Because the chassis buckets have new bucket IDs, the distribution hashing will change. I also believe that the trick to avoid data movement in your situation is to export the new crush map, swap the IDs between each corresponding host and chassis bucket in *all* (!!!) occurrences, and import. This is possible because you currently have the special case of a one-to-one correspondence between hosts and chassis.
>>>
>>> This would be the procedure Wido explained, and there is no other choice for this edit.
>>>
>>> Whether you want to do that depends on how far you are into the data movement. If it's almost done, I wouldn't bother. If it's another month, it might be worth trying. As far as I can see, your crush map is going to be a short text file, so it should be feasible to edit.
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
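For reference, the export/edit/import cycle described above would look roughly like the following. The commands are standard, but the bucket IDs are taken from the tree quoted further down in this thread and the edit itself is only sketched in comments:

  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # in crush.txt, swap the "id" line of each host bucket with that of its
  # one-to-one chassis bucket, e.g. host vis-hsw-01: id -9 -> id -5, and
  # chassis chassis-hsw1: id -5 -> id -9; if the map also carries per-class
  # shadow ids (lines like "id -N class hdd"), those presumably need the
  # same treatment - check your own decompiled map
  crushtool -c crush.txt -o crush-new.bin
  ceph osd setcrushmap -i crush-new.bin

Keeping the original crush.bin around means the change can be rolled back with "ceph osd setcrushmap -i crush.bin".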
>>>
>>> ________________________________________
>>> From: Kyriazis, George <george.kyria...@intel.com>
>>> Sent: 05 June 2020 01:36
>>> To: Frank Schilder
>>> Cc: ceph-users
>>> Subject: Re: Best way to change bucket hierarchy
>>>
>>> Understood that it’s difficult to debug remotely. :-)
>>>
>>> In my current scenario I have 5 machines (1 host per chassis), but I am planning on adding some additional chassis with 4 hosts per chassis in the near future. Currently I am going through the first stage of adding “stub” chassis for the 5 hosts/chassis that I have, basically reparenting each host to its own chassis, as shown below:
>>>
>>> ID  CLASS WEIGHT    TYPE NAME                  STATUS REWEIGHT PRI-AFF
>>>  -1       203.72598 root default
>>>  -5        40.01700     chassis chassis-hsw1
>>>  -9        40.01700         host vis-hsw-01
>>>   3   hdd  10.91299             osd.3              up  1.00000 1.00000
>>>   6   hdd  14.55199             osd.6              up  1.00000 1.00000
>>>  10   hdd  14.55199             osd.10             up  1.00000 1.00000
>>>  -6        40.01700     chassis chassis-hsw2
>>> -13        40.01700         host vis-hsw-02
>>>   0   hdd  10.91299             osd.0              up  1.00000 1.00000
>>>   7   hdd  14.55199             osd.7              up  1.00000 1.00000
>>>  11   hdd  14.55199             osd.11             up  1.00000 1.00000
>>>  -7        40.01700     chassis chassis-hsw3
>>> -11        40.01700         host vis-hsw-03
>>>   4   hdd  10.91299             osd.4              up  1.00000 1.00000
>>>   8   hdd  14.55199             osd.8              up  1.00000 1.00000
>>>  12   hdd  14.55199             osd.12             up  1.00000 1.00000
>>>  -8        40.01700     chassis chassis-hsw4
>>>  -3        40.01700         host vis-hsw-04
>>>   5   hdd  10.91299             osd.5              up  1.00000 1.00000
>>>   9   hdd  14.55199             osd.9              up  1.00000 1.00000
>>>  13   hdd  14.55199             osd.13             up  1.00000 1.00000
>>> -17        43.65799     chassis chassis-hsw5
>>> -15        43.65799         host vis-hsw-05
>>>   1   hdd  14.55299             osd.1              up  1.00000 1.00000
>>>   2   hdd  14.55299             osd.2              up  1.00000 1.00000
>>>  14   hdd  14.55299             osd.14             up  1.00000 1.00000
>>>
>>> There is no additional constraint being added, so ideally there would be no data movement. However, I can imagine that the CRUSH algorithm could hash the PGs onto different OSDs now because there is a new thing to consider (namely the chassis). Does it do that?
>>>
>>> Thanks,
>>>
>>> George
>>>
>>>
>>> On Jun 4, 2020, at 6:22 PM, Frank Schilder <fr...@dtu.dk> wrote:
>>>
>>> It's hard to tell without knowing what the diff is, but from your description I take it that you changed the failure domain for every(?) pool from host to chassis. I don't know what a chassis is in your architecture, but if each chassis contains several host buckets, then yes, I would expect almost every PG to be affected.
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Kyriazis, George <george.kyria...@intel.com>
>>> Sent: 05 June 2020 00:28:43
>>> To: Frank Schilder
>>> Cc: ceph-users
>>> Subject: Re: Best way to change bucket hierarchy
>>>
>>> Hmm,
>>>
>>> So I tried all that, and I got almost all of my PGs being remapped. The crush map looks correct. Is that normal?
>>>
>>> Thanks,
>>>
>>> George
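One way to estimate the impact before injecting an edited map is to run both the old and the new binary map through crushtool's test mode and diff the mappings. This is not something suggested in the thread itself, and the rule number, replica count and sample range below are illustrative only:

  # assuming crush.bin is the current map and crush-new.bin the edited one
  crushtool -i crush.bin --test --show-mappings --rule 0 --num-rep 3 --min-x 0 --max-x 1023 > old.txt
  crushtool -i crush-new.bin --test --show-mappings --rule 0 --num-rep 3 --min-x 0 --max-x 1023 > new.txt
  diff old.txt new.txt | wc -l    # rough count of sample inputs whose mapping changes

It only samples CRUSH inputs rather than real PGs, but it gives a feel for whether "almost everything" or "almost nothing" would move.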
>>>
>>> On Jun 4, 2020, at 2:33 PM, Frank Schilder <fr...@dtu.dk> wrote:
>>>
>>> Hi George,
>>>
>>> You don't need to worry about that too much. The EC profile contains two types of information: one part about the actual EC encoding and another part about crush parameters. Unfortunately. Part of this information is mutable after pool creation while the rest is not. Mutable here means outside of the profile. You can change the failure domain in the crush map without issues, but the profile won't reflect that change. That's an inconsistency we currently have to live with; it would have been better to separate mutable data (like failure domain) from immutable data (like k and m), or to provide a meaningful interface to maintain consistency of mutable information.
>>>
>>> In short, don't believe everything the EC profile tells you. Some information might be out of date, like the failure domain or the device class (basically everything starting with crush-). If you remember that, you are out of trouble. Always dump the crush rule of an EC pool explicitly to see the true parameters in action.
>>>
>>> Having said that, to change the failure domain for an EC pool, change the crush rule of the pool - I did this too and it works just fine. The crush rule has by default the same name as the pool. I'm afraid, here you will have to do a manual edit of the crush rule as Wido explained. There is no other way - at least currently not.
>>>
>>> You can ask in this list for confirmation that your change is doing what you want.
>>>
>>> Do not try to touch an EC profile; they are read-only anyway. The crush parameters are only used at pool creation and never looked at again. You can override these by editing the crush rule as explained above.
>>>
>>> Best regards and good luck,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Kyriazis, George <george.kyria...@intel.com>
>>> Sent: 04 June 2020 20:56:38
>>> To: Frank Schilder
>>> Cc: ceph-users
>>> Subject: Re: Best way to change bucket hierarchy
>>>
>>> Thanks Frank,
>>>
>>> Interesting info about the EC profile. I do have an EC pool, and I noticed the following when I dumped the profile:
>>>
>>> # ceph osd erasure-code-profile get ec22
>>> crush-device-class=hdd
>>> crush-failure-domain=host
>>> crush-root=default
>>> jerasure-per-chunk-alignment=false
>>> k=2
>>> m=2
>>> plugin=jerasure
>>> technique=reed_sol_van
>>> w=8
>>> #
>>>
>>> It says that the failure domain of the EC profile is also set to host. It looks like I need to change the EC profile too, but since it is associated with the pool, maybe I can’t do that after pool creation? Or… since the property is named “crush-failure-domain”, is it automatically inherited from the crush map, so I don’t have to do anything?
>>>
>>> Thanks,
>>>
>>> George
>>>
>>>
>>> On Jun 4, 2020, at 1:51 AM, Frank Schilder <fr...@dtu.dk> wrote:
>>>
>>> Hi George,
>>>
>>> For replicated rules you can simply create a new crush rule with the new failure domain set to chassis and change any pool's crush rule to this new one. If you have EC pools, then the chooseleaf step needs to be edited by hand. I did this before as well. (A really unfortunate side effect is that the EC profile attached to the pool goes out of sync with the crush map and there is nothing one can do about that. This is annoying yet harmless.)
>>>
>>> The intent of doing these changes while norebalance is set is
>>>
>>> - to avoid unnecessary data movement due to successive changes happening step by step, and
>>> - to make sure peering is successful before starting to move data.
>>>
>>> I believe OSDs peer a bit faster with norebalance set, and there is then a shorter interruption to ongoing I/O (no I/O happens to a PG during peering).
>>>
>>> Yes, if you save the old crush map, you can undo everything. It is a good idea to have a backup also just for reference and to compare before and after.
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
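To make the two cases concrete, a rough sketch; the pool and rule names are invented for illustration, and the EC rule body is only what a decompiled map typically looks like, so verify against your own dump:

  # replicated pools: new rule with chassis as failure domain, then repoint the pool
  ceph osd crush rule create-replicated replicated_chassis default chassis
  ceph osd pool set mypool crush_rule replicated_chassis

  # EC pools: find and dump the rule actually attached to the pool
  ceph osd pool get my_ec_pool crush_rule
  ceph osd crush rule dump <rule-name>

  # in the decompiled crush map, the EC rule then needs a hand edit roughly like
  # this; note EC rules normally use "indep", not the "firstn" of replicated rules
  rule my_ec_pool {
          id 2
          type erasure
          step set_chooseleaf_tries 5
          step set_choose_tries 100
          step take default class hdd
          step chooseleaf indep 0 type host      # change "host" to "chassis"
          step emit
  }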
>>>
>>> ________________________________________
>>> From: Kyriazis, George <george.kyria...@intel.com>
>>> Sent: 04 June 2020 00:58:20
>>> To: Frank Schilder
>>> Cc: ceph-users
>>> Subject: Re: Best way to change bucket hierarchy
>>>
>>> Thanks Frank,
>>>
>>> I don’t have too much experience editing crush rules, but I assume the chooseleaf step would also have to change to:
>>>
>>> step chooseleaf firstn 0 type chassis
>>>
>>> Correct? Is that the only other change that is needed? It looks like the rule change can happen both inside and outside the “norebalance” setting (again with CLI commands), but is it safer to do it inside (i.e. while not rebalancing)?
>>>
>>> If I keep a backup of the crush map (with “ceph osd getcrushmap”), I assume I can restore the old map if something goes bad?
>>>
>>> Thanks again!
>>>
>>> George
>>>
>>>
>>> On Jun 3, 2020, at 5:24 PM, Frank Schilder <fr...@dtu.dk> wrote:
>>>
>>> You can use the command line without editing the crush map. Look at the documentation of commands like
>>>
>>> ceph osd crush add-bucket ...
>>> ceph osd crush move ...
>>>
>>> Before starting this, set "ceph osd set norebalance" and unset it after you are happy with the crush tree. Let everything peer. You should see misplaced objects and remapped PGs, but no degraded objects or PGs.
>>>
>>> Do this only when the cluster is HEALTH_OK, otherwise things can get really complicated.
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
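Spelled out for this cluster, with bucket names taken from the tree quoted earlier in the thread; a sketch of the sequence, not a verified script:

  ceph osd set norebalance
  ceph osd crush add-bucket chassis-hsw1 chassis
  ceph osd crush move chassis-hsw1 root=default
  ceph osd crush move vis-hsw-01 chassis=chassis-hsw1
  # repeat the add-bucket/move pair for chassis-hsw2..5 / vis-hsw-02..05
  ceph osd tree                 # check the hierarchy and let everything peer
  ceph osd unset norebalance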
>>>
>>> ________________________________________
>>> From: Kyriazis, George <george.kyria...@intel.com>
>>> Sent: 03 June 2020 22:45:11
>>> To: ceph-users
>>> Subject: [ceph-users] Best way to change bucket hierarchy
>>>
>>> Hello,
>>>
>>> I have a live ceph cluster, and I need to modify the bucket hierarchy. I am currently using the default crush rule (i.e. keep each replica on a different host). My need is to add a “chassis” level and keep replicas separated per chassis.
>>>
>>> From what I read in the documentation, I would have to edit the crush file manually; however, this sounds kinda scary for a live cluster.
>>>
>>> Are there any “best known methods” to achieve that goal without messing things up?
>>>
>>> In my current scenario, I have one host per chassis, and I am planning on later adding nodes where there would be more than one host per chassis. It looks like “in theory” there wouldn’t be a need for any data movement after the crush map changes. Will reality match theory? Anything else I need to watch out for?
>>>
>>> Thank you!
>>>
>>> George
>>>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io