I’m hesitant to do this, too. I think I’ll pass and just wait for the remapping. :-)
George

> On Jun 5, 2020, at 12:58 PM, Frank Schilder <fr...@dtu.dk> wrote:
>
> I never changed IDs before, I'm just extra cautious. If they do not show up explicitly anywhere else than inside the bucket definitions, then it is probably an easy edit, just swapping them.
>
> If you try this, could you please report back to the list whether it works as expected, maybe with example crush maps/items included to illustrate the edits for documentation purposes?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Kyriazis, George <george.kyria...@intel.com>
> Sent: 05 June 2020 17:46
> To: Frank Schilder
> Cc: ceph-users; Wido den Hollander
> Subject: Re: Best way to change bucket hierarchy
>
> Hmm,
>
> From what I see in the crush map, “nodes” refer to other “nodes” by name, not by ID. In fact, I don’t see anything in the crush map referred to by ID. As we said before, though, the crush algorithm figures out the hashes based on the IDs. I am not sure what else (outside the crush map) refers to them, though, which would need checking to make sure the references stay correct.
>
> Thanks,
>
> George
>
>
>> On Jun 5, 2020, at 10:32 AM, Frank Schilder <fr...@dtu.dk> wrote:
>>
>> Wido replied to you, check this thread.
>>
>> You really need to understand exactly the file you get. The IDs are used to refer to items from within other items. You need to make sure that any such cross-reference is updated as well. It is not just changing the ID tag in a bucket item; you also need to update all places that refer to a bucket by ID. The crush map defines a tree structure, and a wrong reference can get you into serious trouble.
>>
>> Before attempting anything like this, make sure you have a backup of the original crush map (in several places).
>>
>> Generally speaking, your tweaking of the crush map is maybe a bit premature. You wrote you want to add quite a number of servers. Why don't you do the crush map change together with that? All the data will be reshuffled then anyway.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Kyriazis, George <george.kyria...@intel.com>
>> Sent: 05 June 2020 17:21
>> To: Frank Schilder
>> Cc: ceph-users; Wido den Hollander
>> Subject: Re: Best way to change bucket hierarchy
>>
>> Hmm,
>>
>> Sounds quite dangerous. On the other hand, from prior experience it could take weeks/months for the cluster to rebalance, so I'll give it a try.
>>
>> From the looks of it, there is no other reference to IDs, is that correct? Just swap IDs between chassis and host and I should be OK? (Sorry, I’m not following the list closely, so I am not aware of Wido’s procedure.)
>>
>> Thanks,
>>
>> George
>>
>>
>>> On Jun 5, 2020, at 1:29 AM, Frank Schilder <fr...@dtu.dk> wrote:
>>>
>>> Hi George,
>>>
>>> Yes, I believe your interpretation is correct. Because the chassis buckets have new bucket IDs, the distribution hashing will change. I also believe that the trick to avoid data movement in your situation is to export the new crush map, swap the IDs between each corresponding host and chassis bucket in *all* (!!!) occurrences, and import. This is possible because you currently have the special case of a one-to-one correspondence between hosts and chassis.
>>>
>>> This would be the procedure Wido explained, and there is no other choice for this edit.
>>>
>>> Whether you want to do that depends on how far you are into the data movement. If it's almost done, I wouldn't bother. If it's another month, it might be worth trying. As far as I can see, your crush map is going to be a short text file, so it should be feasible to edit.
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
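For reference, the export/edit/import cycle described above would look roughly like the following. The commands are standard, but the bucket IDs are taken from the tree quoted further down in this thread and the edit itself is only sketched in comments:

  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # in crush.txt, swap the "id" line of each host bucket with that of its
  # one-to-one chassis bucket, e.g. host vis-hsw-01: id -9 -> id -5, and
  # chassis chassis-hsw1: id -5 -> id -9; if the map also carries per-class
  # shadow ids (lines like "id -N class hdd"), those presumably need the
  # same treatment - check your own decompiled map
  crushtool -c crush.txt -o crush-new.bin
  ceph osd setcrushmap -i crush-new.bin

Keeping the original crush.bin around means the change can be rolled back with "ceph osd setcrushmap -i crush.bin".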
>>>
>>> ________________________________________
>>> From: Kyriazis, George <george.kyria...@intel.com>
>>> Sent: 05 June 2020 01:36
>>> To: Frank Schilder
>>> Cc: ceph-users
>>> Subject: Re: Best way to change bucket hierarchy
>>>
>>> Understood that it’s difficult to debug remotely. :-)
>>>
>>> In my current scenario I have 5 machines (1 host per chassis), but I am planning on adding some additional chassis with 4 hosts per chassis in the near future. Currently I am going through the first stage of adding “stub” chassis for the 5 hosts/chassis that I have, basically reparenting each host to its own chassis, as shown below:
>>>
>>> ID  CLASS WEIGHT    TYPE NAME                  STATUS REWEIGHT PRI-AFF
>>>  -1       203.72598 root default
>>>  -5        40.01700     chassis chassis-hsw1
>>>  -9        40.01700         host vis-hsw-01
>>>   3   hdd  10.91299             osd.3              up  1.00000 1.00000
>>>   6   hdd  14.55199             osd.6              up  1.00000 1.00000
>>>  10   hdd  14.55199             osd.10             up  1.00000 1.00000
>>>  -6        40.01700     chassis chassis-hsw2
>>> -13        40.01700         host vis-hsw-02
>>>   0   hdd  10.91299             osd.0              up  1.00000 1.00000
>>>   7   hdd  14.55199             osd.7              up  1.00000 1.00000
>>>  11   hdd  14.55199             osd.11             up  1.00000 1.00000
>>>  -7        40.01700     chassis chassis-hsw3
>>> -11        40.01700         host vis-hsw-03
>>>   4   hdd  10.91299             osd.4              up  1.00000 1.00000
>>>   8   hdd  14.55199             osd.8              up  1.00000 1.00000
>>>  12   hdd  14.55199             osd.12             up  1.00000 1.00000
>>>  -8        40.01700     chassis chassis-hsw4
>>>  -3        40.01700         host vis-hsw-04
>>>   5   hdd  10.91299             osd.5              up  1.00000 1.00000
>>>   9   hdd  14.55199             osd.9              up  1.00000 1.00000
>>>  13   hdd  14.55199             osd.13             up  1.00000 1.00000
>>> -17        43.65799     chassis chassis-hsw5
>>> -15        43.65799         host vis-hsw-05
>>>   1   hdd  14.55299             osd.1              up  1.00000 1.00000
>>>   2   hdd  14.55299             osd.2              up  1.00000 1.00000
>>>  14   hdd  14.55299             osd.14             up  1.00000 1.00000
>>>
>>> There is no additional constraint being added, so ideally there would be no data movement. However, I can imagine that the CRUSH algorithm could hash the PGs onto different OSDs now because there is a new thing to consider (namely the chassis). Does it do that?
>>>
>>> Thanks,
>>>
>>> George
>>>
>>>
>>> On Jun 4, 2020, at 6:22 PM, Frank Schilder <fr...@dtu.dk> wrote:
>>>
>>> It's hard to tell without knowing what the diff is, but from your description I take it that you changed the failure domain for every(?) pool from host to chassis. I don't know what a chassis is in your architecture, but if each chassis contains several host buckets, then yes, I would expect almost every PG to be affected.
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Kyriazis, George <george.kyria...@intel.com>
>>> Sent: 05 June 2020 00:28:43
>>> To: Frank Schilder
>>> Cc: ceph-users
>>> Subject: Re: Best way to change bucket hierarchy
>>>
>>> Hmm,
>>>
>>> So I tried all that, and I got almost all of my PGs being remapped. The crush map looks correct. Is that normal?
>>>
>>> Thanks,
>>>
>>> George
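One way to estimate the impact before injecting an edited map is to run both the old and the new binary map through crushtool's test mode and diff the mappings. This is not something suggested in the thread itself, and the rule number, replica count and sample range below are illustrative only:

  # assuming crush.bin is the current map and crush-new.bin the edited one
  crushtool -i crush.bin --test --show-mappings --rule 0 --num-rep 3 --min-x 0 --max-x 1023 > old.txt
  crushtool -i crush-new.bin --test --show-mappings --rule 0 --num-rep 3 --min-x 0 --max-x 1023 > new.txt
  diff old.txt new.txt | wc -l    # rough count of sample inputs whose mapping changes

It only samples CRUSH inputs rather than real PGs, but it gives a feel for whether "almost everything" or "almost nothing" would move.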
>>>
>>> On Jun 4, 2020, at 2:33 PM, Frank Schilder <fr...@dtu.dk> wrote:
>>>
>>> Hi George,
>>>
>>> You don't need to worry about that too much. The EC profile contains two types of information: one part about the actual EC encoding and another part about crush parameters. Unfortunately. Part of this information is mutable after pool creation while the rest is not. Mutable here means outside of the profile. You can change the failure domain in the crush map without issues, but the profile won't reflect that change. That's an inconsistency we currently have to live with; it would have been better to separate mutable data (like failure domain) from immutable data (like k and m), or to provide a meaningful interface to maintain consistency of mutable information.
>>>
>>> In short, don't believe everything the EC profile tells you. Some information might be out of date, like the failure domain or the device class (basically everything starting with crush-). If you remember that, you are out of trouble. Always dump the crush rule of an EC pool explicitly to see the true parameters in action.
>>>
>>> Having said that, to change the failure domain for an EC pool, change the crush rule of the pool - I did this too and it works just fine. The crush rule has by default the same name as the pool. I'm afraid, here you will have to do a manual edit of the crush rule as Wido explained. There is no other way - at least currently not.
>>>
>>> You can ask in this list for confirmation that your change is doing what you want.
>>>
>>> Do not try to touch an EC profile; they are read-only anyway. The crush parameters are only used at pool creation and never looked at again. You can override these by editing the crush rule as explained above.
>>>
>>> Best regards and good luck,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Kyriazis, George <george.kyria...@intel.com>
>>> Sent: 04 June 2020 20:56:38
>>> To: Frank Schilder
>>> Cc: ceph-users
>>> Subject: Re: Best way to change bucket hierarchy
>>>
>>> Thanks Frank,
>>>
>>> Interesting info about the EC profile. I do have an EC pool, and I noticed the following when I dumped the profile:
>>>
>>> # ceph osd erasure-code-profile get ec22
>>> crush-device-class=hdd
>>> crush-failure-domain=host
>>> crush-root=default
>>> jerasure-per-chunk-alignment=false
>>> k=2
>>> m=2
>>> plugin=jerasure
>>> technique=reed_sol_van
>>> w=8
>>> #
>>>
>>> It says that the failure domain of the EC profile is also set to host. It looks like I need to change the EC profile too, but since it is associated with the pool, maybe I can’t do that after pool creation? Or… since the property is named “crush-failure-domain”, is it automatically inherited from the crush map, so I don’t have to do anything?
>>>
>>> Thanks,
>>>
>>> George
>>>
>>>
>>> On Jun 4, 2020, at 1:51 AM, Frank Schilder <fr...@dtu.dk> wrote:
>>>
>>> Hi George,
>>>
>>> For replicated rules you can simply create a new crush rule with the new failure domain set to chassis and change any pool's crush rule to this new one. If you have EC pools, then the chooseleaf step needs to be edited by hand. I did this before as well. (A really unfortunate side effect is that the EC profile attached to the pool goes out of sync with the crush map and there is nothing one can do about that. This is annoying yet harmless.)
>>>
>>> The intent of doing these changes while norebalance is set is
>>>
>>> - to avoid unnecessary data movement due to successive changes happening step by step, and
>>> - to make sure peering is successful before starting to move data.
>>>
>>> I believe OSDs peer a bit faster with norebalance set, and there is then a shorter interruption to ongoing I/O (no I/O happens to a PG during peering).
>>>
>>> Yes, if you save the old crush map, you can undo everything. It is a good idea to have a backup also just for reference and to compare before and after.
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
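To make the two cases concrete, a rough sketch; the pool and rule names are invented for illustration, and the EC rule body is only what a decompiled map typically looks like, so verify against your own dump:

  # replicated pools: new rule with chassis as failure domain, then repoint the pool
  ceph osd crush rule create-replicated replicated_chassis default chassis
  ceph osd pool set mypool crush_rule replicated_chassis

  # EC pools: find and dump the rule actually attached to the pool
  ceph osd pool get my_ec_pool crush_rule
  ceph osd crush rule dump <rule-name>

  # in the decompiled crush map, the EC rule then needs a hand edit roughly like
  # this; note EC rules normally use "indep", not the "firstn" of replicated rules
  rule my_ec_pool {
          id 2
          type erasure
          step set_chooseleaf_tries 5
          step set_choose_tries 100
          step take default class hdd
          step chooseleaf indep 0 type host      # change "host" to "chassis"
          step emit
  }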
>>>
>>> ________________________________________
>>> From: Kyriazis, George <george.kyria...@intel.com>
>>> Sent: 04 June 2020 00:58:20
>>> To: Frank Schilder
>>> Cc: ceph-users
>>> Subject: Re: Best way to change bucket hierarchy
>>>
>>> Thanks Frank,
>>>
>>> I don’t have too much experience editing crush rules, but I assume the chooseleaf step would also have to change to:
>>>
>>> step chooseleaf firstn 0 type chassis
>>>
>>> Correct? Is that the only other change that is needed? It looks like the rule change can happen both inside and outside the “norebalance” setting (again with CLI commands), but is it safer to do it inside (i.e. while not rebalancing)?
>>>
>>> If I keep a backup of the crush map (with “ceph osd getcrushmap”), I assume I can restore the old map if something goes bad?
>>>
>>> Thanks again!
>>>
>>> George
>>>
>>>
>>> On Jun 3, 2020, at 5:24 PM, Frank Schilder <fr...@dtu.dk> wrote:
>>>
>>> You can use the command line without editing the crush map. Look at the documentation of commands like
>>>
>>> ceph osd crush add-bucket ...
>>> ceph osd crush move ...
>>>
>>> Before starting this, set "ceph osd set norebalance" and unset it after you are happy with the crush tree. Let everything peer. You should see misplaced objects and remapped PGs, but no degraded objects or PGs.
>>>
>>> Do this only when the cluster is HEALTH_OK, otherwise things can get really complicated.
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
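Spelled out for this cluster, with bucket names taken from the tree quoted earlier in the thread; a sketch of the sequence, not a verified script:

  ceph osd set norebalance
  ceph osd crush add-bucket chassis-hsw1 chassis
  ceph osd crush move chassis-hsw1 root=default
  ceph osd crush move vis-hsw-01 chassis=chassis-hsw1
  # repeat the add-bucket/move pair for chassis-hsw2..5 / vis-hsw-02..05
  ceph osd tree                 # check the hierarchy and let everything peer
  ceph osd unset norebalance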
>>>
>>> ________________________________________
>>> From: Kyriazis, George <george.kyria...@intel.com>
>>> Sent: 03 June 2020 22:45:11
>>> To: ceph-users
>>> Subject: [ceph-users] Best way to change bucket hierarchy
>>>
>>> Hello,
>>>
>>> I have a live ceph cluster, and I need to modify the bucket hierarchy. I am currently using the default crush rule (i.e. keep each replica on a different host). My need is to add a “chassis” level and keep replicas separated per chassis.
>>>
>>> From what I read in the documentation, I would have to edit the crush file manually; however, this sounds kinda scary for a live cluster.
>>>
>>> Are there any “best known methods” to achieve that goal without messing things up?
>>>
>>> In my current scenario, I have one host per chassis, and I am planning on later adding nodes where there would be more than one host per chassis. It looks like “in theory” there wouldn’t be a need for any data movement after the crush map changes. Will reality match theory? Anything else I need to watch out for?
>>>
>>> Thank you!
>>>
>>> George
>>>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io