Some kernels (EL7?) lie about being Jewel until after they are blocked from 
connecting at Jewel; then they report newer.  Just FYI.

________________________________________
From: Anthony D'Atri <anthony.da...@gmail.com>
Sent: Tuesday, August 6, 2024 5:08 PM
To: Fabien Sirjean
Cc: ceph-users
Subject: [ceph-users] Re: What's the best way to add numerous OSDs?

Since they’re 20TB, I’m going to assume that these are HDDs.

There are a number of approaches.  One common theme is to avoid rebalancing 
until after all of the new OSDs have been added to the cluster and are up / in; 
otherwise you can end up with a storm of map updates and superfluous rebalancing.


One strategy is to temporarily set osd_crush_initial_weight = 0, so that newly 
added OSDs won’t take any data yet.  Then, when you’re ready, you can set 
their CRUSH weights up to where they otherwise would be, and unset 
osd_crush_initial_weight so you don’t wonder what the heck is going on six 
months down the road.
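
A minimal sketch, assuming a release with the `ceph config` interface (Mimic 
or later); on older releases the same thing can be done via ceph.conf or 
injectargs, and the OSD id / weight below are placeholders:

  ceph config set osd osd_crush_initial_weight 0   # new OSDs come up with CRUSH weight 0
  # ... create the new OSDs ...
  ceph osd crush reweight osd.540 18.19            # bring each new OSD up to its real weight
  ceph config rm osd osd_crush_initial_weight      # remove the override so it doesn't surprise you later

The proper CRUSH weight for a 20TB drive is roughly its size in TiB, i.e. 
about 18.19.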

Another is to add a staging CRUSH root.  If the new OSDs are all on new hosts, 
you can create CRUSH host buckets for them in advance so that when you create 
the OSDs they go there and again won’t immediately take data.  Then you can 
move the host buckets into the production root in quick succession.
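
Roughly like this (the bucket and host names here are made up; your production 
root is probably "default"):

  ceph osd crush add-bucket staging root           # empty root that no CRUSH rule references
  ceph osd crush add-bucket newhost01 host
  ceph osd crush move newhost01 root=staging
  # ... create the OSDs on newhost01; they land under staging and take no data ...
  ceph osd crush move newhost01 root=default       # when ready, move the whole host into production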

Either way, if you do want to add them to the cluster all at once, with HDDs 
you’ll want to limit the rate of backfill so you don’t DoS your clients.  One 
strategy is to leverage pg-upmap with a tool like 
https://gitlab.cern.ch/ceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
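
The usual pattern with that script, as I understand it (check its own notes 
before relying on this sketch), is roughly:

  ceph osd set norebalance          # keep PGs where they are while the new OSDs come in
  # ... add all the OSDs ...
  ./upmap-remapped.py | sh          # emits `ceph osd pg-upmap-items` commands mapping misplaced PGs back
  ceph osd unset norebalance
  # the balancer then removes those upmaps gradually, drip-feeding data onto the new OSDs

You can also throttle backfill directly, e.g. osd_max_backfills = 1 on the 
HDD OSDs (with the mclock scheduler in recent releases the relevant knobs 
differ, so check which apply to your version).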

Note that to use pg-upmap safely, you will need to ensure that your clients 
are all at Luminous or later; in the case of CephFS I *think* that means kernel 
4.13 or later.  I think `ceph features` will give you that information.
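
For example (the output will of course depend on your cluster):

  ceph features                                      # shows the release / feature bits of connected clients
  ceph osd set-require-min-compat-client luminous    # required before upmap entries can be set

The second command will refuse to apply if any connected client reports a 
release older than Luminous (though, per the note at the top of this thread, 
some kernel clients misreport what they support).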

An older method of spreading out the backfill thundering herd was to use a for 
loop to weight up the OSDs in increments of, say, 0.1 at a time, let the 
cluster settle, then repeat.  This strategy results in at least some data 
moving twice, so it’s less efficient.  Similarly you might add, say, one OSD 
per host at a time and let the cluster settle between iterations, which would 
also be less than ideally efficient.
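
That older approach looks something like this (purely illustrative; the OSD id 
range, increment, and final weight are placeholders):

  for osd in $(seq 540 719); do
      ceph osd crush reweight osd.$osd 0.1           # first pass: a small weight on every new OSD
  done
  # wait for backfill to finish / HEALTH_OK, then repeat with 0.2, 0.3, ... up to the full ~18.19

As noted, data placed during the early passes gets moved again later, which is 
one reason the upmap approach above is usually preferable.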

— aad

> On Aug 6, 2024, at 11:08 AM, Fabien Sirjean <fsirj...@eddie.fdn.fr> wrote:
>
> Hello everyone,
>
> We need to add 180 20TB OSDs to our Ceph cluster, which currently consists of 
> 540 OSDs of identical size (replicated size 3).
>
> I'm not sure, though: is it a good idea to add all the OSDs at once? Or is it 
> better to add them gradually?
>
> The idea is to minimize the impact of rebalancing on the performance of 
> CephFS, which is used in production.
>
> Thanks in advance for your opinions and feedback 🙂
>
> Wishing you a great summer,
>
> Fabien
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io