Wow, lots of good information there!  The hosts are the failure domain, so
normally we just do one host at a time as you suggested.  Good point on the
rebalance waiting; it's usually just me waiting for someone to go replace the
failed drive.  In the case of swapping out old working drives for new ones,
though, I wanted to be more cautious, since we have been burned in the past by
data loss from unexpected multi-disk failures during the data movement.
Luckily, it's been a small amount of data loss ( like one or two PGs ).  The
older drives are from 2018 or so and we are not tight on space, thankfully.
We do plan to replace all the 4TB drives with 20TB+, and we're trying to
decide on a vendor because we have had a lot of WD drives failing right out of
the box lately.  Seagate drives haven't proven much better.  We are keeping
the WD RMA group busy.

We do have enough space on the two clusters to drain an entire host for disk
swaps; I was just trying to avoid an imbalance ( but we should be fine assuming
it spreads that load evenly ).  I will look into the balancer bits, as we have
it turned on in both clusters.  It's OK but not the best ( I noticed someone
mentioned a custom script they were using on the mailing list; I wonder if
that's the same one ).  I don't have any issue with disabling it while we do
this, though ( good idea! ).  Your comments on the CRUSH behavior jolted me
back into reality, so thanks, so true.  Also, I had no idea about the
mon_max_pg_per_osd setting!

We have about 170 drives/OSDs per cluster, and I recently rightsized the PG
counts in all clusters.  The RGWs are VMs, so we can spin up more if need be
( will keep an eye on that ).  We had settled on 4 for redundancy, but in
times of high load they sometimes fall over ( usually it's a cleanup script
causing it, though ).  Funny you mention the bulk flag; I just enabled it
before I rightsized the PG counts!  I was doing some tuning since we finally
managed to get everything onto the latest version of Ceph.  Oddly, the
autoscaler was not doing its job, and I'm not a huge fan of trying to
calculate a moving target anyway.
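
I will probably start by double-checking what the autoscaler thinks it should
be doing and whether the bulk flag actually stuck, something like the
following ( just a sketch, against whichever pools matter ):

    # One row per pool: current vs. target PG counts, the BULK column, and
    # whether autoscaling is on/off/warn for that pool.
    ceph osd pool autoscale-status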

I have some homework, so thanks again for the feedback; I really appreciate
everyone's time here on the list!  Lots of good information.  Maybe one of
these days I will make it to a Ceph conference.

Regards,
Brent

-----Original Message-----
From: Anthony D'Atri <anthony.da...@gmail.com> 
Sent: Monday, November 11, 2024 8:41 PM
To: bre...@cfl.rr.com
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Cephadm Drive upgrade process


> 1.    Pulled failed drive ( after troubleshooting of course )
> 
> 2.    Cephadm gui - find OSD, purge osd
> 3.    Wait for rebalance
> 4.    Insert new drive ( let cluster rebalance after it automatically adds
> the drive as an OSD ) ( yes, we have auto-add on in the clusters )


> I imagine with an existing good drive, we would use delete instead of 
> purge, but the process would be the similar, except the drive swap 
> would happen after the data was moved.

You don’t have to wait for the rebalance / backfill / recovery, at least if you 
do one drive (or failure domain) at a time.

In fact you can be more efficient by not waiting, as deploying the new OSD will 
short-circuit some of that data movement from the deletion.
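
For reference, the cephadm CLI equivalent of the GUI purge / replace flow
looks roughly like the below.  Treat it as a sketch: osd.12 is just a stand-in
id, and check ceph orch osd rm --help on your release before relying on the
flags.

    # Schedule removal of the OSD.  --replace marks it "destroyed" and keeps
    # the id reserved so the new drive re-uses it; --zap wipes the old device.
    ceph orch osd rm 12 --replace --zap

    # Watch the removal queue and overall cluster state.
    ceph orch osd rm status
    ceph -s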

>  Would the replace flag ( or keep OSD option in gui ) allow us to 
> avoid the initial rebalance by backfilling the new drive with the old drives 
> content?

Only if you set the new drive’s CRUSH weight artificially low, to match the old 
drive’s weight exactly.  But when you weight it up fully, data will move anyway.
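
If you did want to try that anyway, it would look something like this ( the
weights below are only the typical TiB values for 4TB and 20TB drives; use
whatever ceph osd tree shows for yours ):

    # Pin the new OSD at the old 4TB drive's CRUSH weight...
    ceph osd crush reweight osd.12 3.63869
    # ...and later, when you're ready to absorb the churn, raise it to the
    # drive's real size.
    ceph osd crush reweight osd.12 18.19040

But as noted, that only defers the big data movement to the second step.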

> It would be nice if we could just copy the content to the new drive and go 
> from there.

I get your drift, but there’s a nuance.  Because of how CRUSH works, the data 
that the 20TB OSD will eventually hold will not be a proper superset of what’s 
on the 4TB OSD today.  Data will also shuffle on other OSDs.
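
You can watch that shuffle per OSD as it happens; the PGS column is the one to
keep an eye on, and the new 20TB OSDs should settle at very roughly 5x the PG
count of the 4TB ones:

    # Utilization and PG count per OSD, grouped by CRUSH hierarchy.
    ceph osd df tree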

Be careful that you only delete / destroy OSDs in a single failure domain at a 
time, and wait for full recovery before proceeding to the next failure domain.  
 If you are short on capacity, you might want to do a small number of drives in 
one failure domain, wait for recovery, then move to the next failure domain, as 
you will only realize additional cluster capacity once you’ve added CRUSH 
weight to at least 3 failure domains.
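
A simple gate between failure domains is to wait until every PG reports
active+clean again, e.g.:

    # Don't start on the next failure domain until this shows nothing but
    # active+clean (and ceph -s looks healthy again).
    ceph pg stat
    ceph -s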

> We would like to avoid lots of read/write cluster recovery activity if 
> possible, since we could be replacing 40+ drives in each cluster.

Part of the false economy of HDDs :-/  But again, attend to your failure 
domains.  If you are only replacing *some* of the smaller drives, spread that 
across failure domains or you won't gain any actual capacity.  And be prepared 
for the larger drives to get a proportionally larger fraction of the workload, 
which can be somewhat mitigated with primary affinity, but that's a bit of an 
advanced topic.
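
If you do experiment with primary affinity, the knob is per-OSD on a 0-1
scale, e.g. ( osd id is a placeholder ):

    # Make a big 20TB OSD half as likely to be chosen as a PG primary,
    # nudging read traffic toward replicas on the smaller drives.
    ceph osd primary-affinity osd.42 0.5

    # Current values show up in the PRI-AFF column of:
    ceph osd tree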

Related advice:  as you add OSDs that are 5x the size of existing OSDs, you run 
the risk of hitting mon_max_pg_per_osd on the larger OSDs.  This defaults to 
250.  I suggest setting it to 1000 before starting this project, to avoid 
larger OSDs that won’t activate.

https://github.com/ceph/ceph/pull/60492
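
Raising it ahead of time is a one-liner:

    # Lift the per-OSD PG limit cluster-wide before the 20TB OSDs go in; the
    # 250 default can leave oversized OSDs with PGs that refuse to activate.
    ceph config set global mon_max_pg_per_osd 1000

    # Verify what the daemons will see.
    ceph config get mon mon_max_pg_per_osd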

Also, you might temporarily disable the balancer and use pgremapper or 
https://gitlab.cern.ch/ceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
to minimize extra backfill and control its rate.  You would use the tool to 
effectively freeze all PG mappings, then destroy / redeploy as many OSDs as you 
like *within a single failure domain*, and gradually remove the manual upmaps 
at a rate informed by how much backfill your spinners can handle.  There are 
lots of articles and list posts about this strategy.  This lets you leapfrog 
the transient churn as multiple OSDs are removed / added, and control the 
thundering herd of recovery that can DoS spinners.
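
From memory, the rough shape of that workflow is below; treat it as a sketch
and check the upmap-remapped / pgremapper docs first, since the exact
invocation has varied over time:

    # Stop the balancer so it doesn't fight the manual upmaps.
    ceph balancer off

    # After destroying / redeploying OSDs within one failure domain, the
    # script prints ceph osd pg-upmap-items commands that pin each remapped
    # PG to where its data currently lives.  Review the output, then apply:
    ./upmap-remapped.py | sh

    # Later, re-enable the balancer (or remove upmap entries yourself in
    # batches) so data drains onto the new OSDs at a pace the spinners can
    # handle.
    ceph balancer on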

If you’re using the pg autoscaler, I might ensure that all affected pools have 
the ‘bulk’ flag set in advance, so that you don’t have PG splitting / merging 
and backfill/recovery going on at the same time.
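
Setting the flag is per pool, e.g. ( pool name below is only an example ):

    # Tell the autoscaler this pool will hold the bulk of the data, so it
    # provisions pg_num up front rather than splitting later under load.
    ceph osd pool set default.rgw.buckets.data bulk true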


> US Production(HDD): Reef 18.2.4 Cephadm with 11 osd servers, 5 mons, 4 rgw,
> 2 iscsigw, 2 mds
> 
> UK Production(HDD): Reef 18.2.4 Cephadm with 20 osd servers, 5 mons, 4 rgw,
> 2 iscsigw, 2 mds
> 
> US Production(SSD): Reef 18.2.4 Cephadm with 6 osd servers, 5 mons, 4 rgw,
> 2 mds
> 
> UK Production(SSD): Reef 18.2.4 Cephadm with 6 osd servers, 5 mons, 4 rgw,
> 2 mds

I suspect FWIW that you would benefit from running an RGW on more servers - any 
of them that have enough CPU / RAM.
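
With cephadm that's just a placement change on the RGW service spec, along the
lines of ( service name and host label here are made up; adjust to your spec ):

    # Run one RGW daemon on every host carrying the 'rgw' label, instead of a
    # fixed set of four VMs.
    ceph orch apply rgw main --placement="label:rgw"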


_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
