One more question: what's the output of 'ceph config get osd osd_max_backfills' after setting osd_max_backfills? It looks like ceph-conf might be showing the wrong configuration values.
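For context, a quick sketch of where each tool reads from (osd.0 below is just an arbitrary example daemon id): ceph-conf --show-config only merges the compiled-in defaults with the local ceph.conf, while 'ceph config set/get' talks to the mon config database, so the two can legitimately disagree.

```
# Value stored in the mon config database (what 'ceph config set' wrote):
ceph config get osd osd_max_backfills

# Value a running OSD is actually using:
ceph config show osd.0 | grep osd_max_backfills

# If I recall correctly, with the mClock scheduler (the Reef default) the
# backfill/recovery limits are managed by the scheduler and manual changes
# are ignored unless the override is enabled first:
ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 10
```

If 'ceph config get' already returns 10 while ceph-conf still prints 1, the setting did persist and only the reporting differs.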
Best,
Laimis J.

> On 4 Jan 2025, at 18:05, Laimis Juzeliūnas <laimis.juzeliu...@oxylabs.io> wrote:
> 
> Hello Bruno,
> 
> Interesting case, a few observations.
> 
> What's the average size of your PGs?
> Judging from the ceph status you have 1394 PGs in total and 696 TiB of
> used storage, that's roughly 500 GB per PG if I'm not mistaken.
> With the backfilling limits this results in a lot of time spent per
> single PG due to its size. You could try increasing their number in the
> pools to have lighter placement groups.
> 
> Are you using mClock? If yes, you can try setting the profile to
> prioritise recovery operations with 'ceph config set osd
> osd_mclock_profile high_recovery_ops'.
> 
> The max backfills configuration is an interesting one - it should
> persist. What happens if you set it through the Ceph UI?
> 
> In general it looks like the balancer might be “fighting” with the
> manual OSD balancing.
> You could try turning it off and doing the balancing yourself (this
> might be helpful: https://github.com/laimis9133/plankton-swarm).
> 
> Also, probably known already, but keep in mind that erasure-coded pools
> tend to be on the slower side when it comes to any data movement, due
> to the additional operations needed.
> 
> 
> Best,
> Laimis J.
> 
> 
>> On 4 Jan 2025, at 13:18, bruno.pessa...@gmail.com wrote:
>> 
>> Hi everyone. I'm still learning how to run Ceph properly in
>> production. I have a cluster (Reef 18.2.4) with 10 nodes (8 x 15 TB
>> NVMe drives each). There are 2 prod pools, one for RGW (3x replica)
>> and one for CephFS (EC 8k2m). It was all fine, but once users started
>> storing more data I began seeing:
>> 1. A very high number of misplaced PGs.
>> 2. OSDs very unbalanced and getting 90% full.
>> ```
>> ceph -s
>> 
>>   cluster:
>>     id:     7805xxxe-6ba7-11ef-9cda-0xxxcxxx0
>>     health: HEALTH_WARN
>>             Low space hindering backfill (add storage if this doesn't resolve itself): 195 pgs backfill_toofull
>>             150 pgs not deep-scrubbed in time
>>             150 pgs not scrubbed in time
>> 
>>   services:
>>     mon: 5 daemons, quorum host01,host02,host03,host04,host05 (age 7w)
>>     mgr: host01.bwqkna(active, since 7w), standbys: host02.dycdqe
>>     mds: 5/5 daemons up, 6 standby
>>     osd: 80 osds: 80 up (since 7w), 80 in (since 4M); 323 remapped pgs
>>     rgw: 30 daemons active (10 hosts, 1 zones)
>> 
>>   data:
>>     volumes: 1/1 healthy
>>     pools:   11 pools, 1394 pgs
>>     objects: 159.65M objects, 279 TiB
>>     usage:   696 TiB used, 421 TiB / 1.1 PiB avail
>>     pgs:     230137879/647342099 objects misplaced (35.551%)
>>              1033 active+clean
>>              180  active+remapped+backfill_toofull
>>              123  active+remapped+backfill_wait
>>              28   active+clean+scrubbing
>>              15   active+remapped+backfill_wait+backfill_toofull
>>              10   active+clean+scrubbing+deep
>>              5    active+remapped+backfilling
>> 
>>   io:
>>     client:   668 MiB/s rd, 11 MiB/s wr, 1.22k op/s rd, 1.15k op/s wr
>>     recovery: 479 MiB/s, 283 objects/s
>> 
>>   progress:
>>     Global Recovery Event (5w)
>>       [=====================.......] (remaining: 11d)
>> ```
>> 
>> I've been trying to rebalance the OSDs manually, since the balancer
>> does not work due to:
>> ```
>> "optimize_result": "Too many objects (0.355160 > 0.050000) are misplaced; try again later",
>> ```
>> I manually re-weighted the top 10 most used OSDs and the number of
>> misplaced objects is going down very slowly. I think it could take
>> many weeks at that rate.
>> There's almost 40% of total free space, but the RGW pool is almost
>> full at ~94%, I think because of the OSD imbalance.
>> ```
>> ceph df
>> --- RAW STORAGE ---
>> CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
>> ssd    1.1 PiB  421 TiB  697 TiB   697 TiB      62.34
>> TOTAL  1.1 PiB  421 TiB  697 TiB   697 TiB      62.34
>> 
>> --- POOLS ---
>> POOL                        ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
>> .mgr                         1     1   69 MiB       15  207 MiB      0     13 TiB
>> .nfs                         2    32  172 KiB       43  574 KiB      0     13 TiB
>> .rgw.root                    3    32  2.7 KiB        6   88 KiB      0     13 TiB
>> default.rgw.log              4    32  2.1 MiB      209  7.0 MiB      0     13 TiB
>> default.rgw.control          5    32      0 B        8      0 B      0     13 TiB
>> default.rgw.meta             6    32   97 KiB      280  3.5 MiB      0     13 TiB
>> default.rgw.buckets.index    7    32   16 GiB    2.41k   47 GiB   0.11     13 TiB
>> default.rgw.buckets.data    10  1024  197 TiB  133.75M  592 TiB  93.69     13 TiB
>> default.rgw.buckets.non-ec  11    32   78 MiB    1.43M   17 GiB   0.04     13 TiB
>> cephfs.cephfs01.data        12   144   83 TiB   23.99M  103 TiB  72.18     32 TiB
>> cephfs.cephfs01.metadata    13     1  952 MiB  483.14k  3.7 GiB      0     10 TiB
>> ```
>> 
>> I also tried changing the following, but it does not seem to persist:
>> ```
>> # ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
>> osd_max_backfills = 1
>> osd_recovery_max_active = 0
>> osd_recovery_max_active_hdd = 3
>> osd_recovery_max_active_ssd = 10
>> osd_recovery_op_priority = 3
>> # ceph config set osd osd_max_backfills 10
>> # ceph-conf --show-config | egrep "osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
>> osd_max_backfills = 1
>> osd_recovery_max_active = 0
>> osd_recovery_max_active_hdd = 3
>> osd_recovery_max_active_ssd = 10
>> osd_recovery_op_priority = 3
>> ```
>> 
>> 1. Why did I end up with so many misplaced PGs when there were no
>> changes to the cluster (number of OSDs, hosts, etc.)?
>> 2. Is it OK to change target_max_misplaced_ratio to something higher
>> than .05 so the balancer would work and I wouldn't have to constantly
>> rebalance the OSDs manually?
>> 3. Is there a way to speed up the rebalance?
>> 4. Any other recommendations that could help make my cluster healthy
>> again?
>> 
>> Thank you!
>> 
>> Bruno
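PS on questions 2 and 3, a rough sketch of the knobs involved (names and defaults from memory, so please double-check them against the Reef docs before applying):

```
# Let the balancer run despite the current misplaced level
# (target_max_misplaced_ratio defaults to 0.05, i.e. 5%):
ceph config set mgr target_max_misplaced_ratio 0.40

# Temporarily favour recovery/backfill over client IO via the mClock profile:
ceph config set osd osd_mclock_profile high_recovery_ops

# ...and revert once the cluster has settled (balanced is the Reef default):
ceph config set osd osd_mclock_profile balanced
```

Raising the ratio only changes how much data movement the balancer is allowed to schedule at once; with PGs already in backfill_toofull it is still worth relieving the fullest OSDs first.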