I don't know how many pools you have in your cluster, but ~37 PGs per OSD seems quite low, especially with NVMes. You could try increasing the number of PGs on this pool, and maybe on the data pool as well. I don't know how many IOPS this bucket receives, but the fact that the index is spread over only 11 RADOS objects could be a bottleneck with very intensive PUT/DELETE workloads. Maybe someone can confirm that.
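For reference, this is roughly how I'd check the current numbers before changing anything. The pool name, bucket name and target pg_num below are just placeholders to adapt to your setup:

  # current PG count and autoscaler view of the index pool
  ceph osd pool get default.rgw.buckets.index pg_num
  ceph osd pool autoscale-status

  # shard count and objects per shard for the bucket
  radosgw-admin bucket stats --bucket=<bucket>
  radosgw-admin bucket limit check

  # raise pg_num on the index pool (example value only)
  ceph osd pool set default.rgw.buckets.index pg_num 256

Keep in mind that raising pg_num triggers PG splitting and data movement, so better do it outside peak hours.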
Also check for 'tombstones' and this topic [1] in particular, especially if the bucket receives a lot of PUT/DELETE operations in real time.

Regards,
Frédéric.

[1] https://www.spinics.net/lists/ceph-users/msg81519.html

----- On 14 Nov 24, at 10:55, Istvan Szabo, Agoda <istvan.sz...@agoda.com> wrote:

> 156x NVMe OSDs
> Sharding I do at 100000 objects per shard. The default is 11, but they don't have 1.1M objects, so this bucket is still at the default.
> This is the tree: https://gist.github.com/Badb0yBadb0y/835a45f8e82ddfcbbd82cf28126da728
>
> From: Frédéric Nass <frederic.n...@univ-lorraine.fr>
> Sent: Thursday, November 14, 2024 4:28 PM
> To: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com>
> Cc: Ceph Users <ceph-users@ceph.io>
> Subject: Re: [ceph-users] Re: Slow ops during index pool recovery causes cluster performance drop to 1%
>
> Hi Istvan,
>
>> The only thing I have in mind is to increase the replica size from 3 to 5 so it could tolerate more OSD slowness, with size 5 min_size 2.
>
> I wouldn't do that; it will only get worse, as every write IO will have to wait for 2 more OSDs to ACK, and the slow ops you've seen refer to write IOs (looping over "waiting for rw locks").
> How many NVMe OSDs does this 2048-PG RGW index pool have?
> Have you checked the num_shards of the bucket that is receiving continuous deletes and uploads 24/7?
>
> Regards,
> Frédéric.
>
> ----- On 14 Nov 24, at 7:16, Istvan Szabo, Agoda <istvan.sz...@agoda.com> wrote:
>
>> Hi,
>>
>> This issue existed for us before the update as well; unluckily it's not gone with the update 😕
>> We don't use HDD, only SSD and NVMe, and the index pool is specifically on NVMe.
>> Yes, I tried setting the value divided by 4, no luck 🙁
>> Based on the metadata it seems okay. When I created the OSDs I defined the device class as nvme (ceph-volume lvm batch --bluestore --yes --osds-per-device 4 --crush-device-class nvme /dev/sdo), and in the osd tree it is nvme, but I guess the metadata just reports the default: if I hadn't defined anything, it would have been ssd.
>>
>> "bluestore_bdev_type": "ssd",
>> "default_device_class": "ssd",
>> "osd_objectstore": "bluestore",
>> "rotational": "0"
>>
>> The only thing I have in mind is to increase the replica size from 3 to 5 so it could tolerate more OSD slowness, with size 5 min_size 2.
>>
>> Again, thank you for your ideas.
>>
>> From: Frédéric Nass <frederic.n...@univ-lorraine.fr>
>> Sent: Wednesday, November 13, 2024 4:32 PM
>> To: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com>
>> Cc: Tyler Stachecki <stachecki.ty...@gmail.com>; Ceph Users <ceph-users@ceph.io>
>> Subject: Re: [ceph-users] Re: Slow ops during index pool recovery causes cluster performance drop to 1%
>>
>> Hi Istvan,
>>
>> Changing the scheduler to 'wpq' could help you quickly identify whether the issue you're facing is related to 'mclock' or not.
>> If you stick with mclock, then depending on the rotational status of each OSD (ceph osd metadata N | jq -r .rotational), you should set each OSD's spec (osd_mclock_max_capacity_iops_hdd if rotational=1, or osd_mclock_max_capacity_iops_ssd if rotational=0) to the value you calculated, instead of letting the OSD try to figure out and set a value that may not be accurate, especially with multiple OSDs sharing the same underlying device.
>> Have you tried setting each OSD's max capacity (ceph config set osd.N osd_mclock_max_capacity_iops_[hdd,ssd])?
>> Also, make sure the rotational status reported for each OSD by ceph osd metadata osd.N actually matches the underlying hardware type. This is not always the case, depending on how the disks are connected. If it doesn't match, you might have to force it on boot with a udev rule.
>>
>> Regards,
>> Frédéric.
>>
>> ----- On 13 Nov 24, at 9:43, Istvan Szabo, Agoda <istvan.sz...@agoda.com> wrote:
>>
>>> Hi Frédéric,
>>>
>>> Thank you for the ideas.
>>> The cluster is half updated, but the OSDs that are updated have:
>>>
>>> "osd_op_queue": "mclock_scheduler",
>>> "osd_op_queue_cut_off": "high",
>>>
>>> I'd say the value Ceph calculates when it benchmarks is too high. We have 4 OSDs on 1 NVMe, and it sets the value from the last of the 4 OSDs on that NVMe, which is the highest: 36490.280637
>>> However, I already changed this value (divided by 4) on another fully upgraded cluster and it didn't help.
>>> Buffered IO has been turned on since Octopus; I didn't change it.
>>> For a quick check, that specific OSD seems in line with what you describe:
>>>
>>> 1 : device size 0x6fc7c00000 : own 0x[40000~4e00000,12f70000~2252d0000,23b060000~21a230000,4583e0000~20f890000,6b1630000~200000000,35a78f0000~478a20000] = 0xccc5b0000 : using 0xa60ed0000(42 GiB) : bluestore has 0x62e79f0000(396 GiB) available
>>> wal_total:0, db_total:456087987814, slow_total:0
>>>
>>> Istvan
>>>
>>> From: Frédéric Nass <frederic.n...@univ-lorraine.fr>
>>> Sent: Monday, November 4, 2024 4:14 PM
>>> To: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com>
>>> Cc: Tyler Stachecki <stachecki.ty...@gmail.com>; Ceph Users <ceph-users@ceph.io>
>>> Subject: Re: [ceph-users] Re: Slow ops during index pool recovery causes cluster performance drop to 1%
>>>
>>> Hi Istvan,
>>>
>>> Is your upgraded cluster using the wpq or mclock scheduler? (ceph tell osd.X config show | grep osd_op_queue)
>>> Maybe your OSDs set their osd_mclock_max_capacity_iops_* capacity too low on start (ceph config dump | grep osd_mclock_max_capacity_iops), limiting their performance.
>>> You might want to raise these figures if set, or go back to wpq to give yourself enough time to understand how mclock works.
>>> Also, check bluefs_buffered_io, as its default value changed over time. It had better be 'true' now (ceph tell osd.X config show | grep bluefs_buffered_io).
>>> Also, check for any overspilling, as there has been a bug in the past with overspilling not being reported in ceph status (ceph tell osd.X bluefs stats; the SLOW line should show 0 Bytes and 0 FILES).
>>>
>>> Regards,
>>> Frédéric.
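To make the mclock-related checks above concrete, something along these lines should work. The OSD id (osd.0) and the IOPS figure are placeholders to adapt to your own hardware:

  # which scheduler each OSD runs and the current mclock capacity estimates
  ceph tell osd.0 config show | grep -E 'osd_op_queue|osd_mclock_max_capacity_iops'
  ceph config dump | grep osd_mclock_max_capacity_iops

  # confirm the reported rotational status matches the hardware
  ceph osd metadata 0 | jq -r .rotational

  # pin a saner per-OSD capacity, e.g. the benchmarked value divided by the
  # number of OSDs sharing the device (example figure only)
  ceph config set osd.0 osd_mclock_max_capacity_iops_ssd 9000

  # or fall back to wpq cluster-wide; this needs an OSD restart to take effect
  ceph config set osd osd_op_queue wpq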
>>> ----- On 4 Nov 24, at 5:24, Istvan Szabo, Agoda <istvan.sz...@agoda.com> wrote:
>>>
>>> > Hi Tyler,
>>> >
>>> > To be honest, we don't have anything set ourselves regarding compaction and RocksDB.
>>> > When I check the socket with ceph daemon, both the NVMe and SSD OSDs have the default of false for compaction on start:
>>> >
>>> > "mon_compact_on_start": "false",
>>> > "osd_compact_on_start": "false",
>>> >
>>> > RocksDB is also on the default:
>>> >
>>> > "bluestore_rocksdb_options": "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,max_total_wal_size=1073741824"
>>> >
>>> > This is 1 event out of the 20 during the slow ops:
>>> > https://gist.githubusercontent.com/Badb0yBadb0y/30de736f5d2bd6ec48aa7acf0a3caa14/raw/1070acbf82cc8d69efc04e4e0583e7f83bd33b3f/gistfile1.txt
>>> >
>>> > All of them belong to a bucket doing streaming operations, which means continuous deletes and uploads 24/7.
>>> > I can see throttle options, but I still don't understand why the latency is so high.
>>> >
>>> > ty
>>> >
>>> > From: Tyler Stachecki <stachecki.ty...@gmail.com>
>>> > Sent: Sunday, November 3, 2024 4:07 PM
>>> > To: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com>
>>> > Cc: Ceph Users <ceph-users@ceph.io>
>>> > Subject: Re: [ceph-users] Re: Slow ops during index pool recovery causes cluster performance drop to 1%
>>> >
>>> > On Sun, Nov 3, 2024 at 1:28 AM Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> wrote:
>>> >> Hi,
>>> >>
>>> >> I'm updating from Octopus to Quincy, and across our cluster, whenever index pool recovery kicks off, cluster operation drops to 1% and slow ops come non-stop. The recovery takes 1-2 hours per node.
>>> >> What I can see is that the iowait on the NVMe drives belonging to the index pool is pretty high, yet the throughput is less than 500 MB/s and the IOPS are less than 5000/sec.
>>> > ...
>>> >> After the update and machine reboot, compaction kicks off, which generates 30-40 iowait on the node. We use the "noup" flag to keep these OSDs out of the cluster until compaction has finished; then, once iowait is back to 0 after compaction, I unset noup so recovery can start, which causes the above issue. If I didn't set noup it would cause an even bigger issue.
>>> >
>>> > By any chance, are you specifying a value for bluestore_rocksdb_options in your ceph.conf? The compaction observation at reboot in particular is odd.
>>> >
>>> > Tyler
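If it helps, here is a rough way to confirm the options actually in effect on a running OSD and to trigger a manual RocksDB compaction from the OSD host via the admin socket (useful while the OSD is kept out with noup); the OSD id is just a placeholder:

  # effective options on a running OSD (any ceph.conf override would show up here)
  ceph daemon osd.0 config show | grep -E 'bluestore_rocksdb_options|compact_on_start'

  # compact the OSD's RocksDB manually before letting it take traffic
  ceph daemon osd.0 compact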
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io