Hi Tyler,

To be honest, we haven't set anything ourselves regarding compaction and
RocksDB. When I check the admin socket with ceph daemon on both the NVMe and
the SSD OSDs, compact-on-start is the default false:
"mon_compact_on_start": "false",
"osd_compact_on_start": "false",

RocksDB is also at its default:
"bluestore_rocksdb_options":
"compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2,max_total_wal_size=1073741824"

Here is one of the 20 events captured during the slow ops:
https://gist.githubusercontent.com/Badb0yBadb0y/30de736f5d2bd6ec48aa7acf0a3caa14/raw/1070acbf82cc8d69efc04e4e0583e7f83bd33b3f/gistfile1.txt
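
(For anyone wanting to capture the same kind of event, something like this
dumps the slow ops an OSD has recorded recently; osd.12 is a placeholder:)

    ceph daemon osd.12 dump_historic_slow_ops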

All of them belong to a bucket doing streaming operations, i.e. continuous
deletes and uploads 24/7.

I can see "throttled" entries in the event, but I still don't understand why
the latency is so high.
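
If it helps, the throttle counters of a single OSD can be inspected with
something along these lines (osd.12 again as a placeholder):

    # look at the bluestore throttle counters of one OSD
    ceph daemon osd.12 perf dump | grep -A8 'throttle-bluestore'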


ty

________________________________
From: Tyler Stachecki <stachecki.ty...@gmail.com>
Sent: Sunday, November 3, 2024 4:07 PM
To: Szabo, Istvan (Agoda) <istvan.sz...@agoda.com>
Cc: Ceph Users <ceph-users@ceph.io>
Subject: Re: [ceph-users] Re: Slow ops during index pool recovery causes 
cluster performance drop to 1%

________________________________

On Sun, Nov 3, 2024 at 1:28 AM Szabo, Istvan (Agoda)
<istvan.sz...@agoda.com> wrote:
> Hi,
>
> I'm updating from Octopus to Quincy, and across the whole cluster, whenever
> index pool recovery kicks off, cluster throughput drops to 1% and slow ops
> come non-stop. Recovery takes 1-2 hours per node.
>
> What I can see is that iowait on the NVMe drives backing the index pool is
> pretty high, even though throughput is below 500 MB/s and IOPS are below
> 5000/sec.
...
> After the update and machine reboot, compaction kicks off and generates
> 30-40 iowait on the node. We set the "noup" flag to keep these OSDs out of
> the cluster until compaction finishes; once iowait is back to 0 after
> compaction, I unset noup so recovery can start, which triggers the issue
> above. If I didn't set noup at all, the impact would be even bigger.

By any chance, are you specifying a value for
bluestore_rocksdb_options in your ceph.conf? The compaction
observation at reboot in particular is odd.
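
(I.e., an entry along these lines in ceph.conf; the value shown here is
purely illustrative:)

    [osd]
    bluestore_rocksdb_options = compression=kNoCompression,max_background_compactions=4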

Tyler
