[ceph-users] Re: write latency increase after upgrade from pacific to quincy

Nima AbolhassanBeigi Tue, 15 Apr 2025 03:14:51 -0700

Hello again
We have investigated further. Since our first message lacked context, here
are more details.

I looked through the write latency panels in the OSD Overview dashboard.
The overall latency for write operations increased significantly. The
physical write latency stayed more or less the same, whereas the write
process and write prepare latencies increased dramatically.

In the end, tuning RocksDB helped and brought the latency back down to
roughly its pre-upgrade level.
We changed the RocksDB options from their defaults to the following:
compression=kNoCompression,max_write_buffer_number=128,min_write_buffer_number_to_merge=16,compaction_style=kCompactionStyleLevel,write_buffer_size=8388608,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0
This seems to have done the trick. However, we are still worried that we
have only mitigated the symptoms and have not found the root cause of the
problem.
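
For anyone who wants to compare these options against their own, here is a
minimal Python sketch that parses a flat option string like the one above,
works out the memtable sizes it implies, and diffs it against a second
string. The BASELINE value is a made-up placeholder, not the actual Pacific
or Quincy default, and the helper names are hypothetical. The sketch assumes
a plain comma-separated key=value list with no nested options.

```python
def parse_options(opts: str) -> dict[str, str]:
    """Split a flat 'key=value,key=value' string into a dict."""
    parsed = {}
    for item in opts.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"option without '=': {item!r}")
        parsed[key.strip()] = value.strip()
    return parsed


def diff_options(old: dict[str, str], new: dict[str, str]) -> dict[str, tuple]:
    """Return {key: (old_value, new_value)} for every key that differs."""
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    }


# The tuned options quoted in this message.
TUNED = (
    "compression=kNoCompression,max_write_buffer_number=128,"
    "min_write_buffer_number_to_merge=16,"
    "compaction_style=kCompactionStyleLevel,write_buffer_size=8388608,"
    "max_background_jobs=4,level0_file_num_compaction_trigger=8,"
    "max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,"
    "compaction_readahead_size=2MB,max_total_wal_size=1073741824,"
    "writable_file_max_buffer_size=0"
)

# Placeholder only: replace with the option string from your own cluster.
BASELINE = "compression=kNoCompression,max_write_buffer_number=4"

MIB = 1024 * 1024
tuned = parse_options(TUNED)

# Upper bound on memtable memory: buffer size times number of buffers.
memtable_mib = (
    int(tuned["write_buffer_size"]) * int(tuned["max_write_buffer_number"]) // MIB
)
# Amount of memtable data merged together before a flush.
merge_mib = (
    int(tuned["write_buffer_size"])
    * int(tuned["min_write_buffer_number_to_merge"])
    // MIB
)

print(f"max memtable memory: {memtable_mib} MiB")
print(f"merged per flush:    {merge_mib} MiB")
for key, (old, new) in diff_options(parse_options(BASELINE), tuned).items():
    print(f"{key}: {old} -> {new}")
```

With these values the memtable ceiling works out to 8 MiB x 128 = 1024 MiB,
the same size as max_total_wal_size. RocksDB applies the memtable options
per column family, so the total for an OSD could be higher than that.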

Another possibly important detail is that the bluefs allocator for our
OSDs is not the default hybrid; it is set to bitmap. In addition, the
bluestore allocation block score of most of these OSDs is almost 0.9,
which I know is terrible, but I don't understand why the problem would
appear only after the upgrade. Could the terrible block scores be the
cause?

Another interesting change was the memory usage of each OSD post-upgrade,
which dropped by nearly half, while the node's memory cache/buffer usage
increased.

I've put the Grafana panel screenshots in this Google Doc:
<https://docs.google.com/document/d/1DSf4MJoze_BTSWAJWoYyQetA33P6fwpMavYXPdAk5kU/edit?usp=sharing>
We ran the upgrade process on March 6 and again on March 7. We tuned the
RocksDB parameters on March 9, which is where the screenshots show the
latency dropping.

We are still looking for the root cause, and any help in finding it would
be much appreciated.

Regards
Nima

On Mon, Apr 7, 2025 at 3:27 PM Nima AbolhassanBeigi <
nima.abolhassanbe...@gmail.com> wrote:

> Hi dear ceph community
>
> We have encountered an issue with our ceph cluster after upgrading from
> v16.2.13 to v17.2.7.
> The issue is that the write latency on OSDs has increased significantly
> and doesn't seem to plummet back down.
> The average write latency has almost doubled, and this has happened since
> we upgraded the OSDs.
>
> If anybody could help figure this out.
>
> Kind regards
> Nima
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
