Hello again,

We have done some investigating. Since our first message lacked context, here is some additional information.
I looked through our dashboards, in particular the write latency panels in the OSD Overview dashboard. The latency for write operations had increased significantly. In contrast, the physical write figures stayed more or less the same, while the write process and write prepare operations increased dramatically.

In the end, tuning RocksDB helped: it brought the latency back down to roughly the pre-upgrade level. We changed the default parameters to the following:

compression=kNoCompression,max_write_buffer_number=128,min_write_buffer_number_to_merge=16,compaction_style=kCompactionStyleLevel,write_buffer_size=8388608,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0

This seems to have done the trick. However, we are still worried that we have only mitigated the symptoms and have not found the root cause.

Another possibly important detail is that the bluefs allocator for our OSDs is set to bitmap rather than the default hybrid. In addition, the bluestore allocation block score is almost 0.9 for most of these OSDs. I know that is terrible, but what I don't understand is why the problem only appeared after the upgrade. Could the poor block scores be the cause?

Another interesting change is the memory usage of each OSD, which dropped by nearly half after the upgrade, while the node's memory cache/buffer usage increased.

I've put the Grafana panel screenshots in this Google Doc <https://docs.google.com/document/d/1DSf4MJoze_BTSWAJWoYyQetA33P6fwpMavYXPdAk5kU/edit?usp=sharing> . We ran the upgrade on March 6 and again on March 7. We tuned the RocksDB parameters on March 9, and the screenshots show the latency dropping at that point.

We are still looking for the root cause. Any help in finding it would be much appreciated.
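For anyone who wants to compare the tuned option string above with their own settings, here is a minimal Python sketch that splits it into key/value pairs. The helper name parse_rocksdb_options is made up for this example and is not part of Ceph or RocksDB. The memtable figure at the end is only a rough ceiling, assuming RocksDB's documented meaning of write_buffer_size (size of one memtable) and max_write_buffer_number (maximum memtables held in memory per column family).

```python
# The option string exactly as quoted in this message.
ROCKSDB_OPTIONS = (
    "compression=kNoCompression,max_write_buffer_number=128,"
    "min_write_buffer_number_to_merge=16,compaction_style=kCompactionStyleLevel,"
    "write_buffer_size=8388608,max_background_jobs=4,"
    "level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,"
    "max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,"
    "max_total_wal_size=1073741824,writable_file_max_buffer_size=0"
)


def parse_rocksdb_options(option_string):
    """Split a comma-separated key=value string into a dict of strings.

    Hypothetical helper for illustration; values are kept as strings
    because some of them (e.g. "2MB", "kNoCompression") are not integers.
    """
    options = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        options[key.strip()] = value.strip()
    return options


opts = parse_rocksdb_options(ROCKSDB_OPTIONS)

# Rough ceiling on memtable memory per column family (assumption, see above).
memtable_ceiling = int(opts["write_buffer_size"]) * int(opts["max_write_buffer_number"])

if __name__ == "__main__":
    for key in sorted(opts):
        print(f"{key} = {opts[key]}")
    print(f"memtable ceiling per column family: {memtable_ceiling // 2**20} MiB")
```

The resulting dict can be diffed against the output of the same function applied to another cluster's option string to see which keys differ.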
Regards
Nima

On Mon, Apr 7, 2025 at 3:27 PM Nima AbolhassanBeigi <nima.abolhassanbe...@gmail.com> wrote:

> Hi dear ceph community
>
> We have encountered an issue with our ceph cluster after upgrading from
> v16.2.13 to v17.2.7.
> The issue is that the write latency on OSDs has increased significantly
> and doesn't seem to plummet back down.
> The average write latency has almost doubled, and this has happened since
> we upgraded the OSDs.
>
> If anybody could help figure this out.
>
> Kind regards
> Nima
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io