Hello Nima,
Was the OSD restarted *after* you disabled compact_on_deletion?
From what I saw on a busy RGW cluster, compact_on_deletion triggers
rocksdb compaction much more frequently (by more than a factor of 2), but
this is clearly workload-dependent. I cannot tell whether compacting twice
as often produces the latency increase you have observed.
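If in doubt, you can check what the running daemon reports via its admin
socket; as far as I know these rocksdb settings are only applied when the
OSD opens rocksdb, i.e. at startup, which is why the restart matters. A
quick sketch (run it on the OSD host; it just filters the daemon's
in-memory config for anything mentioning compact_on_deletion):

import json
import subprocess

def compact_on_deletion_values(osd_id=0):
    # 'ceph daemon osd.N config show' dumps the daemon's in-memory config as JSON.
    out = subprocess.run(["ceph", "daemon", f"osd.{osd_id}", "config", "show"],
                         capture_output=True, text=True, check=True).stdout
    cfg = json.loads(out)
    return {k: v for k, v in cfg.items() if "compact_on_deletion" in k}

if __name__ == "__main__":
    print(compact_on_deletion_values(0))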
Cheers,
Enrico
On 5/10/25 20:19, Nima AbolhassanBeigi wrote:
Hi Enrico
We were aware of the change to the OSD scheduler before upgrading and
changed it back to wpq; we have deferred switching to mclock for another
time. However, to make sure recovery wasn't the root cause of the
problem, we also set the norecover flag on the cluster, and I would
expect that if the recovery process were behind this issue, the flag
would have mitigated it.
I think the rocksdb compact_on_deletion option is what caused the
problem, but since we cannot put that theory to the test for the time
being, I'm not entirely sure.
We suspect this because I checked the OSD logs with a Python script that
parses the compaction log lines and reports on all the compaction
operations that happened on the OSD covered by that log file.
I compared the extracted data between two OSDs, one with the option set
to true and the other with it set to false. The number of compactions on
the second OSD was almost half that of the first one, and since we have
previously seen rocksdb compactions affect the latency of our OSDs,
especially the SSD ones, we came to this conclusion.
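For reference, the script is roughly along the lines of the simplified
sketch below; it assumes rocksdb's EVENT_LOG_v1 JSON lines end up in the
OSD log, so the pattern may need adjusting for a different log format.

import json
import re
import sys
from collections import Counter

EVENT_RE = re.compile(r"EVENT_LOG_v1\s+(\{.*\})")

def compaction_stats(path):
    """Count rocksdb compaction events and sum the compacted output bytes."""
    events = Counter()
    output_bytes = 0
    for line in open(path, errors="replace"):
        m = EVENT_RE.search(line)
        if not m:
            continue
        try:
            ev = json.loads(m.group(1))
        except ValueError:
            continue
        name = ev.get("event", "")
        if name.startswith("compaction_"):
            events[name] += 1
        if name == "compaction_finished":
            # field name as emitted by rocksdb's event logger, as far as I recall
            output_bytes += ev.get("total_output_size", 0)
    return events, output_bytes

if __name__ == "__main__":
    for logfile in sys.argv[1:]:
        ev, out_bytes = compaction_stats(logfile)
        print(logfile, dict(ev), f"{out_bytes / 2**30:.1f} GiB written by compaction")

I ran it over log files covering the same period for the two OSDs and
compared the counts.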
The compaction data
(https://docs.google.com/document/d/1qXDAHvJnfnOPN4ZevdGbgHmsMse3UpL0gZfSvVicW6s/edit?usp=sharing)
*Do you think this is possible?*
And to answer your question about the kind of workload: it is quite
diverse. We serve both RBD and S3 clients, and their workloads differ
enough that I cannot describe the behaviour as any single pattern. This
might be a bad idea, and we should probably break the cluster up into
several more special-purpose clusters, but that's a problem for another
day.
If you have any specific questions in mind that I should answer, *don't
hesitate to ask.* Judging by the increase in the number of compactions on
our OSDs after the option was set to true, the workload likely includes a
high rate of delete requests, which I think is what produces the large
number of tombstones.
Regards,
Nima
On Tue, Apr 29, 2025, 2:33 PM Enrico Bocchi <enrico.boc...@cern.ch> wrote:
Hello Nima,
I am not sure whether you have found the root cause of the problem in the
meantime.
Off the top of my head, in case it is useful:
- Quincy changes the default scheduler from wpq to mclock
- The default number of scrubs on each OSD is increased from 1 to 3
- There's a new rocksdb compact_on_deletion option that triggers
compaction more frequently, based on the number of tombstones seen over a
sliding window.
The latter, however, is very workload dependent. What type of
workload
does the cluster serve?
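For what it's worth, the tombstone trigger works roughly as in the
conceptual sketch below (not RocksDB's actual code; the window and
trigger sizes are arbitrary placeholders): a sliding window over the most
recently written table entries counts tombstones, and once the count
exceeds the trigger the file gets marked for early compaction.

from collections import deque

def marks_for_compaction(entries, window=32768, trigger=16384):
    """entries: iterable of booleans, True meaning a deletion (tombstone)."""
    recent = deque(maxlen=window)
    tombstones = 0
    for is_delete in entries:
        if len(recent) == window:
            tombstones -= recent[0]   # oldest entry falls out of the window
        recent.append(is_delete)
        tombstones += is_delete
        if tombstones >= trigger:
            return True               # delete-heavy region: compact this file early
    return False

With a delete-heavy workload many table files trip this check, so
compactions run much more often than with the plain size- and level-based
triggers alone.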
Cheers,
Enrico
On 4/15/25 12:13, Nima AbolhassanBeigi wrote:
> Hello again
> We have done some investigating. Since our first message did not have
> any information or context, let me add some new information.
>
> I looked through our dashboards, in particular the write latency panels
> in the OSD Overview dashboard.
> The latency for write operations has increased significantly, but with
> a distinction: the physical write latency stayed more or less the same,
> while the write process and write prepare latencies increased
> dramatically.
>
> In the end, the RocksDB tunings helped our situation and brought the
> latency more or less back to its pre-upgrade level.
> We changed the default parameters to the following:
> compression=kNoCompression,max_write_buffer_number=128,min_write_buffer_number_to_merge=16,compaction_style=kCompactionStyleLevel,write_buffer_size=8388608,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0
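> For readability, that single string can be split into one option per
> line with a few lines of Python (plain string handling, nothing
> Ceph-specific), e.g.:
>
> import sys
>
> def split_rocksdb_options(opts: str) -> dict:
>     """Split a bluestore_rocksdb_options string into a key -> value dict."""
>     return dict(item.split("=", 1) for item in opts.split(",") if item)
>
> if __name__ == "__main__":
>     # Usage: python3 split_opts.py '<the option string>'
>     for key, value in sorted(split_rocksdb_options(sys.argv[1]).items()):
>         print(f"{key} = {value}")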
> This seems to have done the trick; however, we are still worried that
> we haven't found the root cause of this problem and have only mitigated
> the symptoms.
>
> Another possibly important detail is that the bluefs allocator for our
> OSDs is not the default hybrid but is set to bitmap. In addition, the
> bluestore allocator fragmentation score (for the block device) of these
> OSDs is, for most of them, almost 0.9, which I know is terrible, but
> what I don't understand is why this problem only appeared post-upgrade.
> Could it be the terrible fragmentation scores?
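> A score like this can be read per OSD over its admin socket, roughly as
> in the sketch below; the 'bluestore allocator score block' subcommand
> name is quoted from memory, so double-check it on your version.
>
> import subprocess
>
> def allocator_score(osd_id):
>     """Ask a local OSD for its block allocator fragmentation score (0 = none, 1 = worst)."""
>     out = subprocess.run(
>         ["ceph", "daemon", f"osd.{osd_id}", "bluestore", "allocator", "score", "block"],
>         capture_output=True, text=True, check=True).stdout
>     return out.strip()
>
> if __name__ == "__main__":
>     for osd_id in (0, 1, 2):   # OSD ids hosted on this machine (placeholder list)
>         print(f"osd.{osd_id}:", allocator_score(osd_id))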
>
> Another interesting change was the memory usage of each OSD
> post-upgrade, which dropped by nearly half, while the node's
> cache/buffer memory usage increased.
>
> I've put screenshots of the Grafana panels in this Google Doc:
> <https://docs.google.com/document/d/1DSf4MJoze_BTSWAJWoYyQetA33P6fwpMavYXPdAk5kU/edit?usp=sharing>
> We started our upgrade process on March 6 and continued it on March 7.
> We tuned the RocksDB parameters on March 9, which, as the screenshots
> show, is when the latency drops.
>
> We are still looking for the root cause, and I could use all the help I
> can get to find it; any pointers are much appreciated.
>
> Regards
> Nima
>
> On Mon, Apr 7, 2025 at 3:27 PM Nima AbolhassanBeigi <
> nima.abolhassanbe...@gmail.com> wrote:
>
>> Hi dear ceph community
>>
>> We have encountered an issue with our Ceph cluster after upgrading
>> from v16.2.13 to v17.2.7.
>> The write latency on the OSDs has increased significantly and does not
>> seem to come back down.
>> The average write latency has almost doubled, and this has been the
>> case ever since we upgraded the OSDs.
>>
>> We would appreciate any help in figuring this out.
>>
>> Kind regards
>> Nima
>>
--
Enrico Bocchi
CERN European Laboratory for Particle Physics
IT - Storage & Data Management - General Storage Services
Mailbox: G20500 - Office: 31-2-010
1211 Genève 23
Switzerland
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io