Hi Enrico
We were aware of the change to the OSD scheduler before upgrading and
changed it back to wpq; we have put off switching to mclock for another
time. However, to make sure recovery wasn't the root cause of the problem,
we set the norecover flag on the cluster, and if the recovery process had
been the reason behind this issue, that should have mitigated it.

I think the rocksdb compact_on_deletion option is what caused the problem,
but since we can't put that theory to the test for the time being, I'm not
entirely sure.
We suspect it because I checked the OSD logs with a Python script that
parses the RocksDB compaction entries and summarizes all the compaction
operations recorded in a given OSD log file (a rough sketch of the idea is
included below).
I compared the extracted data between two OSDs, one with the option set to
true and the other with it set to false. The OSD with the option disabled
saw almost half as many compactions as the one with it enabled, and since
we have seen rocksdb compactions hurt the latency of our OSDs before,
especially the SSD ones, we came to this conclusion.
The compaction data:
https://docs.google.com/document/d/1qXDAHvJnfnOPN4ZevdGbgHmsMse3UpL0gZfSvVicW6s/edit?usp=sharing
*Do you think this is possible?*
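
For reference, here is a minimal sketch of the kind of parsing the script
does (our real script is longer; this sketch assumes the RocksDB
EVENT_LOG_v1 JSON entries, e.g. "compaction_finished", show up in the OSD
log at our debug_rocksdb level, and that the field names follow the usual
RocksDB event log):

import json
import re
import sys
from collections import Counter

EVENT_RE = re.compile(r'EVENT_LOG_v1\s+(\{.*\})')

def parse(path):
    """Collect the compaction_finished events from one OSD log file."""
    events = []
    with open(path, errors="replace") as f:
        for line in f:
            m = EVENT_RE.search(line)
            if not m:
                continue
            try:
                ev = json.loads(m.group(1))
            except json.JSONDecodeError:
                continue
            if ev.get("event") == "compaction_finished":
                events.append(ev)
    return events

def summarize(events):
    per_level = Counter(ev.get("output_level", -1) for ev in events)
    total_us = sum(ev.get("compaction_time_micros", 0) for ev in events)
    print(f"compactions finished : {len(events)}")
    print(f"total compaction time: {total_us / 1e6:.1f} s")
    for level, count in sorted(per_level.items()):
        print(f"  output level {level}: {count}")

if __name__ == "__main__":
    summarize(parse(sys.argv[1]))

Running something like this against the two OSDs' logs gives the counts we
compared.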

To answer your question about the kind of workload: it is quite diverse.
We serve both RBD and S3 clients, and their workloads differ enough that I
can't describe their behaviour with a single pattern. This might be a bad
idea, and we should probably break the cluster down into several more
purpose-specific clusters, but that's a problem for another day.

If you have any specific questions in mind that I should answer, *don't
hesitate to ask.* Judging by the increase in the number of compactions on
our OSDs after the option was set to true, the workload likely includes a
high rate of delete requests, which I think is what produces the high
number of tombstones.
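
For what it's worth, my mental model of compact_on_deletion is roughly the
toy sketch below (this is not Ceph or RocksDB code, and the window/trigger
numbers are made up for illustration): a file gets flagged for compaction
as soon as some sliding window of recent entries contains enough
tombstones, which is why a delete-heavy workload would trigger noticeably
more compactions.

from collections import deque
import random

def needs_compaction(entries, window=1000, trigger=500):
    """entries: iterable of booleans, True = tombstone (delete), False = put."""
    recent = deque(maxlen=window)
    deletes_in_window = 0
    for is_delete in entries:
        if len(recent) == window:
            # the oldest entry is about to fall out of the window
            deletes_in_window -= recent[0]
        recent.append(is_delete)
        deletes_in_window += is_delete
        if deletes_in_window >= trigger:
            return True
    return False

random.seed(0)
delete_heavy = (random.random() < 0.6 for _ in range(100_000))
write_heavy = (random.random() < 0.1 for _ in range(100_000))
print(needs_compaction(delete_heavy))  # True: a tombstone-dense window is found
print(needs_compaction(write_heavy))   # False: tombstones stay sparse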

Regards,
Nima

> On Tue, Apr 29, 2025 at 2:33 PM Enrico Bocchi <enrico.boc...@cern.ch>
> wrote:
>
>> Hello Nima,
>>
>> Unsure if you have found the root cause of the problem in the meantime.
>> Off the top of my head, in case it's useful:
>> - Quincy changes the default scheduler from wpq to mclock
>> - The default number of scrubs on each OSD is increased from 1 to 3
>> - There's a new rocksdb compact_on_deletion option that triggers
>> compaction more frequently according to the number of tombstones over a
>> sliding window.
>>
>> The latter, however, is very workload dependent. What type of workload
>> does the cluster serve?
>>
>> Cheers,
>> Enrico
>>
>>
>> On 4/15/25 12:13, Nima AbolhassanBeigi wrote:
>> > Hello again
>> > We have done some investigating. Since our first message did not have
>> > any information or context, let me add some new information.
>> >
>> > I looked through our dashboards, specifically the write latency panels
>> > in the OSD Overview dashboard.
>> > The latency for write operations had increased significantly; the
>> > physical write latency more or less stayed the same, but the write
>> > process and write prepare operations have increased dramatically.
>> >
>> > In the end, the RocksDB tunings helped our situation and brought the
>> > latency back down to more or less the pre-upgrade level.
>> > We changed the default parameters to the following:
>> > compression=kNoCompression,max_write_buffer_number=128,min_write_buffer_number_to_merge=16,compaction_style=kCompactionStyleLevel,write_buffer_size=8388608,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0
>> > This seems to have done the trick; however, we are still worried that
>> > we have only mitigated the symptoms rather than found the root cause of
>> > this problem.
>> >
>> > Another probably important detail is that the bluefs allocator for our
>> > OSDs is not the default hybrid but is set to the bitmap option. In
>> > addition, the bluestore block allocator score of these OSDs is, for
>> > most of them, almost 0.9, which I know is terrible, but what I don't
>> > understand is why this problem only appeared post-upgrade. Could it be
>> > the terrible block scores?
>> >
>> > Another interesting change was the memory usage of each OSD
>> > post-upgrade, which dropped by nearly half, while the node's memory
>> > cache/buffer usage increased.
>> >
>> > I've put screenshots of the Grafana panels in this Google Doc:
>> > https://docs.google.com/document/d/1DSf4MJoze_BTSWAJWoYyQetA33P6fwpMavYXPdAk5kU/edit?usp=sharing
>> > We ran our upgrade process on March 6 and again on March 7, and we
>> > tuned the RocksDB parameters on March 9, which is when the screenshots
>> > show the latency dropping.
>> >
>> > We are still looking for the root cause, and any help in finding it
>> > would be much appreciated.
>> >
>> > Regards
>> > Nima
>> >
>> > On Mon, Apr 7, 2025 at 3:27 PM Nima AbolhassanBeigi <
>> > nima.abolhassanbe...@gmail.com> wrote:
>> >
>> >> Hi dear Ceph community,
>> >>
>> >> We have encountered an issue with our Ceph cluster after upgrading
>> >> from v16.2.13 to v17.2.7.
>> >> The write latency on the OSDs has increased significantly and doesn't
>> >> seem to come back down. The average write latency has almost doubled,
>> >> and this has happened since we upgraded the OSDs.
>> >>
>> >> We would appreciate any help in figuring this out.
>> >>
>> >> Kind regards
>> >> Nima
>> >>
>>
>> --
>> Enrico Bocchi
>> CERN European Laboratory for Particle Physics
>> IT - Storage & Data Management  - General Storage Services
>> Mailbox: G20500 - Office: 31-2-010
>> 1211 Genève 23
>> Switzerland
>>
>>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
