Hello Nima,
Was the OSD restarted *after* you disabled compact_on_deletion?
From what I saw on a busy RGW cluster, compact_on_deletion triggers
rocksdb compaction much more frequently (by more than a factor of 2), but
this is clearly workload-dependent. I cannot tell whether compacting twice
as often produces the latency increase you have observed.
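If in doubt, you can check what the running daemon reports via its admin
socket; as far as I know these rocksdb settings are only applied when the
OSD opens rocksdb, i.e. at startup, which is why the restart matters. A
quick sketch (run it on the OSD host; it just filters the daemon's
in-memory config for anything mentioning compact_on_deletion):

import json
import subprocess

def compact_on_deletion_values(osd_id=0):
    # 'ceph daemon osd.N config show' dumps the daemon's in-memory config as JSON.
    out = subprocess.run(["ceph", "daemon", f"osd.{osd_id}", "config", "show"],
                         capture_output=True, text=True, check=True).stdout
    cfg = json.loads(out)
    return {k: v for k, v in cfg.items() if "compact_on_deletion" in k}

if __name__ == "__main__":
    print(compact_on_deletion_values(0))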
Cheers,
Enrico
On 5/10/25 20:19, Nima AbolhassanBeigi wrote:
Hi Enrico
We were aware of the change to the OSD scheduler before upgrading and
changed it back to wpq; we have deferred switching to mclock for another
time. However, to make sure recovery wasn't the root cause of the
problem, we also set the norecover flag on the cluster, and I would
expect that if the recovery process were behind this issue, the flag
would have mitigated it.
I think the rocksdb compact_on_deletion option is what caused the
problem, but since we cannot put that theory to the test for the time
being, I'm not entirely sure.
We suspect this because I checked the OSD logs with a Python script that
parses the compaction log lines and reports on all the compaction
operations that happened on the OSD covered by that log file.
I compared the extracted data between two OSDs, one with the option set
to true and the other with it set to false. The number of compactions on
the second OSD was almost half that of the first one, and since we have
previously seen rocksdb compactions affect the latency of our OSDs,
especially the SSD ones, we came to this conclusion.
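For reference, the script is roughly along the lines of the simplified
sketch below; it assumes rocksdb's EVENT_LOG_v1 JSON lines end up in the
OSD log, so the pattern may need adjusting for a different log format.

import json
import re
import sys
from collections import Counter

EVENT_RE = re.compile(r"EVENT_LOG_v1\s+(\{.*\})")

def compaction_stats(path):
    """Count rocksdb compaction events and sum the compacted output bytes."""
    events = Counter()
    output_bytes = 0
    for line in open(path, errors="replace"):
        m = EVENT_RE.search(line)
        if not m:
            continue
        try:
            ev = json.loads(m.group(1))
        except ValueError:
            continue
        name = ev.get("event", "")
        if name.startswith("compaction_"):
            events[name] += 1
        if name == "compaction_finished":
            # field name as emitted by rocksdb's event logger, as far as I recall
            output_bytes += ev.get("total_output_size", 0)
    return events, output_bytes

if __name__ == "__main__":
    for logfile in sys.argv[1:]:
        ev, out_bytes = compaction_stats(logfile)
        print(logfile, dict(ev), f"{out_bytes / 2**30:.1f} GiB written by compaction")

I ran it over log files covering the same period for the two OSDs and
compared the counts.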
The compaction data
(https://docs.google.com/document/d/1qXDAHvJnfnOPN4ZevdGbgHmsMse3UpL0gZfSvVicW6s/edit?usp=sharing)
*Do you think this is possible?*
And to answer your question about the kind of workload: it is quite
diverse. We serve both RBD and S3 clients, and their workloads differ
enough that I cannot describe the behaviour as any single pattern. This
might be a bad idea, and we should probably break the cluster up into
several more special-purpose clusters, but that's a problem for another
day.
If you have any specific questions in mind that I should answer, *don't
hesitate to ask.* Judging by the increase in the number of compactions on
our OSDs after the option was set to true, the workload likely includes a
high rate of delete requests, which I think is what produces the large
number of tombstones.
Regards,
Nima
On Tue, Apr 29, 2025, 2:33 PM Enrico Bocchi <enrico.boc...@cern.ch> wrote:
Hello Nima,
I am not sure whether you have found the root cause of the problem in the
meantime.
Off the top of my head, in case it is useful:
- Quincy changes the default scheduler from wpq to mclock
- The default number of scrubs on each OSD is increased from 1 to 3
- There's a new rocksdb compact_on_deletion option that triggers
compaction more frequently, based on the number of tombstones seen over a
sliding window.
The latter, however, is very workload dependent. What type of
workload
does the cluster serve?
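For what it's worth, the tombstone trigger works roughly as in the
conceptual sketch below (not RocksDB's actual code; the window and
trigger sizes are arbitrary placeholders): a sliding window over the most
recently written table entries counts tombstones, and once the count
exceeds the trigger the file gets marked for early compaction.

from collections import deque

def marks_for_compaction(entries, window=32768, trigger=16384):
    """entries: iterable of booleans, True meaning a deletion (tombstone)."""
    recent = deque(maxlen=window)
    tombstones = 0
    for is_delete in entries:
        if len(recent) == window:
            tombstones -= recent[0]   # oldest entry falls out of the window
        recent.append(is_delete)
        tombstones += is_delete
        if tombstones >= trigger:
            return True               # delete-heavy region: compact this file early
    return False

With a delete-heavy workload many table files trip this check, so
compactions run much more often than with the plain size- and level-based
triggers alone.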
Cheers,
Enrico
On 4/15/25 12:13, Nima AbolhassanBeigi wrote:
> Hello again
> We have done some investigating. Since our first message did not have
> any information or context, let me add some new information.
>
> I looked through our dashboards, in particular the write latency panels
> in the OSD Overview dashboard.
> The latency for write operations has increased significantly, but with
> a distinction: the physical write latency stayed more or less the same,
> while the write process and write prepare latencies increased
> dramatically.
>
> In the end, the RocksDB tunings helped our situation and brought the
> latency more or less back to its pre-upgrade level.
> We changed the default parameters to the following:
> compression=kNoCompression,max_write_buffer_number=128,min_write_buffer_number_to_merge=16,compaction_style=kCompactionStyleLevel,write_buffer_size=8388608,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0
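> For readability, that single string can be split into one option per
> line with a few lines of Python (plain string handling, nothing
> Ceph-specific), e.g.:
>
> import sys
>
> def split_rocksdb_options(opts: str) -> dict:
>     """Split a bluestore_rocksdb_options string into a key -> value dict."""
>     return dict(item.split("=", 1) for item in opts.split(",") if item)
>
> if __name__ == "__main__":
>     # Usage: python3 split_opts.py '<the option string>'
>     for key, value in sorted(split_rocksdb_options(sys.argv[1]).items()):
>         print(f"{key} = {value}")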
> This seems to have done the trick; however, we are still worried that
> we haven't found the root cause of this problem and have only mitigated
> the symptoms.
>
> Another possibly important detail is that the bluefs allocator for our
> OSDs is not the default hybrid but is set to bitmap. In addition, the
> bluestore allocator fragmentation score (for the block device) of these
> OSDs is, for most of them, almost 0.9, which I know is terrible, but
> what I don't understand is why this problem only appeared post-upgrade.
> Could it be the terrible fragmentation scores?
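> A score like this can be read per OSD over its admin socket, roughly as
> in the sketch below; the 'bluestore allocator score block' subcommand
> name is quoted from memory, so double-check it on your version.
>
> import subprocess
>
> def allocator_score(osd_id):
>     """Ask a local OSD for its block allocator fragmentation score (0 = none, 1 = worst)."""
>     out = subprocess.run(
>         ["ceph", "daemon", f"osd.{osd_id}", "bluestore", "allocator", "score", "block"],
>         capture_output=True, text=True, check=True).stdout
>     return out.strip()
>
> if __name__ == "__main__":
>     for osd_id in (0, 1, 2):   # OSD ids hosted on this machine (placeholder list)
>         print(f"osd.{osd_id}:", allocator_score(osd_id))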
>
> Another interesting change was the memory usage of each OSD
> post-upgrade, which dropped by nearly half, while the node's
> cache/buffer memory usage increased.
>
> I've put screenshots of the Grafana panels in this Google Doc:
> <https://docs.google.com/document/d/1DSf4MJoze_BTSWAJWoYyQetA33P6fwpMavYXPdAk5kU/edit?usp=sharing>
> We started our upgrade process on March 6 and continued it on March 7.
> We tuned the RocksDB parameters on March 9, which, as the screenshots
> show, is when the latency drops.
>
> We are still looking for the root cause, and I could use all the help I
> can get to find it; any pointers are much appreciated.
>
> Regards
> Nima
>
> On Mon, Apr 7, 2025 at 3:27 PM Nima AbolhassanBeigi <
> nima.abolhassanbe...@gmail.com> wrote:
>
>> Hi dear ceph community
>>
>> We have encountered an issue with our Ceph cluster after upgrading
>> from v16.2.13 to v17.2.7.
>> The write latency on the OSDs has increased significantly and does not
>> seem to come back down.
>> The average write latency has almost doubled, and this has been the
>> case ever since we upgraded the OSDs.
>>
>> We would appreciate any help in figuring this out.
>>
>> Kind regards
>> Nima
>>
--
Enrico Bocchi
CERN European Laboratory for Particle Physics
IT - Storage & Data Management - General Storage Services
Mailbox: G20500 - Office: 31-2-010
1211 Genève 23
Switzerland
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io