Sorry, a typo in my previous message: it is mClock, not "Malcom". The exact parameter is osd_recovery_max_active_ssd/hdd and its default is 10; to reduce it you first have to set the mClock recovery-settings override to true.
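For concreteness, a rough sketch of what that looks like on the CLI (this assumes the default mClock scheduler is active; the option names below come from recent releases, so please double-check them on your version before applying):

# allow recovery-related options to be changed while mClock is the scheduler
ceph config set osd osd_mclock_override_recovery_settings true
# then lower the per-OSD recovery limits (the values here are only examples)
ceph config set osd osd_recovery_max_active_hdd 1
ceph config set osd osd_recovery_max_active_ssd 3
# check what a given OSD actually ends up using, e.g. the one from this thread
ceph config show osd.247 osd_recovery_max_active_hdd

As far as I know, without the override flag mClock silently ignores changes to these recovery options, which is why just setting them may appear to have no effect.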
Restarting the OSD daemon alone will solve your issue.

Regards
Dev

On Fri, 2 May 2025 at 9:07 AM, Devender Singh <deven...@netskrt.io> wrote:

> Hello
>
> Try restarting the OSDs showing slow ops.
> Also, if any recovery is going on, the max recovery drives for Malcom is 10;
> try reducing it. That will resolve this issue.
> If it persists for a drive, then check SMART for errors and replace that drive.
>
> Regards
> Dev
>
> On Fri, 2 May 2025 at 8:45 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:
>
>> On 02/05/2025 13:57, Frédéric Nass wrote:
>> > To clarify, there's no "issue" with the code itself. It's just that the
>> > code now reveals a potential "issue" with the OSD's underlying device,
>> > as Igor explained.
>> >
>> > This warning can pop up starting from Quincy v17.2.8 (PR 59468), Reef
>> > v18.2.5 (PR #59466) and Squid v19.2.1 (PR #59464).
>> >
>> > Regards,
>> > Frédéric.
>>
>> Thanks Igor and Frédéric for the clarifications.
>>
>> However, this begs the question: what should users do when seeing such
>> slow ops? The quoted link:
>> https://docs.ceph.com/en/latest/rados/operations/health-checks/#bluestore-slow-op-alert
>> states it could be a drive issue, but not always...
>>
>> So I think it could be helpful to share information/experiences of what
>> users find to be the root cause of such issues. From our side:
>>
>> 1) With Octopus and earlier, we rarely saw such logs, and when they
>> happened, it was mainly bad drives.
>>
>> 2) When we upgraded from Octopus to Quincy, we started to see more users
>> complain. The complaint was not always due to a warning, but generally
>> slower performance plus higher latencies seen on charts, and we can see
>> it in the logs for a time period with something like:
>> grep -r "slow operation observed for" /var/log/ceph | grep "2024-11"
>>
>> 3) Many users with the issue reported improvement when they stopped or
>> reduced bulk deletions such as heavy patterns of RBD trim/discard/reclaim.
>> This recommendation was influenced by messages from Igor and Mark Nelson
>> on slow bulk deletions. It was also noticeable that after stopping trim,
>> the cluster would not report issues even at significantly higher client
>> load. This constituted the larger portion of issues we saw.
>>
>> 4) Generally, performing an offline DB compaction also helped:
>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-XX compact
>>
>> 5) For non-DB related warnings, some older OSDs had high fragmentation:
>> ceph daemon osd.XX bluestore allocator score block
>> Deleting and re-adding the same drive helped with slow ops.
>>
>> 6) To a lesser extent, the logs do indicate a defective drive, or a drive
>> of a different model/type that has much lower performance than the other
>> models in the cluster/pool.
>>
>> /Maged
>>
>> > ----- On 2 May 25, at 12:36, Eugen Block ebl...@nde.ag wrote:
>> >
>> >> The link Frederic shared is for 19.2.1, so yes, the new warning
>> >> appeared in 19.2.1 as well.
>> >>
>> >> Quoting Laimis Juzeliūnas <laimis.juzeliu...@oxylabs.io>:
>> >>
>> >>> Hi all,
>> >>>
>> >>> Could this also be an issue with 19.2.2?
>> >>> We have seen a few of these warnings right after upgrading from
>> >>> 19.2.0. A simple OSD restart removed them, but we haven't seen them
>> >>> before.
>> >>> There are some users on the Ceph Slack channels discussing this
>> >>> observation in 19.2.2 as well.
>> >>>
>> >>> Best,
>> >>> Laimis J.
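Regarding the offline compaction in Maged's point 4 above: ceph-kvstore-tool needs the OSD to be down while it runs. A rough outline of one way to do it (unit names and paths assume a traditional package-based deployment; cephadm-managed OSDs keep their data under /var/lib/ceph/<fsid>/, so adjust accordingly):

ceph osd set noout                      # avoid data migration while the OSD is down
systemctl stop ceph-osd@XX              # the OSD must be offline during the compaction
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-XX compact
systemctl start ceph-osd@XX
ceph osd unset noout                    # once the OSD is back up and in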
>> >>>
>> >>>> On 2 May 2025, at 13:11, Igor Fedotov <igor.fedo...@croit.io> wrote:
>> >>>>
>> >>>> Hi Everyone,
>> >>>>
>> >>>> well, indeed this warning has been introduced in 18.2.6.
>> >>>>
>> >>>> But I wouldn't say that it's not an issue. Having it permanently
>> >>>> visible (particularly for a specific OSD only) might indicate some
>> >>>> issues with this OSD which could negatively impact overall cluster
>> >>>> performance.
>> >>>>
>> >>>> The OSD log should be checked for potential clues, and more research
>> >>>> into the root cause is recommended.
>> >>>>
>> >>>> And once again - likely that's not a regression in 18.2.6 but rather
>> >>>> some additional diagnostics brought by the release which reveals a
>> >>>> potential issue.
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>> Igor
>> >>>>
>> >>>> On 02.05.2025 11:19, Frédéric Nass wrote:
>> >>>>> Hi Michel,
>> >>>>>
>> >>>>> This is not an issue. It's a new warning that can be adjusted or
>> >>>>> muted. Check this thread [1] and this part [2] of the Reef
>> >>>>> documentation about this new alert.
>> >>>>> It came to Reef with PR #59466 [3].
>> >>>>>
>> >>>>> Cheers,
>> >>>>> Frédéric.
>> >>>>>
>> >>>>> [1] https://www.spinics.net/lists/ceph-users/msg86131.html
>> >>>>> [2] https://docs.ceph.com/en/latest/rados/operations/health-checks/#bluestore-slow-op-alert
>> >>>>> [3] https://github.com/ceph/ceph/pull/59466
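As a side note for anyone who wants to quiet this alert while the root cause is being investigated: it can be muted temporarily, and the health-checks page linked above describes the thresholds behind it. A rough sketch (the two bluestore_* option names are taken from that page and the values are only examples, so please verify them against your release):

ceph health mute BLUESTORE_SLOW_OP_ALERT 1w    # silence the warning for a week

# or raise the thresholds that trigger it (per the linked documentation)
ceph config set osd bluestore_slow_ops_warn_lifetime 86400    # look-back window in seconds
ceph config set osd bluestore_slow_ops_warn_threshold 5       # slow-op indications needed to warn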
>> >>>>>
>> >>>>> ----- On 2 May 25, at 9:44, Michel Jouvin
>> >>>>> michel.jou...@ijclab.in2p3.fr wrote:
>> >>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> Since our upgrade to 18.2.6 two days ago, our cluster has been
>> >>>>>> reporting the warning "1 OSD(s) experiencing slow operations in
>> >>>>>> BlueStore":
>> >>>>>>
>> >>>>>> [root@dig-osd4 bluestore-slow-ops]# ceph health detail
>> >>>>>> HEALTH_WARN 1 OSD(s) experiencing slow operations in BlueStore
>> >>>>>> [WRN] BLUESTORE_SLOW_OP_ALERT: 1 OSD(s) experiencing slow operations in BlueStore
>> >>>>>>     osd.247 observed slow operation indications in BlueStore
>> >>>>>>
>> >>>>>> I have never seen this warning before, so I have the feeling it is
>> >>>>>> somehow related to the upgrade, and it doesn't seem related to the
>> >>>>>> regression mentioned in another thread (that should result in an
>> >>>>>> OSD crash).
>> >>>>>> Googling quickly, I found this reported on 19.2.1 with SSDs,
>> >>>>>> whereas in my case it is an HDD. I don't know if the workaround
>> >>>>>> mentioned in the issue (bdev_xxx_discard=true) also applies to
>> >>>>>> 18.2.6...
>> >>>>>>
>> >>>>>> Did somebody see this in 18.2.x? Any recommendation? Our plan was,
>> >>>>>> according to the best practices described recently in another
>> >>>>>> thread, to move from 18.2.2 to 18.2.6 and then from 18.2.6 to
>> >>>>>> 19.2.2... Will 19.2.2 clear this issue (at the risk of others, as
>> >>>>>> it is probably not widely used yet)?
>> >>>>>>
>> >>>>>> Best regards,
>> >>>>>>
>> >>>>>> Michel
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io