Hi,

Thanks! FYI, it was a transient problem, as explained in https://docs.ceph.com/en/latest/rados/operations/health-checks/#bluestore-slow-op-alert, and the warning disappeared once bluestore_slow_ops_warn_lifetime elapsed (which I reduced to half a day).
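For reference, a sketch of how that lifetime can be adjusted (assuming the
value is expressed in seconds, so half a day would be 43200):

ceph config set osd bluestore_slow_ops_warn_lifetime 43200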

Michel

On 02/05/2025 at 18:27, Devender Singh wrote:
Sorry, a typo in my previous message: it is mClock.
The exact parameter is osd_recovery_max_active_ssd/hdd, which is 10; to reduce
it you have to set the mClock override to true.
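A sketch of the config changes, assuming the override flag meant above is
osd_mclock_override_recovery_settings and using an example lower value:

ceph config set osd osd_mclock_override_recovery_settings true   # allow changing mClock-managed recovery limits
ceph config set osd osd_recovery_max_active_hdd 3                # example value, lower than the 10 mentioned above
ceph config set osd osd_recovery_max_active_ssd 3                # likewise for SSD-backed OSDs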

Restarting the OSD daemon alone will solve your issue.

Regards
Dev

On Fri, 2 May 2025 at 9:07 AM, Devender Singh <deven...@netskrt.io> wrote:

Hello

Try restarting the OSDs showing slow ops.
Also, if any recovery is going on, the max recovery setting under mClock is 10;
try reducing it. That will resolve this issue.
If it persists for a particular drive, then check its SMART data for errors and
replace that drive (example commands below).
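For illustration, restarting a single OSD and checking the drive's SMART data
could look like this (osd.NN and /dev/sdX are placeholders; the restart
command assumes a cephadm-managed cluster):

ceph orch daemon restart osd.NN   # restart one OSD daemon
smartctl -a /dev/sdX              # dump SMART health/error data for the underlying drive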

Regards
Dev

On Fri, 2 May 2025 at 8:45 AM, Maged Mokhtar <mmokh...@petasan.org> wrote:

On 02/05/2025 13:57, Frédéric Nass wrote:
To clarify, there's no "issue" with the code itself. It's just that the
code now reveals a potential "issue" with the OSD's underlying device, as
Igor explained.
This warning can pop up starting from Quincy v17.2.8 (PR #59468), Reef
v18.2.5 (PR #59466) and Squid v19.2.1 (PR #59464).
Regards,
Frédéric.

Thanks Igor and Frederic for the clarifications.

However, this raises the question: what should users do when seeing such slow
ops?
From the quoted link:

https://docs.ceph.com/en/latest/rados/operations/health-checks/#bluestore-slow-op-alert
which states it could be a drive issue, but not always...

So I think it could be helpful to share information/experiences about what
users find to be the root cause of such issues.
From our side:

1) With Octopus and earlier, we rarely saw such logs, and when they
happened, it was mainly bad drives.

2) When we upgraded from Octopus to Quincy, we started to see more
users complain.
The complaint was not always due to a warning, but generally slower
performance + higher latencies seen on charts + we could see it in the
logs for a given time period, e.g.:
grep -r "slow operation observed for" /var/log/ceph  | grep "2024-11"

3) Many users with this issue reported improvement when they stopped/reduced
bulk deletions, such as heavy patterns of RBD block trim/discard/reclaim.
This recommendation was influenced by messages from Igor and Mark Nelson
on slow bulk deletions.
It was also noticeable that after stopping trim, the cluster would not
report issues even at significantly higher client load.
This constituted the larger portion of the issues we saw.

4) Generally, performing an offline DB compaction also helped:
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-XX compact
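Note that offline compaction requires the OSD to be stopped first; on a
non-containerized deployment the full sequence would be something like:

systemctl stop ceph-osd@XX
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-XX compact
systemctl start ceph-osd@XX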

5) For non-DB related warnings, some older OSDs had high fragmentation,
which can be checked with:
ceph daemon osd.XX bluestore allocator score block
Deleting and re-adding the OSD on the same drive helped with the slow ops.

6) To a lesser extent, the logs do sometimes indicate a defective drive, or a
drive of a different model/type with much lower performance than the
other models in the cluster/pool.


/Maged


----- On 2 May 25, at 12:36, Eugen Block ebl...@nde.ag wrote:

The link Frederic shared is for 19.2.1, so yes, the new warning
appeared in 19.2.1 as well.

Quoting Laimis Juzeliūnas <laimis.juzeliu...@oxylabs.io>:

Hi all,

Could this also be an issue with 19.2.2?
We have seen a few of these warnings right after upgrading from
19.2.0. A simple OSD restart removed them, but we hadn't seen them
before.
There are some users on the Ceph Slack channels discussing this
observation in 19.2.2 as well.

Best,
Laimis J.

On 2 May 2025, at 13:11, Igor Fedotov <igor.fedo...@croit.io> wrote:

Hi Everyone,

well, indeed this warning has been introduced in 18.2.6.

But I wouldn't say it's not an issue. Having it permanently
visible (particularly for a specific OSD only) might indicate some
issues with that OSD, which could negatively impact overall cluster
performance.

The OSD log should be checked for potential clues, and more research on the
root cause is recommended.
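For example, a quick way to look for these events in the OSD log (assuming a
non-containerized deployment with default log paths; osd.247 is the OSD from
the original report, and the message string is the one quoted earlier in this
thread):

grep "slow operation observed for" /var/log/ceph/ceph-osd.247.log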

And once again - this is likely not a regression in 18.2.6 but
rather some additional diagnostics brought by the release, which
reveal a potential pre-existing issue.


Thanks,

Igor

On 02.05.2025 11:19, Frédéric Nass wrote:
Hi Michel,

This is not an issue. It's a new warning that can be adjusted or
muted. Check this thread [1] and this part [2] of the Reef
documentation about this new alert.
It came to Reef with PR #59466 [3].

Cheers,
Frédéric.

[1] https://www.spinics.net/lists/ceph-users/msg86131.html
[2] https://docs.ceph.com/en/latest/rados/operations/health-checks/#bluestore-slow-op-alert
[3] https://github.com/ceph/ceph/pull/59466
----- On 2 May 25, at 9:44, Michel Jouvin
michel.jou...@ijclab.in2p3.fr wrote:

Hi,

Since our upgrade to 18.2.6 two days ago, our cluster has been reporting
the warning "1 OSD(s) experiencing slow operations in BlueStore":

[root@dig-osd4 bluestore-slow-ops]# ceph health detail
HEALTH_WARN 1 OSD(s) experiencing slow operations in BlueStore
[WRN] BLUESTORE_SLOW_OP_ALERT: 1 OSD(s) experiencing slow operations in BlueStore
       osd.247 observed slow operation indications in BlueStore

I have never seen this warning before, so I have the feeling it is somehow
related to the upgrade, and it doesn't seem related to the regression
mentioned in another thread (which should result in an OSD crash).
Googling quickly, I found this reported on 19.2.1 with an SSD, whereas in my
case it is an HDD. I don't know if the workaround mentioned in the issue
(bdev_xxx_discard=true) also applies to 18.2.6...

Has somebody seen this in 18.2.x? Any recommendation? Our plan was,
according to the best practices described recently in another thread, to
move from 18.2.2 to 18.2.6 and then from 18.2.6 to 19.2.2... Will 19.2.2
clear this issue (at the risk of others, as it is probably not widely used yet)?

Best regards,

Michel
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
