> You raised a good point. The documentation should help users to identify if 
> the underlying device is actually failing or not.

This can be slippery, especially since we might have:

- Local NVMe devices
- Local SAS, usually with an AIC HBA that can be from LSI, Adaptec, Areca, etc.
- Local SATA, with either an onboard minimal HBA or an AIC (see above)
- Local FC drives (I’ll bet someone still uses these)
- External arrays with SAS/SATA interconnects
- External arrays with FC interconnects
- External AFAs with NVMe/TCP attachment
- External arrays with iSCSI attachment
- OSDs running on VMware virtual vols with vSAN et al. underneath (you don’t
want to know)



I’d be happy to discuss ideas for a hardware troubleshooting guide, with the 
caveat that it’s a very wide space and we can really only speak in general or 
example terms.  Interested in collaborating?  I’d send you a Ceph pin ;)


> 
> Below is what I would check if a {BLOCK,WAL,DB}_DEVICE_STALLED_READ_ALERT or 
> BLUESTORE_SLOW_OP_ALERT alert were triggered. I'm not certain all points apply 
> to each of these 4 alerts. Hopefully Igor, Adam or Mark can provide 
> clarification on this:
> 
> - When such an alert pops for a single OSD, start by compacting its database 
> (ceph tell osd.<id> compact or ceph-kvstore-tool) and restarting the OSD 
> (ceph orch daemon restart osd.<id>).
> 
>  If the slow ops pop again for this very OSD, verify that:
> 
>  - the OSD is not near-full (ceph health detail)
>  - the OSD's RocksDB has been properly resharded if it was created before the 
> Pacific release (ceph-bluestore-tool show-sharding)
>  - the OSD's RocksDB has not spilled over to the slow device (ceph tell 
> osd.<id> bluefs stats)
>  - the OSD's bluestore fragmentation is below 90% (ceph tell osd.<id> 
> bluestore allocator score block)
>  - the OSD's underlying hardware firmware is up-to-date (same as sane OSDs 
> using the same hardware on the same host)

This.  Lots of people don’t pay attention to drive firmware, but I’ve had 
direct experience, both as a user and from working for a drive manufacturer, 
that demonstrated its importance.  Ideally, smoke test a firmware update (like 
any update) in a lab first, then roll it out at failure-domain granularity.
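
For reference, roughly the sequence I’d run against a single flagged OSD 
(commands from memory, so double-check syntax against your release; <id> and 
/dev/<dev> are placeholders):

  # overall state and near-full check
  ceph health detail
  ceph osd df tree | grep -w osd.<id>

  # online compaction, then restart the daemon
  ceph tell osd.<id> compact
  ceph orch daemon restart osd.<id>

  # RocksDB sharding (run with the OSD stopped; OSDs created before Pacific
  # may still use the old unsharded layout)
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-<id> show-sharding

  # DB spillover onto the slow device
  ceph tell osd.<id> bluefs stats

  # allocator fragmentation score (0 = none, 1 = fully fragmented)
  ceph tell osd.<id> bluestore allocator score block

  # drive firmware version, to compare with healthy neighbours
  smartctl -i /dev/<dev>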

> 
>  If slow ops persist, check:
> 
>  - the OSD's commit/apply latency values reported by 'ceph osd perf' and 
> 'ceph tell osd.xxx bench'
>  - the OSD's aqu-sz on iostat -dmx 1 --pretty
> 
>  If commit/apply latency figures are significantly higher than those of other 
> sane OSDs using the same hardware on the same host, or if aqu-sz is high 
> (several hundred), consider replacing the drive.

For HDDs, turning off the volatile write cache can do wonders, but that usually 
affects the whole fleet, not individual outliers.  SSD firmware issues, though, 
can be more individual in how they manifest.

For HDDs, check for grown defects.  A handful is not entirely unusual, but many 
drives never grow any.  Since Nautilus these tend to be more subclinical than 
they used to be, but above, say, a handful, one might consider replacing the 
drive.

For SSDs, check the drive-reported lifetime used/remaining.  If a drive has 
less than 10-15% of its rated lifetime remaining, plan ahead before drives slow 
down or start wearing out faster than you can replace them.  Check for a high 
rate of reallocated blocks.  Sometimes a firmware update warrants a 
secure-erase operation and OSD redeployment, but this isn’t always necessary or 
feasible.
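
In practice that boils down to something like the following sketch (exact SMART 
attribute names vary by vendor and interface, so adjust the grep patterns):

  # Ceph-side view: compare commit/apply latency against peer OSDs
  ceph osd perf
  ceph tell osd.<id> bench

  # block-layer queue depth (aqu-sz) and latency on the host
  iostat -dmx 1 --pretty

  # SATA/NVMe: reallocated sectors, media errors, percentage of life used
  smartctl -a /dev/<dev> | grep -Ei 'realloc|percent|media|error'

  # SAS: grown defect list and uncorrected error counters
  smartctl -a /dev/<dev> | grep -Ei 'grown|uncorrected'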

> 
> - When such an alert pops for several OSDs in the cluster:

Look for a pattern.  Are they concentrated on one host?  A bad HBA or 
backplane, a layer-1 issue, or a suboptimal bonding config can all manifest 
this way.  Remember that primary OSDs farm out subops, so you’ll usually see a 
few worst-offender outliers plus a larger cohort of drives with cascading slow 
ops as a result.
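
A quick way to look for that pattern is to map the flagged OSDs back to hosts 
and device classes (<id> is a placeholder):

  # list the OSDs currently flagged in health detail
  ceph health detail | grep -i 'observed slow'

  # locate each flagged OSD in the CRUSH topology (host, rack, ...)
  ceph osd find <id>

  # or eyeball the distribution across the tree
  ceph osd tree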

> 
>  If the alert follows a specific workload like bulk deletions (e.g. rbd 
> trim/discard/reclaim), try to mitigate their impact with < could some 
> settings help here? > or reduce their occurrence.
> 
>  If the alert is not related to any specific workloads:
> 
>  - If affected OSDs use identical hardware on the same host (and no slow ops 
> are reported for OSDs using the same hardware on other hosts), investigate 
> host configuration: BIOS settings, firmware versions, kernel/tuned settings, 
> c-states, etc.
> 
>  - If affected OSDs use identical hardware and slow ops are reported for OSDs 
> on multiple hosts (and no slow ops are reported for OSDs using different 
> hardware), this may indicate a performance discrepancy between different 
> drive models in the cluster. This may not be an issue if the overall 
> cluster's performance is high enough for your workloads. If so, adjust the 
> alert thresholds for this specific hardware, per OSD or by using host masks 
> or device-class masks.
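
For that last point, the knobs controlling this warning are, if I recall the 
docs correctly, bluestore_slow_ops_warn_lifetime and 
bluestore_slow_ops_warn_threshold (plus the bdev_stalled_read_warn_* 
equivalents for the stalled-read alerts); verify the names with ceph config 
help. They take the usual masks, e.g.:

  # relax the threshold only for HDD OSDs
  ceph config set osd/class:hdd bluestore_slow_ops_warn_threshold 10

  # or only for OSDs on a given host
  ceph config set osd/host:<hostname> bluestore_slow_ops_warn_threshold 10

  # per-OSD override is also possible
  ceph config set osd.<id> bluestore_slow_ops_warn_threshold 10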

If you offload WAL+DB, validate that all OSDs are actually doing so.  
Especially as OSDs are replaced over time, it’s all too easy to miss 
maintaining this offload consistently.
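
One way to audit that (the metadata field names are from memory, so 
sanity-check against a known-good OSD first):

  # 1 means a dedicated DB/WAL device, 0 means everything sits on the main device
  ceph osd metadata | jq -r '.[] | [.id, .hostname, .bluefs_dedicated_db, .bluefs_dedicated_wal] | @tsv'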

> 
> Anyone feel free to comment.
> 
> Cheers,
> Frédéric.
> 
> PS: Since Pacific introduced RocksDB column families (sharding), could "slow 
> operation observed for" messages after upgrading from Octopus to Quincy be 
> related to non-resharded OSDs?

I naively thought the upgrade addressed those, but if existing OSDs are 
grandfathered in with the old layout instead, that would be good to note.
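
If they are grandfathered in, checking and fixing would look roughly like this 
(with the OSD stopped; I’m assuming the target sharding spec can be taken from 
the bluestore_rocksdb_cfs default, so verify before resharding):

  # show the current sharding of an OSD's RocksDB
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-<id> show-sharding

  # reshard it to the current default layout
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-<id> \
      --sharding "$(ceph config get osd bluestore_rocksdb_cfs)" reshard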

> 
> 
> ----- Le 2 Mai 25, à 17:42, Maged Mokhtar mmokh...@petasan.org a écrit :
> 
>> On 02/05/2025 13:57, Frédéric Nass wrote:
>>> To clarify, there's no "issue" with the code itself. It's just that the 
>>> code now
>>> reveals a potential "issue" with the OSD's underlying device, as Igor
>>> explained.
>>> 
>>> This warning can pop up starting from Quincy v17.2.8 (PR 59468), Reef 
>>> v18.2.5
>>> (PR #59466) and Squid v19.2.1 (PR #59464).
>>> 
>>> Regards,
>>> Frédéric.
>> 
>> 
>> Thanks Igor and Frederic for the clarifications.
>> 
>> However, this begs the question: what should users do when seeing such slow
>> ops?
>> From quoted link:
>> https://docs.ceph.com/en/latest/rados/operations/health-checks/#bluestore-slow-op-alert
>> Which states it could be a drive issue, but not always...
>> 
>> So I think it could be helpful to share information/experiences of what
>> users find to be the root cause of such issues.
>> From our side:
>> 
>> 1) With Octopus and earlier, we rarely saw such logs, and when they
>> happened, it was mainly bad drives.
>> 
>> 2) When we made an upgrade from Octopus->Quincy, we started to see more
>> users complain.
>> The complaint was not always due to a warning, but generally slower
>> performance + higher latencies seen on charts + we can see it in the
>> logs for a time period like:
>> grep -r "slow operation observed for" /var/log/ceph  | grep "2024-11"
>> 
>> 3) Many users with the issue reported improvement when they stopped/reduced
>> bulk deletions, i.e. heavy patterns of block rbd trim/discard/reclaim.
>> This recommendation was influenced by messages from Igor and Mark Nelson
>> on slow bulk deletions.
>> It was also noticeable that after stopping trim, the cluster would not
>> report issues even at significantly higher client load.
>> This constituted the larger portion of issues we saw.
>> 
>> 4) Generally performing an offline db compaction also helped:
>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-XX compact
>> 
>> 5) For non-DB-related warnings, some older OSDs had high fragmentation:
>> ceph daemon osd.XX bluestore allocator score block
>> Deleting and re-adding the same drive helped with slow ops.
>> 
>> 6) To a lesser extent, the logs do indicate a defective drive or a drive
>> with a different model/type that has much lower performance than the
>> other models in the cluster/pool.
>> 
>> 
>> /Maged
>> 
>> 
>>> 
>>> ----- Le 2 Mai 25, à 12:36, Eugen Block ebl...@nde.ag a écrit :
>>> 
>>>> The link Frederic shared is for 19.2.1, so yes, the new warning
>>>> appeared in 19.2.1 as well.
>>>> 
>>>> Zitat von Laimis Juzeliūnas <laimis.juzeliu...@oxylabs.io>:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> Could this also be an issue with 19.2.2?
>>>>> We have seen few of these warnings right after upgrading from
>>>>> 19.2.0. A simple OSD restart removed them, but we haven’t seen them
>>>>> before.
>>>>> There are some users on the Ceph Slack channels discussing this
>>>>> observation in 19.2.2 as well.
>>>>> 
>>>>> Best,
>>>>> Laimis J.
>>>>> 
>>>>>> On 2 May 2025, at 13:11, Igor Fedotov <igor.fedo...@croit.io> wrote:
>>>>>> 
>>>>>> Hi Everyone,
>>>>>> 
>>>>>> well, indeed this warning has been introduced in 18.2.6.
>>>>>> 
>>>>>> But I wouldn't say that's not an issue. Having it permanently
>>>>>> visible (particularly for a specific OSD only) might indicate some
>>>>>> issues with this OSD which could negatively impact overall cluster
>>>>>> performance.
>>>>>> 
>>>>>> OSD log to be checked for potential clues and more research on the
>>>>>> root cause is recommended.
>>>>>> 
>>>>>> And once again - likely that's not a regression in 18.2.6 but
>>>>>> rather some additional diagnostics brought by the release which
>>>>>> reveals a potential issue.
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Igor
>>>>>> 
>>>>>> On 02.05.2025 11:19, Frédéric Nass wrote:
>>>>>>> Hi Michel,
>>>>>>> 
>>>>>>> This is not an issue. It's a new warning that can be adjusted or
>>>>>>> muted. Check this thread [1] and this part [2] of the Reef
>>>>>>> documentation about this new alert.
>>>>>>> Came to Reef with PR #59466 [3].
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Frédéric.
>>>>>>> 
>>>>>>> [1] https://www.spinics.net/lists/ceph-users/msg86131.html
>>>>>>> [2] https://docs.ceph.com/en/latest/rados/operations/health-checks/#bluestore-slow-op-alert
>>>>>>> [3] https://github.com/ceph/ceph/pull/59466
>>>>>>> 
>>>>>>> ----- Le 2 Mai 25, à 9:44, Michel Jouvin
>>>>>>> michel.jou...@ijclab.in2p3.fr a écrit :
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Since our upgrade to 18.2.6 2 days ago, our cluster is reporting the
>>>>>>>> warning "1 OSD(s) experiencing slow operations in BlueStore":
>>>>>>>> 
>>>>>>>> [root@dig-osd4 bluestore-slow-ops]# ceph health detail
>>>>>>>> HEALTH_WARN 1 OSD(s) experiencing slow operations in BlueStore
>>>>>>>> [WRN] BLUESTORE_SLOW_OP_ALERT: 1 OSD(s) experiencing slow operations in
>>>>>>>> BlueStore
>>>>>>>>      osd.247 observed slow operation indications in BlueStore
>>>>>>>> 
>>>>>>>> I have never seen this warning before so I've the feeling it is somehow
>>>>>>>> related to the upgrade and it doesn't seem related to the regression
>>>>>>>> mentioned in another thread (that should result in an OSD crash).
>>>>>>>> Googling quickly, I found this reported on 19.2.1 with an SSD, whereas
>>>>>>>> in my case it is an HDD. I don't know if the workaround mentioned in the 
>>>>>>>> issue
>>>>>>>> (bdev_xxx_discard=true) also applies to 18.2.6...
>>>>>>>> 
>>>>>>>> Did somebody see this in 18.2.x? Any recommendation? Our plan was,
>>>>>>>> according to best practices described recently in another thread, to
>>>>>>>> move from 18.2.2 to 18.2.6 and then from 18.2.6 to 19.2.2... Will 
>>>>>>>> 19.2.2
>>>>>>>> clear this issue (at the risk of others as it is probably not
>>>>>>>> widely used)?
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> 
>>>>>>>> Michel