I’ve done the same.

It doesn’t help that smartctl doesn’t even try to produce consistent output for 
SATA, SAS, and NVMe drives, and especially that it doesn’t enforce uniform 
attribute labels.  Yes, the fundamental problem is that SMART is mechanism 
without policy, and inconsistently implemented, but smartctl doesn’t help as 
much as it could.  I’m working on a post-processor for drivedb.h to at least 
debulk the madness, e.g. casting all of the SSD life remaining/used metrics to 
the same name, and adding a primitive to subtract from 100.  If upstream won’t 
take that addition, I guess I’ll be forking smartmontools in toto.
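
To make that concrete, here's a rough sketch of the kind of normalization I 
mean.  It's Python rather than the actual post-processor, and the label table 
is illustrative, not exhaustive: coerce whatever wear label a given drive 
reports into one percent-life-remaining figure, with the subtract-from-100 step 
for labels that count life used.

#!/usr/bin/env python3
# Sketch only: normalize vendor-specific SSD wear labels from `smartctl -j`
# output into a single percent-life-remaining figure.
import json
import subprocess

# label -> True if the attribute counts life *used* and needs 100 - value.
# Illustrative entries only; real drives will need more.
WEAR_LABELS = {
    "Media_Wearout_Indicator": False,
    "Wear_Leveling_Count": False,
    "Percent_Lifetime_Remain": False,
    "Percent_Life_Remaining": False,
    "SSD_Life_Left": False,
    "Percent_Lifetime_Used": True,
}

def life_remaining(device: str) -> float | None:
    """Best-effort normalized percent-life-remaining for one device."""
    out = subprocess.run(["smartctl", "-j", "-A", device],
                         capture_output=True, text=True)
    data = json.loads(out.stdout or "{}")

    # NVMe reports wear as "percentage_used" in the health log.
    nvme = data.get("nvme_smart_health_information_log")
    if nvme and "percentage_used" in nvme:
        return 100.0 - float(nvme["percentage_used"])

    # SATA: scan the ATA attribute table for a label we recognize,
    # using the normalized value rather than the raw counter.
    for attr in data.get("ata_smart_attributes", {}).get("table", []):
        name = attr.get("name")
        if name in WEAR_LABELS:
            value = float(attr.get("value", 0))
            return 100.0 - value if WEAR_LABELS[name] else value
    return None

if __name__ == "__main__":
    import sys
    print(life_remaining(sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"))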

I had hopes for smartctl_exporter, but have given up on that project as too 
disorganized and contentious.

So the last time I did this, I looked up the details of all 76 drive SKUs in my 
fleet and hardcoded every one into my collection script.

Many systems cheap out and just use the overall pass/fail SMART health status 
attribute, which is very prone to reporting that clearly failing drives are 
just smurfy, and is thus worthless.  This is what BMCs seem to do, for example.  
The attributes I actually key on:

Grown defects - on a spinner, RMA or shred it if there are, say, more than 1 
per 2 TB.

SATA downshift errors - though these can be HBA issues as well.

UDMA/CRC errors - these silently slow things down, but in most cases can be 
addressed by reseating, and by using OEM carriers for SATA/SAS drives.

Rate of reallocated blocks - alert if the count is growing at anything but a 
very slow rate.
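
Something like the sketch below, fed the parsed output of smartctl -j per 
drive, is roughly what I mean.  The thresholds are the ones above, and the 
SCSI grown-defect field name is from memory, so treat both as assumptions to 
tune for your own fleet.

def flag_reasons(data: dict, capacity_tb: float) -> list[str]:
    """Apply the rough rules above to one drive's parsed `smartctl -j` output."""
    reasons = []

    # Spinners (SAS): grown defects, with ~1 per 2 TB as the RMA/shred line.
    grown = data.get("scsi_grown_defect_list")
    if grown is not None and grown > capacity_tb / 2:
        reasons.append(f"{grown} grown defects (more than 1 per 2 TB)")

    for attr in data.get("ata_smart_attributes", {}).get("table", []):
        name = attr.get("name", "")
        raw = attr.get("raw", {}).get("value", 0)
        if name == "UDMA_CRC_Error_Count" and raw > 0:
            reasons.append(f"{raw} UDMA/CRC errors: try reseating / an OEM carrier")
        if name == "Reallocated_Sector_Ct" and raw > 0:
            # The interesting signal is the rate: persist the count between
            # polls and alert when the delta is anything but tiny.
            reasons.append(f"{raw} reallocated sectors: watch the growth rate")
    return reasons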

Since Nautilus we are much less prone to grown errors than we used to be: the 
OSD will retry a failed write, so a remapped LBA will succeed the second time.  
There is IIRC a limit to how many of these retries are tolerated, but it does 
underscore the need to look deeper.

Similarly, one can alert on drives with high latency via a PromQL query.
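
For example, something along these lines against the Prometheus HTTP API; the 
endpoint, the 500 ms threshold, and the metric name (from the mgr prometheus 
module, if that's what you scrape) are assumptions for your environment.

import requests

PROM = "http://prometheus.example.com:9090"   # hypothetical endpoint
QUERY = "ceph_osd_commit_latency_ms > 500"    # tune metric and threshold

# Instant query; each result is one OSD currently over the threshold.
resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    osd = sample["metric"].get("ceph_daemon", "?")
    print(f"{osd}: {sample['value'][1]} ms commit latency")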

Oh, we were talking about the module.  I vote to remove it.  Note that the RH 
model is binary blobs, as is the ProphetStor model; both are incomplete and 
impossible to maintain in their current state.

It would be terrific to have a fully normalized and maintained / maintainable 
metric and prediction subsystem, but I haven’t seen anyone step up.  It would 
be too much for me to do myself, especially without access to hardware, and I 
fear that without multiple people involved we won’t have continuity and it’ll 
lapse again.  If at least one SWE with staying power steps up, I’d be willing to 
entertain an idea for a reimplementation that eschews opaque madness.

anthonydatri@Mac models % pwd
/Users/anthonydatri/git/ceph/src/pybind/mgr/diskprediction_local/models
anthonydatri@Mac models % file redhat/*
redhat/config.json:           JSON data
redhat/hgst_predictor.pkl:    data
redhat/hgst_scaler.pkl:       data
redhat/seagate_predictor.pkl: data
redhat/seagate_scaler.pkl:    data
anthonydatri@Mac models %

> On Apr 8, 2025, at 5:30 AM, Lukasz Borek <luk...@borek.org.pl> wrote:
> 
> +1
> 
> I wasn't aware that this module is obsolete and was trying to start it a
> few weeks ago.
> 
> We developed a home-made solution some time ago to monitor SMART data from
> both HDDs (uncorrected errors, grown defect list) and SSDs (WLC/TBW), but
> keeping it up to date with non-unified disk models is a nightmare.
> 
> An alert like "OSD.12 is going to fail. Replace it soon" before seeing SLOW_OPS
> would be a game changer!
> 
> Thanks!
> 
> On Tue, 8 Apr 2025 at 10:00, Michal Strnad <michal.str...@cesnet.cz> wrote:
> 
>> Hi.
>> 
>> From our point of view, it's important to keep a disk failure prediction
>> tool as part of Ceph, ideally as an MGR module. In environments with
>> hundreds or thousands of disks, it's crucial to know whether, for
>> example, a significant number of them are likely to fail within a month
>> - which, in the best-case scenario, would mean performance degradation,
>> and in the worst-case, data loss.
>> 
>> Some have already responded to the deprecation of diskprediction by
>> starting to develop their own solutions. For instance, just yesterday,
>> Daniel Persson published a solution [1] on his website that addresses
>> the same problem.
>> 
>> Would it be possible to join forces and try to revive that module?
>> 
>> [1] https://www.youtube.com/watch?v=Gr_GtC9dcMQ
>> 
>> Thanks,
>> Michal
>> 
>> 
>> On 4/8/25 01:18, Yaarit Hatuka wrote:
>>> Hi everyone,
>>> 
>>> On today's Ceph Steering Committee call we discussed the idea of removing
>>> the diskprediction_local mgr module, as the current prediction model is
>>> obsolete and not maintained.
>>> 
>>> We would like to gather feedback from the community about the usage of
>> this
>>> module, and find out if anyone is interested in maintaining it.
>>> 
>>> Thanks,
>>> Yaarit
> 
> 
> -- 
> Łukasz Borek
> luk...@borek.org.pl
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
