I've done the same as well. It doesn't help that smartctl doesn't even try to have consistent output for SATA, SAS, and NVMe drives, and especially that it doesn't enforce uniform attribute labels. Yes, the fundamental problem is that SMART is mechanism without policy, and inconsistently implemented, but smartctl doesn't help like it could. I'm working on a post-processor for drivedb.h to at least debulk the madness, e.g. casting all of the SSD life remaining/used metrics into the same name, and adding a primitive to subtract from 100. If upstream won't take that addition, I guess I'll be forking smartmontools in toto.
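To make that concrete: below is a minimal sketch in Python (not the actual drivedb.h post-processor) of the kind of folding I mean. It maps a handful of the wear/life label variants seen in the wild onto one canonical "life used" metric, subtracting from 100 where the vendor counts life *remaining*. The label table is illustrative, nowhere near exhaustive.

#!/usr/bin/env python3
# Sketch: fold vendor-specific SSD wear attributes from `smartctl -A -j`
# into one canonical "life used" percentage. The labels below are
# illustrative variants, not a complete or authoritative list.
import json
import subprocess

# label -> True if the normalized value counts life *remaining*,
# in which case we subtract from 100 to get life *used*.
WEAR_LABELS = {
    "Percent_Lifetime_Remain": True,
    "SSD_Life_Left": True,
    "Media_Wearout_Indicator": True,
    "Wear_Leveling_Count": True,
    "Percent_Lifetime_Used": False,
}

def life_used_pct(dev):
    out = subprocess.run(["smartctl", "-A", "-j", dev],
                         capture_output=True, text=True).stdout
    data = json.loads(out)
    # NVMe is the easy case: the health log reports this directly.
    nvme = data.get("nvme_smart_health_information_log")
    if nvme and "percentage_used" in nvme:
        return float(nvme["percentage_used"])
    # ATA: walk the attribute table and translate known labels.
    for attr in data.get("ata_smart_attributes", {}).get("table", []):
        if attr["name"] in WEAR_LABELS:
            value = float(attr["value"])
            return 100.0 - value if WEAR_LABELS[attr["name"]] else value
    return None

if __name__ == "__main__":
    print(life_used_pct("/dev/sda"))

NVMe is the trivial case since percentage_used is in the spec; it's the ATA label zoo that needs the folding.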
I had hopes for smartctl_exporter, but have given up on that project as too disorganized and contentious. So the last time I did this, I looked up the details of all 76 drive SKUs in my fleet and hardcoded every one into my collection script.

Many systems cheap out and just use the overall pass/fail SMART health status attribute, which is very prone to reporting that clearly failing drives are just smurfy, and is thus worthless. This is what BMCs seem to do, for example. The attributes worth watching instead (rough sketch further below):

- Grown defects: on a spinner, RMA or shred it if there are, say, more than 1 per 2 TB.
- SATA downshift errors: though these can be HBA issues as well.
- UDMA/CRC errors: silently slow, but can be addressed by reseating in most cases, and by using OEM carriers for SATA/SAS drives.
- Rate of reallocated blocks: alert unless the rate is quite slow.

Since Nautilus we are much less prone to grown errors than we used to be: the OSD will retry a failed write, so a remapped LBA will succeed the second time. There is IIRC a limit to how many of these are tolerated, but it does underscore the need to look deeper. Similarly, one can alert on drives with high latency via a PromQL query (also sketched below).

Oh, we were talking about the module. I vote to remove it. Note that the RH model is binary blobs, as is the ProphetStor model. Incomplete, and impossible to maintain in the current state. It would be terrific to have a fully normalized and maintained / maintainable metric and prediction subsystem, but I haven't seen anyone step up. It would be too much for me to do myself, especially without access to hardware, and I fear that without multiple people involved we won't have continuity and it'll lapse again. If at least one SWE with staying power steps up, I'd be willing to entertain an idea for reimplementation eschewing opaque madness.

anthonydatri@Mac models % pwd
/Users/anthonydatri/git/ceph/src/pybind/mgr/diskprediction_local/models
anthonydatri@Mac models % file redhat/*
redhat/config.json:           JSON data
redhat/hgst_predictor.pkl:    data
redhat/hgst_scaler.pkl:       data
redhat/seagate_predictor.pkl: data
redhat/seagate_scaler.pkl:    data
anthonydatri@Mac models %
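Re the attribute checks above: the per-SKU table my collection script hardcodes is conceptually something like the sketch below. The model prefixes and limits are made up for illustration; the real thing is 76 entries of tedium.

# Sketch of per-SKU attribute thresholds, along the lines of what my
# collection script hardcodes. Model prefixes and numbers here are
# illustrative placeholders, not my real fleet.
THRESHOLDS = {
    "ST16000NM": {"grown_defects_per_tb": 0.5, "udma_crc_errors": 10},
    "HUH721212": {"grown_defects_per_tb": 0.5, "udma_crc_errors": 10},
}

def check_drive(model, capacity_tb, grown_defects, udma_crc_errors):
    """Return alert strings for one drive, given its SMART counters."""
    alerts = []
    for prefix, limits in THRESHOLDS.items():
        if not model.startswith(prefix):
            continue
        # "more than 1 grown defect per 2 TB" == 0.5 per TB
        if grown_defects > limits["grown_defects_per_tb"] * capacity_tb:
            alerts.append(f"{model}: {grown_defects} grown defects - RMA or shred")
        if udma_crc_errors > limits["udma_crc_errors"]:
            alerts.append(f"{model}: {udma_crc_errors} UDMA/CRC errors - reseat, check carrier")
        break
    return alerts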
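And re flagging high-latency drives: assuming you scrape node_exporter (metric names will differ with other exporters), the PromQL is roughly the query embedded in this sketch, fired at the Prometheus HTTP API from Python. The endpoint URL and the 50 ms threshold are placeholders.

# Sketch: list devices whose average read latency over 5m exceeds 50 ms,
# using node_exporter metrics via the Prometheus HTTP API.
import requests

PROM_URL = "http://prometheus.example.com:9090/api/v1/query"  # placeholder
QUERY = (
    "rate(node_disk_read_time_seconds_total[5m])"
    " / rate(node_disk_reads_completed_total[5m]) > 0.050"
)

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    labels = sample["metric"]
    latency = float(sample["value"][1])
    print(f"{labels.get('instance')} {labels.get('device')}: "
          f"{latency * 1000:.1f} ms avg read latency")

The same expression can go straight into a Prometheus alerting rule.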
> On Apr 8, 2025, at 5:30 AM, Lukasz Borek <luk...@borek.org.pl> wrote:
>
> +1
>
> I wasn't aware that this module is obsolete and was trying to start it
> a few weeks ago.
>
> We developed a home-made solution some time ago to monitor SMART data
> from both HDD (uncorrected errors, grown defect list) and SSD
> (WLC/TBW). But keeping it up to date with non-unified disk models is a
> nightmare.
>
> An alert like "OSD.12 is going to fail. Replace it soon" before seeing
> SLOW_OPS would be a game changer!
>
> Thanks!
>
> On Tue, 8 Apr 2025 at 10:00, Michal Strnad <michal.str...@cesnet.cz> wrote:
>
>> Hi.
>>
>> From our point of view, it's important to keep the disk failure
>> prediction tool as part of Ceph, ideally as an MGR module. In
>> environments with hundreds or thousands of disks, it's crucial to know
>> whether, for example, a significant number of them are likely to fail
>> within a month - which, in the best-case scenario, would mean
>> performance degradation, and in the worst case, data loss.
>>
>> Some have already responded to the deprecation of diskprediction by
>> starting to develop their own solutions. For instance, just yesterday,
>> Daniel Persson published a solution [1] on his website that addresses
>> the same problem.
>>
>> Would it be possible to join forces and try to revive that module?
>>
>> [1] https://www.youtube.com/watch?v=Gr_GtC9dcMQ
>>
>> Thanks,
>> Michal
>>
>>
>> On 4/8/25 01:18, Yaarit Hatuka wrote:
>>> Hi everyone,
>>>
>>> On today's Ceph Steering Committee call we discussed the idea of
>>> removing the diskprediction_local mgr module, as the current
>>> prediction model is obsolete and not maintained.
>>>
>>> We would like to gather feedback from the community about the usage
>>> of this module, and find out if anyone is interested in maintaining
>>> it.
>>>
>>> Thanks,
>>> Yaarit
>
>
> --
> Łukasz Borek
> luk...@borek.org.pl