Had two users come at me with "why didn't you...?" because a machine had disk hardware failures but no alerts fired before the device died. They pointed at these messages in the kernel dmesg:
> [Wed May 17 06:07:05 2023] nvme nvme3: async event result 00010300
> [Wed May 17 06:07:25 2023] nvme nvme3: controller is down; will reset: CSTS=0x2, PCI_STATUS=0x10
> [Wed May 17 11:56:04 2023] print_req_error: I/O error, dev nvme3c33n1, sector 3125627392
> [Wed May 17 11:56:04 2023] print_req_error: I/O error, dev nvme3c33n1, sector 3125627392
> [Thu May 18 08:06:04 2023] Buffer I/O error on dev nvme3n1, logical block 390703424, async page read
> [Thu May 18 08:07:37 2023] print_req_error: I/O error, dev nvme3c33n1, sector 0
> [Thu May 18 08:07:37 2023] print_req_error: I/O error, dev nvme3c33n1, sector 256

I didn't find an "errors" counter in the kernel's iostats [1], so I assume node_exporter won't have one either. I did find node_filesystem_device_error, but it stayed at zero the whole time. What would be the Prometheus-y way to detect these errors so my users can have their alerts? I'm hoping to avoid something like "logtail | grep -c 'error'" feeding a counter.

[1: https://www.kernel.org/doc/html/latest/admin-guide/iostats.html ]

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CACDZGiKxT-kKodJQe44TL5-DRKwZ5fpazPhvkb4FijGS8iWjsQ%40mail.gmail.com.
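[Editor's note: one common "Prometheus-y" answer here is a log-to-metrics exporter such as mtail or grok_exporter tailing the kernel log, or a script dropped into node_exporter's textfile collector directory. As a minimal sketch of the textfile-collector route: the code below parses dmesg-style lines for the two error patterns shown in the quote above and renders a per-device counter in the Prometheus text exposition format. The metric name `node_dmesg_io_errors_total`, the regexes, and the function names are all assumptions made up for illustration, not an existing exporter.]

```python
#!/usr/bin/env python3
"""Sketch of a node_exporter textfile-collector script that counts kernel
I/O error lines per block device. Metric name and regexes are assumptions,
not part of any existing exporter."""

import re

# Matches the two dmesg patterns from the original post, e.g.:
#   [...] print_req_error: I/O error, dev nvme3c33n1, sector 3125627392
#   [...] Buffer I/O error on dev nvme3n1, logical block 390703424, async page read
IO_ERROR_RE = re.compile(
    r"(?:print_req_error: I/O error, dev (?P<dev1>\S+), sector"
    r"|Buffer I/O error on dev (?P<dev2>\S+),)"
)

def count_io_errors(dmesg_lines):
    """Return {device: error_count} for recognised I/O error lines."""
    counts = {}
    for line in dmesg_lines:
        m = IO_ERROR_RE.search(line)
        if m:
            dev = m.group("dev1") or m.group("dev2")
            counts[dev] = counts.get(dev, 0) + 1
    return counts

def render_textfile(counts):
    """Render the counts in the Prometheus text exposition format, suitable
    for writing atomically into node_exporter's --collector.textfile.directory."""
    out = [
        "# HELP node_dmesg_io_errors_total Kernel I/O error lines seen in dmesg.",
        "# TYPE node_dmesg_io_errors_total counter",
    ]
    for dev in sorted(counts):
        out.append('node_dmesg_io_errors_total{device="%s"} %d' % (dev, counts[dev]))
    return "\n".join(out) + "\n"
```

One caveat with this approach: the count resets whenever the kernel ring buffer is cleared or the host reboots, so the alert rule should use rate()/increase() semantics rather than comparing absolute values.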

