Had two users come at me with "why didn't you...?" because of a machine that had disk hardware failures, but no alerts before the device died. They pointed at these messages in the kernel dmesg:

> [Wed May 17 06:07:05 2023] nvme nvme3: async event result 00010300
> [Wed May 17 06:07:25 2023] nvme nvme3: controller is down; will reset: CSTS=0x2, PCI_STATUS=0x10
> [Wed May 17 11:56:04 2023] print_req_error: I/O error, dev nvme3c33n1, sector 3125627392
> [Wed May 17 11:56:04 2023] print_req_error: I/O error, dev nvme3c33n1, sector 3125627392
> [Thu May 18 08:06:04 2023] Buffer I/O error on dev nvme3n1, logical block 390703424, async page read
> [Thu May 18 08:07:37 2023] print_req_error: I/O error, dev nvme3c33n1, sector 0
> [Thu May 18 08:07:37 2023] print_req_error: I/O error, dev nvme3c33n1, sector 256

I didn't find an "errors" counter in iostats [1], so I'm guessing node_exporter won't have one either. I did find node_filesystem_device_error, but it stayed zero the whole time. What would be the Prometheus-y way to sense these errors so my users can have their alerts? I'm hoping to avoid feeding "logtail | grep -c 'error'" into a counter.

[1] https://www.kernel.org/doc/html/latest/admin-guide/iostats.html
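
The direction I'm leaning is a cron-driven script that dumps the drives' SMART error counters into a .prom file for node_exporter's textfile collector, roughly like the sketch below. It assumes nvme-cli and jq are installed; the metric names and output path are my own invention, and the JSON field names (media_errors, num_err_log_entries, critical_warning) are from the nvme-cli versions I have handy, so they may differ elsewhere.

#!/bin/sh
# Sketch: export NVMe SMART error counters for node_exporter's
# textfile collector (--collector.textfile.directory).
# Metric names and OUT path are hypothetical, not an established convention.
OUT=/var/lib/node_exporter/textfile/nvme_smart.prom
TMP="$OUT.$$"
: > "$TMP"
for dev in /dev/nvme[0-9]; do
  nvme smart-log "$dev" --output-format=json |
  jq -r --arg dev "$dev" '
    "nvme_media_errors_total{device=\"\($dev)\"} \(.media_errors)",
    "nvme_error_log_entries_total{device=\"\($dev)\"} \(.num_err_log_entries)",
    "nvme_critical_warning{device=\"\($dev)\"} \(.critical_warning)"
  ' >> "$TMP"
done
mv "$TMP" "$OUT"  # rename is atomic, so a scrape never sees a half-written file

With something like that in place, I imagine an alert along the lines of increase(nvme_error_log_entries_total[1h]) > 0 (names hypothetical, as above) would have caught this box before it died. Is that the usual approach, or is there something more built-in?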
