[ceph-users] Node failure -- corrupt memory

Shawn Iverson Mon, 11 Nov 2019 05:01:57 -0800

Hello Cephers!

Over the weekend one of my nodes went haywire, apparently because of
failed memory modules and/or a bad motherboard.


As a result, several OSDs blocked IO for more than 128s (in practice,
indefinitely).

I was not watching my alerts closely over the weekend, or I might have
caught it early. Every server that relies on Ceph stalled on the blocked
IO from the failing node, and each one had to be restarted after I took
the faulty node offline.

So, my question is: is there a way to tell Ceph to start marking OSDs out
when IO has been blocked for longer than a certain limit? Or are the risks
of doing so serious enough that I would be better off dealing with a
stalled Ceph cluster?
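
To make the question concrete, here is a rough sketch of the kind of
watchdog I have in mind. It is only an illustration under assumptions, not
something Ceph ships: the function names, the 128s default, and the cap on
how many OSDs may be marked out in one pass are all hypothetical, and the
per-OSD blocked times would have to be collected separately (for example
from the health output or the OSD admin sockets). The only Ceph command it
relies on is `ceph osd out <id>`, and it only builds the commands rather
than running them.

```python
from typing import Dict, List


def osds_to_mark_out(
    blocked_secs: Dict[int, float],
    limit_secs: float = 128.0,  # hypothetical threshold, taken from the 128s above
    max_out: int = 2,           # hypothetical cap so one bad reading cannot empty the cluster
) -> List[int]:
    """Pick OSD ids whose oldest blocked request exceeds limit_secs.

    blocked_secs maps OSD id -> age in seconds of its oldest blocked
    request. The result is ordered worst first and capped at max_out.
    """
    over = [(secs, osd) for osd, secs in blocked_secs.items() if secs > limit_secs]
    over.sort(reverse=True)  # longest-blocked OSD first
    return [osd for _secs, osd in over[:max_out]]


def out_commands(osd_ids: List[int]) -> List[List[str]]:
    """Build (but do not run) one `ceph osd out <id>` command per OSD."""
    return [["ceph", "osd", "out", str(osd)] for osd in osd_ids]


if __name__ == "__main__":
    # Made-up sample readings, purely for illustration.
    sample = {0: 10.0, 3: 500.0, 7: 130.0, 9: 900.0}
    for cmd in out_commands(osds_to_mark_out(sample)):
        print(" ".join(cmd))  # dry run: print instead of executing
```

The cap is the part I am least sure about: without it, a monitoring glitch
or a cluster-wide slowdown could mark many OSDs out at once and trigger
heavy rebalancing, which is exactly the sort of risk I am asking about.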

-- 
Shawn Iverson, CETL
Director of Technology
Rush County Schools
[email protected]

