Ceph health detail - http://pastebin.com/5URX9SsQ
PG dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
An OSD crash log (in a GitHub gist because it was too big for pastebin) - https://gist.github.com/qhartman/cb0e290df373d284cfb5
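
For anyone who wants to regenerate the first two of those on their own cluster, they're just the stock CLI output. A minimal sketch (assuming the ceph CLI and an admin keyring are available on the host, and that a plain substring filter on "active+clean" is close enough to how I trimmed the dump) looks like this:

    #!/usr/bin/env python
    # Sketch only: regenerates the health detail and trimmed pg dump,
    # assuming the `ceph` CLI and a working admin keyring on this host.
    import subprocess

    def ceph(*args):
        # Run a ceph CLI subcommand and return its stdout as text.
        return subprocess.check_output(("ceph",) + args).decode("utf-8")

    # Overall cluster health, including the per-PG problem detail (first paste).
    print(ceph("health", "detail"))

    # Full PG dump with the healthy active+clean PGs filtered out (second paste).
    for line in ceph("pg", "dump").splitlines():
        if "active+clean" not in line:
            print(line)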
And now I've got four OSDs that are looping.....

On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman <qhart...@direwolfdigital.com> wrote:

> So I'm in the middle of trying to triage a problem with my Ceph cluster
> running 0.80.5. I have 24 OSDs spread across 8 machines. The cluster has
> been running happily for about a year. This last weekend, something caused
> the box running the MDS to seize hard, and when we came in on Monday,
> several OSDs were down or unresponsive. I brought the MDS and the OSDs
> back online, and managed to get things running again with minimal data
> loss. I had to mark a few objects as lost, but things were apparently
> running fine at the end of the day on Monday.
>
> This afternoon, I noticed that one of the OSDs was apparently stuck in a
> crash/restart loop, and the cluster was unhappy. Performance was in the
> tank and "ceph status" was reporting all manner of problems, as one would
> expect if an OSD is misbehaving. I marked the offending OSD out, and the
> cluster started rebalancing as expected. However, a short while later I
> noticed that another OSD had started into a crash/restart loop. So I
> repeated the process. And it happened again. At this point I noticed that
> there are actually two at a time in this state.
>
> It's as if there's some toxic chunk of data being passed around, and when
> it lands on an OSD it kills it. Contrary to that, however, I tried just
> stopping an OSD when it's in a bad state, and once the cluster starts to
> rebalance with that OSD down and not previously marked out, another OSD
> will start crash-looping.
>
> I've investigated the disk of the first OSD I found with this problem,
> and it has no apparent corruption on the file system.
>
> I'll follow up to this shortly with links to pastes of log snippets. Any
> input would be appreciated. This is turning into a real cascade failure,
> and I haven't any idea how to stop it.
>
> QH
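
For completeness, the manual cycle I keep going through (spot the down OSD, mark it out, wait for the rebalance, repeat) amounts to roughly the sketch below. It's illustrative only, I've been doing this by hand rather than with a script, and the poll interval is an arbitrary placeholder:

    #!/usr/bin/env python
    # Rough sketch of the manual triage cycle described above, assuming the
    # `ceph` CLI and an admin keyring on this host. Not what was actually run.
    import json
    import subprocess
    import time

    def down_osds():
        # Ask the monitors which OSDs they currently see as down,
        # by parsing the JSON form of `ceph osd dump`.
        out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
        osd_map = json.loads(out.decode("utf-8"))
        return [o["osd"] for o in osd_map["osds"] if o["up"] == 0]

    while True:
        for osd_id in down_osds():
            # Mark the crash-looping OSD out so the cluster rebalances
            # around it, the same as running `ceph osd out <id>` by hand.
            print("marking osd.%d out" % osd_id)
            subprocess.check_call(["ceph", "osd", "out", str(osd_id)])
        time.sleep(30)  # arbitrary poll interval, placeholder only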