It looks like you may be able to work around the issue for the moment with "ceph osd set nodeep-scrub", since it appears to be the deep scrub that is getting stuck.
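For reference, a minimal sketch of that workaround using the standard OSD flag commands (nothing below is specific to this cluster):

    # pause deep scrubs cluster-wide while debugging
    ceph osd set nodeep-scrub
    # optionally pause regular scrubs as well
    ceph osd set noscrub

    # confirm the flags appear in the cluster status output
    ceph -s

    # re-enable scrubbing once the OSDs are stable again
    ceph osd unset nodeep-scrub
    ceph osd unset noscrub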
sage

On Fri, 6 Mar 2015, Quentin Hartman wrote:
> Ceph health detail - http://pastebin.com/5URX9SsQ
> pg dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
> an osd crash log (in a github gist because it was too big for pastebin) -
> https://gist.github.com/qhartman/cb0e290df373d284cfb5
>
> And now I've got four OSDs that are looping.....
>
> On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
> <qhart...@direwolfdigital.com> wrote:
>
> So I'm in the middle of trying to triage a problem with my ceph
> cluster running 0.80.5. I have 24 OSDs spread across 8 machines.
> The cluster has been running happily for about a year. This last
> weekend, something caused the box running the MDS to seize hard,
> and when we came in on Monday, several OSDs were down or
> unresponsive. I brought the MDS and the OSDs back online, and
> managed to get things running again with minimal data loss. I had
> to mark a few objects as lost, but things were apparently
> running fine at the end of the day on Monday.
>
> This afternoon, I noticed that one of the OSDs was apparently stuck in
> a crash/restart loop, and the cluster was unhappy. Performance was in
> the tank, and "ceph status" was reporting all manner of problems, as one
> would expect if an OSD is misbehaving. I marked the offending OSD out,
> and the cluster started rebalancing as expected. However, I noticed a
> short while later that another OSD had started into a crash/restart loop.
> So I repeated the process. And it happened again. At this point I
> noticed that there were actually two at a time in this state.
>
> It's as if there's some toxic chunk of data that is getting passed
> around, and when it lands on an OSD it kills it. Contrary to that,
> however, I tried just stopping an OSD when it was in a bad state, and
> once the cluster starts to try rebalancing with that OSD down and not
> previously marked out, another OSD will start crash-looping.
>
> I've investigated the disk of the first OSD I found with this problem,
> and it has no apparent corruption on the file system.
>
> I'll follow up to this shortly with links to pastes of log snippets.
> Any input would be appreciated. This is turning into a real cascade
> failure, and I haven't any idea how to stop it.
>
> QH
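For anyone finding this in the archives later: the recovery steps described above map roughly onto commands like the following. The osd and pg ids are placeholders, not values from this cluster, and mark_unfound_lost should only be used for objects you have given up on recovering; treat this as a sketch, not a recipe.

    # take a crash-looping OSD out of data placement (osd id 12 is a placeholder)
    ceph osd out 12

    # prevent automatic rebalancing while daemons are being restarted,
    # and clear the flag again afterwards
    ceph osd set noout
    ceph osd unset noout

    # list PGs that are stuck unclean
    ceph pg dump_stuck unclean

    # give up on unfound objects in a specific PG (pg id 2.1f is a placeholder);
    # 'revert' rolls unfound objects back to a previous version, or forgets them if new
    ceph pg 2.1f mark_unfound_lost revert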