Finally found an error that seems to provide some direction:

    -1> 2015-03-07 02:52:19.378808 7f175b1cf700 0 log [ERR] : scrub 3.18e e08a418e/rbd_data.3f7a2ae8944a.00000000000016c8/7//3 on disk size (0) does not match object info size (4120576) ajusted for ondisk to (4120576)
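In case it helps anyone else reading, here is roughly how I'm poking at the PG named in that error. The PG id 3.18e and the object name come straight from the log line; the paths assume the default FileStore layout under /var/lib/ceph, and I'm treating "ceph pg repair" as a last resort since, as I understand it, it just copies the primary's version over the replicas:

    # confirm which PGs the cluster currently flags as inconsistent
    ceph health detail | grep inconsistent

    # see which OSDs hold PG 3.18e and what state it is in
    ceph pg map 3.18e
    ceph pg 3.18e query

    # on the acting OSDs, look at the on-disk copy of the object from the error
    # (FileStore escapes the object name in the filename, so match on the rbd prefix;
    #  adjust the osd id/path for your layout)
    find /var/lib/ceph/osd/ceph-*/current/3.18e_head/ -name '*3f7a2ae8944a.00000000000016c8*' -ls

    # last resort, only once a good copy is confirmed: ask the primary to repair the PG
    ceph pg repair 3.18e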
I'm diving into google now and hoping for something useful. If anyone has a suggestion, I'm all ears!

QH

On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman <qhart...@direwolfdigital.com> wrote:
> Thanks for the suggestion, but that doesn't seem to have made a difference.
>
> I've shut the entire cluster down and brought it back up, and my config management system seems to have upgraded ceph to 0.80.8 during the reboot. Everything seems to have come back up, but I am still seeing the crash loops, so that seems to indicate that this is definitely something persistent, probably tied to the OSD data, rather than some weird transient state.
>
> On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil <s...@newdream.net> wrote:
>> It looks like you may be able to work around the issue for the moment with
>>
>>     ceph osd set nodeep-scrub
>>
>> as it looks like it is scrub that is getting stuck?
>>
>> sage
>>
>> On Fri, 6 Mar 2015, Quentin Hartman wrote:
>> > Ceph health detail - http://pastebin.com/5URX9SsQ
>> > pg dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
>> > an osd crash log (in a github gist because it was too big for pastebin) - https://gist.github.com/qhartman/cb0e290df373d284cfb5
>> >
>> > And now I've got four OSDs that are looping.....
>> >
>> > On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman <qhart...@direwolfdigital.com> wrote:
>> > So I'm in the middle of trying to triage a problem with my ceph cluster running 0.80.5. I have 24 OSDs spread across 8 machines. The cluster has been running happily for about a year. This last weekend, something caused the box running the MDS to seize hard, and when we came in on Monday, several OSDs were down or unresponsive. I brought the MDS and the OSDs back online, and managed to get things running again with minimal data loss. Had to mark a few objects as lost, but things were apparently running fine at the end of the day on Monday.
>> >
>> > This afternoon, I noticed that one of the OSDs was apparently stuck in a crash/restart loop, and the cluster was unhappy. Performance was in the tank and "ceph status" was reporting all manner of problems, as one would expect if an OSD is misbehaving. I marked the offending OSD out, and the cluster started rebalancing as expected. However, I noticed a short while later that another OSD had started into a crash/restart loop. So I repeated the process, and it happened again. At this point I noticed that there were actually two at a time in this state.
>> >
>> > It's as if there's some toxic chunk of data that is getting passed around, and when it lands on an OSD it kills it. Contrary to that, however, I tried just stopping an OSD when it's in a bad state, and once the cluster starts to try rebalancing with that OSD down and not previously marked out, another OSD will start crash-looping.
>> >
>> > I've investigated the disk of the first OSD I found with this problem, and it has no apparent corruption on the file system.
>> >
>> > I'll follow up to this shortly with links to pastes of log snippets. Any input would be appreciated. This is turning into a real cascade failure, and I haven't any idea how to stop it.
>> >
>> > QH
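For anyone skimming the quoted thread: the workaround Sage suggested, plus the loop I've been going through for each crash-looping OSD, boils down to something like the following. The osd id 12 is just a placeholder, and "noscrub" is my own addition on top of his "nodeep-scrub" suggestion:

    # pause scrubbing cluster-wide while the crash loops are chased down
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # take the currently crash-looping OSD out of data placement (placeholder id)
    ceph osd out 12

    # watch recovery and look for the next OSD to start flapping
    ceph -w
    ceph osd tree | grep -w down

    # once things are stable again, re-enable scrubbing
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub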