Finally found an error that seems to provide some direction:

    -1> 2015-03-07 02:52:19.378808 7f175b1cf700 0 log [ERR] : scrub 3.18e e08a418e/rbd_data.3f7a2ae8944a.00000000000016c8/7//3 on disk size (0) does not match object info size (4120576) ajusted for ondisk to (4120576)
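In case it helps anyone else reading, here is roughly how I'm poking at the PG named in that error. The PG id 3.18e and the object name come straight from the log line; the paths assume the default FileStore layout under /var/lib/ceph, and I'm treating "ceph pg repair" as a last resort since, as I understand it, it just copies the primary's version over the replicas:

    # confirm which PGs the cluster currently flags as inconsistent
    ceph health detail | grep inconsistent

    # see which OSDs hold PG 3.18e and what state it is in
    ceph pg map 3.18e
    ceph pg 3.18e query

    # on the acting OSDs, look at the on-disk copy of the object from the error
    # (FileStore escapes the object name in the filename, so match on the rbd prefix;
    #  adjust the osd id/path for your layout)
    find /var/lib/ceph/osd/ceph-*/current/3.18e_head/ -name '*3f7a2ae8944a.00000000000016c8*' -ls

    # last resort, only once a good copy is confirmed: ask the primary to repair the PG
    ceph pg repair 3.18e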
I'm diving into google now and hoping for something useful. If anyone has a suggestion, I'm all ears!

QH

On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman <qhart...@direwolfdigital.com> wrote:
> Thanks for the suggestion, but that doesn't seem to have made a difference.
>
> I've shut the entire cluster down and brought it back up, and my config management system seems to have upgraded ceph to 0.80.8 during the reboot. Everything seems to have come back up, but I am still seeing the crash loops, so that seems to indicate that this is definitely something persistent, probably tied to the OSD data, rather than some weird transient state.
>
> On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil <s...@newdream.net> wrote:
>> It looks like you may be able to work around the issue for the moment with
>>
>>     ceph osd set nodeep-scrub
>>
>> as it looks like it is scrub that is getting stuck?
>>
>> sage
>>
>> On Fri, 6 Mar 2015, Quentin Hartman wrote:
>> > Ceph health detail - http://pastebin.com/5URX9SsQ
>> > pg dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
>> > an osd crash log (in a github gist because it was too big for pastebin) - https://gist.github.com/qhartman/cb0e290df373d284cfb5
>> >
>> > And now I've got four OSDs that are looping.....
>> >
>> > On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman <qhart...@direwolfdigital.com> wrote:
>> > So I'm in the middle of trying to triage a problem with my ceph cluster running 0.80.5. I have 24 OSDs spread across 8 machines. The cluster has been running happily for about a year. This last weekend, something caused the box running the MDS to seize hard, and when we came in on Monday, several OSDs were down or unresponsive. I brought the MDS and the OSDs back online, and managed to get things running again with minimal data loss. Had to mark a few objects as lost, but things were apparently running fine at the end of the day on Monday.
>> >
>> > This afternoon, I noticed that one of the OSDs was apparently stuck in a crash/restart loop, and the cluster was unhappy. Performance was in the tank and "ceph status" was reporting all manner of problems, as one would expect if an OSD is misbehaving. I marked the offending OSD out, and the cluster started rebalancing as expected. However, I noticed a short while later that another OSD had started into a crash/restart loop. So I repeated the process, and it happened again. At this point I noticed that there were actually two at a time in this state.
>> >
>> > It's as if there's some toxic chunk of data that is getting passed around, and when it lands on an OSD it kills it. Contrary to that, however, I tried just stopping an OSD when it's in a bad state, and once the cluster starts to try rebalancing with that OSD down and not previously marked out, another OSD will start crash-looping.
>> >
>> > I've investigated the disk of the first OSD I found with this problem, and it has no apparent corruption on the file system.
>> >
>> > I'll follow up to this shortly with links to pastes of log snippets. Any input would be appreciated. This is turning into a real cascade failure, and I haven't any idea how to stop it.
>> >
>> > QH
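For anyone skimming the quoted thread: the workaround Sage suggested, plus the loop I've been going through for each crash-looping OSD, boils down to something like the following. The osd id 12 is just a placeholder, and "noscrub" is my own addition on top of his "nodeep-scrub" suggestion:

    # pause scrubbing cluster-wide while the crash loops are chased down
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # take the currently crash-looping OSD out of data placement (placeholder id)
    ceph osd out 12

    # watch recovery and look for the next OSD to start flapping
    ceph -w
    ceph osd tree | grep -w down

    # once things are stable again, re-enable scrubbing
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub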