Ceph health detail - http://pastebin.com/5URX9SsQ
PG dump summary (with active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
An OSD crash log (in a GitHub gist because it was too big for pastebin) - https://gist.github.com/qhartman/cb0e290df373d284cfb5
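
For anyone who wants to regenerate the first two of those on their own cluster, they're just the stock CLI output. A minimal sketch (assuming the ceph CLI and an admin keyring are available on the host, and that a plain substring filter on "active+clean" is close enough to how I trimmed the dump) looks like this:

    #!/usr/bin/env python
    # Sketch only: regenerates the health detail and trimmed pg dump,
    # assuming the `ceph` CLI and a working admin keyring on this host.
    import subprocess

    def ceph(*args):
        # Run a ceph CLI subcommand and return its stdout as text.
        return subprocess.check_output(("ceph",) + args).decode("utf-8")

    # Overall cluster health, including the per-PG problem detail (first paste).
    print(ceph("health", "detail"))

    # Full PG dump with the healthy active+clean PGs filtered out (second paste).
    for line in ceph("pg", "dump").splitlines():
        if "active+clean" not in line:
            print(line)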
And now I've got four OSDs that are looping.....

On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman <qhart...@direwolfdigital.com> wrote:

> So I'm in the middle of trying to triage a problem with my Ceph cluster
> running 0.80.5. I have 24 OSDs spread across 8 machines. The cluster has
> been running happily for about a year. This last weekend, something caused
> the box running the MDS to seize hard, and when we came in on Monday,
> several OSDs were down or unresponsive. I brought the MDS and the OSDs
> back online, and managed to get things running again with minimal data
> loss. I had to mark a few objects as lost, but things were apparently
> running fine at the end of the day on Monday.
>
> This afternoon, I noticed that one of the OSDs was apparently stuck in a
> crash/restart loop, and the cluster was unhappy. Performance was in the
> tank and "ceph status" was reporting all manner of problems, as one would
> expect if an OSD is misbehaving. I marked the offending OSD out, and the
> cluster started rebalancing as expected. However, a short while later I
> noticed that another OSD had started into a crash/restart loop. So I
> repeated the process. And it happened again. At this point I noticed that
> there are actually two at a time in this state.
>
> It's as if there's some toxic chunk of data being passed around, and when
> it lands on an OSD it kills it. Contrary to that, however, I tried just
> stopping an OSD when it's in a bad state, and once the cluster starts to
> rebalance with that OSD down and not previously marked out, another OSD
> will start crash-looping.
>
> I've investigated the disk of the first OSD I found with this problem,
> and it has no apparent corruption on the file system.
>
> I'll follow up to this shortly with links to pastes of log snippets. Any
> input would be appreciated. This is turning into a real cascade failure,
> and I haven't any idea how to stop it.
>
> QH
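
For completeness, the manual cycle I keep going through (spot the down OSD, mark it out, wait for the rebalance, repeat) amounts to roughly the sketch below. It's illustrative only, I've been doing this by hand rather than with a script, and the poll interval is an arbitrary placeholder:

    #!/usr/bin/env python
    # Rough sketch of the manual triage cycle described above, assuming the
    # `ceph` CLI and an admin keyring on this host. Not what was actually run.
    import json
    import subprocess
    import time

    def down_osds():
        # Ask the monitors which OSDs they currently see as down,
        # by parsing the JSON form of `ceph osd dump`.
        out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
        osd_map = json.loads(out.decode("utf-8"))
        return [o["osd"] for o in osd_map["osds"] if o["up"] == 0]

    while True:
        for osd_id in down_osds():
            # Mark the crash-looping OSD out so the cluster rebalances
            # around it, the same as running `ceph osd out <id>` by hand.
            print("marking osd.%d out" % osd_id)
            subprocess.check_call(["ceph", "osd", "out", str(osd_id)])
        time.sleep(30)  # arbitrary poll interval, placeholder only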