I'm working on redeploying a 14-node cluster. I'm running Giant 0.87.1.
Last Friday I got everything deployed and all was working well, and I set
noout and shut all the OSD nodes down over the weekend. Yesterday when I
spun it back up, the OSDs were behaving very strangely, incorrectly marking
each other down because of missed heartbeats, even though they were up. It
looked like some kind of low-level networking problem, but I couldn't find
any.

After much work, I narrowed the apparent source of the problem down to the
OSDs running on the first host I started in the morning. They were the ones
that logged the most messages about not being able to ping other OSDs,
and the other OSDs were mostly complaining about them. After running out of
other ideas to try, I restarted them, and then everything started working.
It's still working happily this morning. It seems as though when that set
of OSDs started they got stale OSD map information from the MON boxes,
which failed to be updated as the other OSDs came up. Does that make sense?
I still don't consider myself an expert on Ceph architecture and would
appreciate any corrections or other possible interpretations of events (I'm
happy to provide whatever additional information I can) so I can get a
deeper understanding of things. If my interpretation of events is correct,
it seems that might point at a bug.
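
If it happens again, one way to test the stale-map theory would be to
compare the newest OSD map epoch each OSD knows about against the cluster's
current epoch. The following is only a rough sketch under assumptions: it
assumes each OSD's admin-socket status (as I understand it, the JSON from
"ceph daemon osd.N status") carries "whoami" and "newest_map" fields, and
that the cluster epoch is taken from "ceph osd dump". The function name,
the max_lag parameter, and the sample numbers are made up for illustration.

```python
def stale_osds(cluster_epoch, osd_statuses, max_lag=0):
    """Return the ids of OSDs whose newest known map epoch lags the cluster.

    cluster_epoch: current OSD map epoch reported by the monitors (int).
    osd_statuses:  list of dicts, one per OSD, each assumed to contain
                   "whoami" (OSD id) and "newest_map" (newest epoch seen).
    max_lag:       hypothetical tolerance; an OSD is flagged only if it is
                   more than this many epochs behind.
    """
    stale = []
    for status in osd_statuses:
        lag = cluster_epoch - status["newest_map"]
        if lag > max_lag:
            stale.append(status["whoami"])
    return sorted(stale)


if __name__ == "__main__":
    # Invented sample data, not taken from my cluster.
    statuses = [
        {"whoami": 0, "newest_map": 812},
        {"whoami": 1, "newest_map": 811},
        {"whoami": 7, "newest_map": 840},
    ]
    print(stale_osds(840, statuses))
```

If the OSDs on the first host showed up in that list while the rest were
current, that would support the idea that they were working from an old
map; if every OSD were at the same epoch, the cause is probably elsewhere.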

QH
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com