We recently upgraded from Firefly to 0.94.3, and for the last week we have
had intermittent slow requests and flapping OSDs.  We have been unable to
nail down the cause, but it feels like it may be related to our osdmaps
not getting trimmed properly.  Most of our OSDs are now storing over 100GB
of data in the meta directory, almost all of it historical osdmaps going
back more than 7 days.
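
For reference, this is roughly how we have been sizing it up (paths assume
the default FileStore layout, and osd.1191 is just an example OSD id):

  # size of the meta dir where the stored osdmaps live, on an OSD node
  du -sh /var/lib/ceph/osd/ceph-*/current/meta

  # range of map epochs each OSD is still holding, via the admin socket
  ceph daemon osd.1191 status | grep -E 'oldest_map|newest_map'

  # what the mons have actually committed/trimmed
  ceph report | grep -E 'osdmap_first_committed|osdmap_last_committed'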

We did make a small cluster change (we added 35 OSDs to a 1445-OSD
cluster); the rebalance took about 36 hours and completed 10 days ago.
Since then the cluster has been HEALTH_OK and all PGs have been
active+clean, except for when we have an OSD flap.

When the OSDs flap they do not crash and restart; they just go unresponsive
for 1-3 minutes and then come back alive on their own.  They get marked
down by peers, cause some peering, and then they simply rejoin the cluster
and continue on their merry way.

We see a bunch of this in the logs while the OSD is catatonic:

Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143166 7f5b03679700  1
heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out
after 15

Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143176 7f5b03679700 10
osd.1191 1203850 internal heartbeat not healthy, dropping ping request

Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143210 7f5b04e7c700  1
heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out
after 15

Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143218 7f5b04e7c700 10
osd.1191 1203850 internal heartbeat not healthy, dropping ping request

Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143288 7f5b03679700  1
heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out
after 15

Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143293 7f5b03679700 10
osd.1191 1203850 internal heartbeat not healthy, dropping ping request


I have a chunk of logs at debug 20/5; not sure if I should have done just
20... It's pretty hard to catch: we basically have to see the slow requests
start and get debug logging set within about a 5-10 second window before
the OSD stops responding to the admin socket...
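
For context, when we do catch it in time we bump the levels with something
like this (osd.1191 is just an example id):

  # via the admin socket on the OSD node
  ceph daemon osd.1191 config set debug_osd 20/5
  ceph daemon osd.1191 config set debug_ms 1

  # or remotely from a mon/admin node
  ceph tell osd.1191 injectargs '--debug-osd 20/5 --debug-ms 1'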

As networking is almost always the cause of flapping OSDs, we have tested
the network quite extensively.  It hasn't changed physically since before
the Hammer upgrade and was performing well.  We have run a large number of
ping tests and have not seen a single dropped packet between OSD nodes or
between OSD nodes and mons.
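
The ping testing was nothing fancy, roughly along these lines (the host
list file name is made up), run from each OSD node against all the other
OSD nodes and the mons:

  # quick packet-loss sweep from one node to the rest of the cluster
  for h in $(cat ceph-hosts.txt); do
    echo -n "$h: "
    ping -c 100 -q $h | grep 'packet loss'
  done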

I don't see any error packets or drops on switches either.

Ideas?