Hi everybody,

I have a situation on Ceph Luminous where, under moderate I/O load, an OSD
gets briefly marked down by the monitors even though it is still running:

2018-07-10 10:27:01.257916 mon.node4 mon.0 172.16.0.4:6789/0 15590 :
cluster [INF] mon.node4 is new leader, mons node4,node5,node6,node7,node8
in quorum (ranks 0,1,2,3,4)
2018-07-10 10:27:01.306329 mon.node4 mon.0 172.16.0.4:6789/0 15595 :
cluster [INF] Health check cleared: MON_DOWN (was: 1/5 mons down, quorum
node4,node6,node7,node8)
2018-07-10 10:27:01.386124 mon.node4 mon.0 172.16.0.4:6789/0 15596 :
cluster [WRN] overall HEALTH_WARN 1 osds down; Reduced data availability: 1
pg peering; Degraded data redundancy: 58774/10188798 objects degraded
(0.577%), 13 pgs degraded; 412 slow requests are blocked > 32 sec
2018-07-10 10:27:02.598175 mon.node4 mon.0 172.16.0.4:6789/0 15597 :
cluster [WRN] Health check update: Degraded data redundancy: 77153/10188798
objects degraded (0.757%), 17 pgs degraded (PG_DEGRADED)
2018-07-10 10:27:02.598225 mon.node4 mon.0 172.16.0.4:6789/0 15598 :
cluster [WRN] Health check update: 381 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-07-10 10:27:02.598264 mon.node4 mon.0 172.16.0.4:6789/0 15599 :
cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
availability: 1 pg peering)
2018-07-10 10:27:02.608006 mon.node4 mon.0 172.16.0.4:6789/0 15600 :
cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2018-07-10 10:27:02.701029 mon.node4 mon.0 172.16.0.4:6789/0 15601 :
cluster [INF] osd.36 172.16.0.5:6800/3087 boot
2018-07-10 10:27:01.184334 osd.36 osd.36 172.16.0.5:6800/3087 23 : cluster
[WRN] Monitor daemon marked osd.36 down, but it is still running
2018-07-10 10:27:04.861372 mon.node4 mon.0 172.16.0.4:6789/0 15604 :
cluster [INF] Health check cleared: REQUEST_SLOW (was: 381 slow requests
are blocked > 32 sec)
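
The key line is osd.36 being marked down by the mon while the daemon
insists it is still running, which points at missed heartbeats rather than
a crash. To see how often each OSD is flapping, something like this over
the OSD logs should work (default log paths assumed, adjust for your
deployment):

    # count "marked down, but it is still running" events per OSD log
    for f in /var/log/ceph/ceph-osd.*.log; do
        printf '%s %s\n' "$f" "$(grep -c 'down, but it is still running' "$f")"
    done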

The OSDs that seem to be affected are Intel SSDs, model SSDSC2BX480G4L.
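
For anyone comparing notes, the current firmware revision on these can be
read with smartctl (the device name below is just an example):

    # show model and firmware revision; replace /dev/sdb with the real device
    smartctl -i /dev/sdb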

I have throttled backups to try to lessen the load, but when the problem
recurs it hits the same OSDs. It also has the side effect of taking down
the mon on the same node for a few seconds, which triggers a monitor
election.
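
As a stopgap while I dig into this, I'm considering the standard knobs for
flapping OSDs; the grace value below is a guess, not a recommendation:

    # stop the mons from marking OSDs down while debugging (unset afterwards!)
    ceph osd set nodown
    ceph osd unset nodown

    # or give OSDs longer to answer heartbeats (default is 20 seconds)
    ceph tell osd.* injectargs '--osd_heartbeat_grace 30'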

I am wondering whether this may be a firmware issue on these drives, and
whether anyone has any insight or additional troubleshooting steps I could
try to get a deeper look at this behavior.
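
The steps I know of so far are checking per-OSD latency and pulling recent
slow ops off the admin socket of an affected OSD; happy to hear better ones:

    # commit/apply latency for every OSD
    ceph osd perf

    # slowest recent ops on the suspect OSD (run on the node hosting osd.36)
    ceph daemon osd.36 dump_historic_ops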

I am going to upgrade the firmware on these drives and see if it helps.
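
Assuming the Intel SSD Data Center Tool (isdct) supports this model, the
plan is roughly (drive index 0 is just an example):

    # list Intel SSDs and their current firmware
    isdct show -intelssd

    # apply the newest available firmware to drive index 0
    isdct load -intelssd 0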

-- 
Shawn Iverson, CETL
Director of Technology
Rush County Schools
765-932-3901 x1171
ivers...@rushville.k12.in.us