Hi,
I have some strange OSD problems. Before the weekend I started some
rsync tests over CephFS, on a cache pool with an underlying EC KV pool.
Today the cluster is completely degraded:
[root@ceph003 ~]# ceph status
cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
health HEALTH_WARN 19 pgs backfill_toofull; 403 pgs degraded;
168 pgs down; 8 pgs incomplete; 168 pgs peering; 61 pgs stale; 403 pgs
stuck degraded; 176 pgs stuck inactive; 61 pgs stuck stale; 589 pgs
stuck unclean; 403 pgs stuck undersized; 403 pgs undersized; 300
requests are blocked > 32 sec; recovery 15170/27902361 objects
degraded (0.054%); 1922/27902361 objects misplaced (0.007%); 1 near
full osd(s)
monmap e1: 3 mons at
{ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 8, quorum 0,1,2
ceph001,ceph002,ceph003
mdsmap e5: 1/1/1 up {0=ceph003=up:active}, 2 up:standby
osdmap e719: 48 osds: 18 up, 18 in
pgmap v144887: 1344 pgs, 4 pools, 4139 GB data, 2624 kobjects
2282 GB used, 31397 GB / 33680 GB avail
15170/27902361 objects degraded (0.054%); 1922/27902361
objects misplaced (0.007%)
68 down+remapped+peering
1 active
754 active+clean
1 stale+incomplete
1 stale+active+clean+scrubbing
14 active+undersized+degraded+remapped
7 incomplete
100 down+peering
9 active+remapped
59 stale+active+undersized+degraded
19 active+undersized+degraded+remapped+backfill_toofull
311 active+undersized+degraded
I tried to figure out what happened in the global logs:
2014-09-13 08:01:19.433313 mon.0 10.141.8.180:6789/0 66076 : [INF]
pgmap v65892: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used,
126 TB / 129 TB avail; 4159 kB/s wr, 45 op/s
2014-09-13 08:01:20.443019 mon.0 10.141.8.180:6789/0 66078 : [INF]
pgmap v65893: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used,
126 TB / 129 TB avail; 561 kB/s wr, 11 op/s
2014-09-13 08:01:20.777988 mon.0 10.141.8.180:6789/0 66081 : [INF]
osd.19 10.141.8.181:6809/29664 failed (3 reports from 3 peers after
20.000079 >= grace 20.000000)
2014-09-13 08:01:21.455887 mon.0 10.141.8.180:6789/0 66083 : [INF]
osdmap e117: 48 osds: 47 up, 48 in
2014-09-13 08:01:21.462084 mon.0 10.141.8.180:6789/0 66084 : [INF]
pgmap v65894: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used,
126 TB / 129 TB avail; 1353 kB/s wr, 13 op/s
2014-09-13 08:01:21.477007 mon.0 10.141.8.180:6789/0 66085 : [INF]
pgmap v65895: 1344 pgs: 187 stale+active+clean, 1157 active+clean;
2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 2300 kB/s wr, 21 op/s
2014-09-13 08:01:22.456055 mon.0 10.141.8.180:6789/0 66086 : [INF]
osdmap e118: 48 osds: 47 up, 48 in
2014-09-13 08:01:22.462590 mon.0 10.141.8.180:6789/0 66087 : [INF]
pgmap v65896: 1344 pgs: 187 stale+active+clean, 1157 active+clean;
2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 13686 kB/s wr, 5 op/s
2014-09-13 08:01:23.464302 mon.0 10.141.8.180:6789/0 66088 : [INF]
pgmap v65897: 1344 pgs: 187 stale+active+clean, 1157 active+clean;
2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 11075 kB/s wr, 4 op/s
2014-09-13 08:01:24.477467 mon.0 10.141.8.180:6789/0 66089 : [INF]
pgmap v65898: 1344 pgs: 187 stale+active+clean, 1157 active+clean;
2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 4932 kB/s wr, 38 op/s
2014-09-13 08:01:25.481027 mon.0 10.141.8.180:6789/0 66090 : [INF]
pgmap v65899: 1344 pgs: 187 stale+active+clean, 1157 active+clean;
2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 5726 kB/s wr, 64 op/s
2014-09-13 08:01:19.336173 osd.1 10.141.8.180:6803/26712 54442 : [WRN]
1 slow requests, 1 included below; oldest blocked for > 30.000137 secs
2014-09-13 08:01:19.336341 osd.1 10.141.8.180:6803/26712 54443 : [WRN]
slow request 30.000137 seconds old, received at 2014-09-13
08:00:49.335339: osd_op(client.7448.1:17751783 10000203eac.0000000e
[write 0~319488 [1@-1],startsync 0~0] 1.b
6c3a3a9 snapc 1=[] ondisk+write e116) currently reached pg
2014-09-13 08:01:20.337602 osd.1 10.141.8.180:6803/26712 54444 : [WRN]
7 slow requests, 6 included below; oldest blocked for > 31.001947 secs
2014-09-13 08:01:20.337688 osd.1 10.141.8.180:6803/26712 54445 : [WRN]
slow request 30.998110 seconds old, received at 2014-09-13
08:00:49.339176: osd_op(client.7448.1:17751787 10000203eac.0000000e
[write 319488~65536 [1@-1],startsync 0~0]
This is happening to one OSD after another.
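To reconstruct the sequence, the failure events can be pulled out of the monitor's cluster log (/var/log/ceph/ceph.log by default). A rough sketch, shown here against one of the lines above; on the monitor host you would grep the whole log file instead of the echoed sample:

```shell
# Extract the timestamp and OSD id of each "failed" event from the cluster log.
# On the monitor host, the real command would be:
#   grep 'failed (' /var/log/ceph/ceph.log | awk '{print $1, $2, $8}'
# Demonstrated here against a sample line from the log excerpt above:
echo '2014-09-13 08:01:20.777988 mon.0 10.141.8.180:6789/0 66081 : [INF] osd.19 10.141.8.181:6809/29664 failed (3 reports from 3 peers after 20.000079 >= grace 20.000000)' \
  | grep 'failed (' \
  | awk '{print $1, $2, $8}'
```

That gives one line per failure (timestamp plus OSD id), which makes the order and the spacing of the crashes easy to see.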
I tried to check the individual OSD logs, but they all stop abruptly
(even the logs of the OSDs that are still running):
2014-09-12 14:25:51.205276 7f3517209700 0 log [WRN] : 41 slow
requests, 1 included below; oldest blocked for > 38.118088 secs
2014-09-12 14:25:51.205337 7f3517209700 0 log [WRN] : slow request
36.558286 seconds old, received at 2014-09-12 14:25:14.646836:
osd_op(client.7448.1:2458392 1000006328f.0000000b [write
3989504~204800 [1@-1],startsync 0~0] 1.9337bf4b snapc 1=[]
ondisk+write e116) currently reached pg
2014-09-12 14:25:53.205586 7f3517209700 0 log [WRN] : 30 slow
requests, 1 included below; oldest blocked for > 40.118530 secs
2014-09-12 14:25:53.205679 7f3517209700 0 log [WRN] : slow request
30.541026 seconds old, received at 2014-09-12 14:25:22.664538:
osd_op(client.7448.1:2460291 100000632b7.00000000 [write 0~691
[1@-1],startsync 0~0] 1.994248a8 snapc 1=[] ondisk+write e116)
currently reached pg
2014-09-12 17:52:40.503917 7f34e8ed2700 0 -- 10.141.8.181:6809/29664
>> 10.141.8.181:6847/62389 pipe(0x247ce040 sd=327 :6809 s=0 pgs=0
cs=0 l=1 c=0x1bc8b9c0).accept replacing existing (lossy) channel (new
one lossy=1)
I *think* the missing log output is related to another issue I just
found (http://tracker.ceph.com/issues/9470).
So I can't determine the original problem from the log files.
Is there any other way to find out what caused 30 OSDs to crash?
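One thing I still want to rule out is the kernel OOM killer taking the OSD processes down, since that would also explain the logs stopping abruptly. A sketch of the check I have in mind for each OSD host (the exact kernel message wording varies by kernel version, so this is an assumption, demonstrated against a made-up sample line):

```shell
# Count OOM-killer traces in the kernel log on an OSD host.
# In practice: dmesg | grep -ciE 'out of memory|oom-killer|killed process'
# Demonstrated against a hypothetical sample kernel message:
echo 'Out of memory: Kill process 29664 (ceph-osd) score 867 or sacrifice child' \
  | grep -ciE 'out of memory|oom-killer|killed process'
```

A non-zero count would point at memory pressure rather than a Ceph bug.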
Thanks!!
Kenneth
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com