Hi,
I have some strange OSD problems. Before the weekend I started some
rsync tests over CephFS, on a cache pool with an underlying EC KV pool.
Today the cluster is completely degraded:
[root@ceph003 ~]# ceph status
cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
health HEALTH_WARN 19 pgs backfill_toofull; 403 pgs degraded;
168 pgs down; 8 pgs incomplete; 168 pgs peering; 61 pgs stale; 403 pgs
stuck degraded; 176 pgs stuck inactive; 61 pgs stuck stale; 589 pgs
stuck unclean; 403 pgs stuck undersized; 403 pgs undersized; 300
requests are blocked > 32 sec; recovery 15170/27902361 objects
degraded (0.054%); 1922/27902361 objects misplaced (0.007%); 1 near
full osd(s)
monmap e1: 3 mons at
{ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 8, quorum 0,1,2
ceph001,ceph002,ceph003
mdsmap e5: 1/1/1 up {0=ceph003=up:active}, 2 up:standby
osdmap e719: 48 osds: 18 up, 18 in
pgmap v144887: 1344 pgs, 4 pools, 4139 GB data, 2624 kobjects
2282 GB used, 31397 GB / 33680 GB avail
15170/27902361 objects degraded (0.054%); 1922/27902361
objects misplaced (0.007%)
68 down+remapped+peering
1 active
754 active+clean
1 stale+incomplete
1 stale+active+clean+scrubbing
14 active+undersized+degraded+remapped
7 incomplete
100 down+peering
9 active+remapped
59 stale+active+undersized+degraded
19 active+undersized+degraded+remapped+backfill_toofull
311 active+undersized+degraded
I tried to figure out what happened in the global logs:
2014-09-13 08:01:19.433313 mon.0 10.141.8.180:6789/0 66076 : [INF]
pgmap v65892: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used,
126 TB / 129 TB avail; 4159 kB/s wr, 45 op/s
2014-09-13 08:01:20.443019 mon.0 10.141.8.180:6789/0 66078 : [INF]
pgmap v65893: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used,
126 TB / 129 TB avail; 561 kB/s wr, 11 op/s
2014-09-13 08:01:20.777988 mon.0 10.141.8.180:6789/0 66081 : [INF]
osd.19 10.141.8.181:6809/29664 failed (3 reports from 3 peers after
20.000079 >= grace 20.000000)
2014-09-13 08:01:21.455887 mon.0 10.141.8.180:6789/0 66083 : [INF]
osdmap e117: 48 osds: 47 up, 48 in
2014-09-13 08:01:21.462084 mon.0 10.141.8.180:6789/0 66084 : [INF]
pgmap v65894: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used,
126 TB / 129 TB avail; 1353 kB/s wr, 13 op/s
2014-09-13 08:01:21.477007 mon.0 10.141.8.180:6789/0 66085 : [INF]
pgmap v65895: 1344 pgs: 187 stale+active+clean, 1157 active+clean;
2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 2300 kB/s wr, 21 op/s
2014-09-13 08:01:22.456055 mon.0 10.141.8.180:6789/0 66086 : [INF]
osdmap e118: 48 osds: 47 up, 48 in
2014-09-13 08:01:22.462590 mon.0 10.141.8.180:6789/0 66087 : [INF]
pgmap v65896: 1344 pgs: 187 stale+active+clean, 1157 active+clean;
2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 13686 kB/s wr, 5 op/s
2014-09-13 08:01:23.464302 mon.0 10.141.8.180:6789/0 66088 : [INF]
pgmap v65897: 1344 pgs: 187 stale+active+clean, 1157 active+clean;
2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 11075 kB/s wr, 4 op/s
2014-09-13 08:01:24.477467 mon.0 10.141.8.180:6789/0 66089 : [INF]
pgmap v65898: 1344 pgs: 187 stale+active+clean, 1157 active+clean;
2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 4932 kB/s wr, 38 op/s
2014-09-13 08:01:25.481027 mon.0 10.141.8.180:6789/0 66090 : [INF]
pgmap v65899: 1344 pgs: 187 stale+active+clean, 1157 active+clean;
2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 5726 kB/s wr, 64 op/s
2014-09-13 08:01:19.336173 osd.1 10.141.8.180:6803/26712 54442 : [WRN]
1 slow requests, 1 included below; oldest blocked for > 30.000137 secs
2014-09-13 08:01:19.336341 osd.1 10.141.8.180:6803/26712 54443 : [WRN]
slow request 30.000137 seconds old, received at 2014-09-13
08:00:49.335339: osd_op(client.7448.1:17751783 10000203eac.0000000e
[write 0~319488 [1@-1],startsync 0~0] 1.b
6c3a3a9 snapc 1=[] ondisk+write e116) currently reached pg
2014-09-13 08:01:20.337602 osd.1 10.141.8.180:6803/26712 54444 : [WRN]
7 slow requests, 6 included below; oldest blocked for > 31.001947 secs
2014-09-13 08:01:20.337688 osd.1 10.141.8.180:6803/26712 54445 : [WRN]
slow request 30.998110 seconds old, received at 2014-09-13
08:00:49.339176: osd_op(client.7448.1:17751787 10000203eac.0000000e
[write 319488~65536 [1@-1],startsync 0~0]
This is happening to one OSD after another.
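To reconstruct the sequence, the failure events can be pulled out of the monitor's cluster log (/var/log/ceph/ceph.log by default). A rough sketch, shown here against one of the lines above; on the monitor host you would grep the whole log file instead of the echoed sample:

```shell
# Extract the timestamp and OSD id of each "failed" event from the cluster log.
# On the monitor host, the real command would be:
#   grep 'failed (' /var/log/ceph/ceph.log | awk '{print $1, $2, $8}'
# Demonstrated here against a sample line from the log excerpt above:
echo '2014-09-13 08:01:20.777988 mon.0 10.141.8.180:6789/0 66081 : [INF] osd.19 10.141.8.181:6809/29664 failed (3 reports from 3 peers after 20.000079 >= grace 20.000000)' \
  | grep 'failed (' \
  | awk '{print $1, $2, $8}'
```

That gives one line per failure (timestamp plus OSD id), which makes the order and the spacing of the crashes easy to see.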
I tried to check the individual OSD logs, but they all stop abruptly
(even the logs of the OSDs that are still running):
2014-09-12 14:25:51.205276 7f3517209700 0 log [WRN] : 41 slow
requests, 1 included below; oldest blocked for > 38.118088 secs
2014-09-12 14:25:51.205337 7f3517209700 0 log [WRN] : slow request
36.558286 seconds old, received at 2014-09-12 14:25:14.646836:
osd_op(client.7448.1:2458392 1000006328f.0000000b [write
3989504~204800 [1@-1],startsync 0~0] 1.9337bf4b snapc 1=[]
ondisk+write e116) currently reached pg
2014-09-12 14:25:53.205586 7f3517209700 0 log [WRN] : 30 slow
requests, 1 included below; oldest blocked for > 40.118530 secs
2014-09-12 14:25:53.205679 7f3517209700 0 log [WRN] : slow request
30.541026 seconds old, received at 2014-09-12 14:25:22.664538:
osd_op(client.7448.1:2460291 100000632b7.00000000 [write 0~691
[1@-1],startsync 0~0] 1.994248a8 snapc 1=[] ondisk+write e116)
currently reached pg
2014-09-12 17:52:40.503917 7f34e8ed2700 0 -- 10.141.8.181:6809/29664
>> 10.141.8.181:6847/62389 pipe(0x247ce040 sd=327 :6809 s=0 pgs=0
cs=0 l=1 c=0x1bc8b9c0).accept replacing existing (lossy) channel (new
one lossy=1)
I *think* the missing log output is related to another issue I just
found (http://tracker.ceph.com/issues/9470).
So I can't determine the original problem from the log files.
Is there any other way to find out what caused 30 OSDs to crash?
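One thing I still want to rule out is the kernel OOM killer taking the OSD processes down, since that would also explain the logs stopping abruptly. A sketch of the check I have in mind for each OSD host (the exact kernel message wording varies by kernel version, so this is an assumption, demonstrated against a made-up sample line):

```shell
# Count OOM-killer traces in the kernel log on an OSD host.
# In practice: dmesg | grep -ciE 'out of memory|oom-killer|killed process'
# Demonstrated against a hypothetical sample kernel message:
echo 'Out of memory: Kill process 29664 (ceph-osd) score 867 or sacrifice child' \
  | grep -ciE 'out of memory|oom-killer|killed process'
```

A non-zero count would point at memory pressure rather than a Ceph bug.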
Thanks!!
Kenneth
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com