Any other suggestions as to why several osds are going down on Giant and causing IO to stall? This was not happening on Firefly.
Thanks

----- Original Message -----

On 11/14/2014 01:50 PM, Andrei Mikhailovsky wrote:
> Wido, I've not done any changes from the default settings. There are no firewalls between the ceph cluster members and I do not see a great deal of network-related errors either. There is a tiny amount of TX errors on the network interface, which accounts for 0.0001% of the total packets.
>
> Regarding the debug_ms, do I set this on the ceph-mon side, on all osds, or both?
>
> On the OSDs, it might explain why the heartbeats are failing.

Wido, I have set debug_ms to 10 on all osds, however the disk usage has quickly increased and consumed about 5GB of space in a matter of 10 minutes. I am no longer able to keep this debug level as I will quickly run out of space on the partition. From what I can see, 5 osds have generated log files of over 1GB each, while the rest have only generated about 200-300MB. Is there anything in particular I should be looking for to determine the cause of the heartbeat failures? (A sketch of a more targeted approach is appended after the quoted thread below.)

Thanks

> From: "Wido den Hollander" <w...@42on.com>
> To: ceph-users@lists.ceph.com
> Sent: Friday, 14 November, 2014 12:30:02 PM
> Subject: Re: [ceph-users] Giant osd problems - loss of IO
>
> On 14-11-14 13:26, Andrei Mikhailovsky wrote:
>> Hello guys,
>>
>> Since upgrading my cluster to Giant from the previous stable release I started having massive problems with client IO. I did the upgrade 2 weeks ago and since then the IO on my ceph cluster has been unavailable 3 times already.
>>
>> Quick info on my storage cluster - 3 mons, 2 osd servers, 17 osds, 4 ssds for journals. All running the Giant release from the ceph repos on Ubuntu 12.04 with the latest updates.
>>
>> Every time I experience a very similar problem. Two out of 17 osds are shown as down even though the corresponding ceph-osd processes are running. Somehow, the downing of two osds brings down the IO in the cluster. This did not happen before upgrading to Giant. It seems that every time different osds go down, but it always happens on the same osd server. The latest one happened tonight.
>> The ceph -s showed the following:
>>
>> root@arh-ibstorage1-ib:~# ceph -s
>>     cluster 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
>>      health HEALTH_WARN 1001 pgs degraded; 150 pgs down; 150 pgs peering; 1 pgs recovering; 41 pgs recovery_wait; 1001 pgs stuck degraded; 150 pgs stuck inactive; 1151 pgs stuck unclean; 967 pgs stuck undersized; 967 pgs undersized; 203 requests are blocked > 32 sec; recovery 661954/5994334 objects degraded (11.043%); 10/2997167 unfound (0.000%); 2/17 in osds are down; noout flag(s) set
>>      monmap e11: 3 mons at {arh-cloud13-ib=192.168.168.13:6789/0,arh-ibstorage1-ib=192.168.168.200:6789/0,arh-ibstorage2-ib=192.168.168.201:6789/0}, election epoch 66832, quorum 0,1,2 arh-cloud13-ib,arh-ibstorage1-ib,arh-ibstorage2-ib
>>      mdsmap e2709: 1/1/1 up {0=arh-ibstorage2-ib=up:active}, 1 up:standby
>>      osdmap e64374: 17 osds: 15 up, 17 in
>>             flags noout
>>       pgmap v39540041: 4412 pgs, 17 pools, 10245 GB data, 2926 kobjects
>>             20570 GB used, 26853 GB / 47423 GB avail
>>             661954/5994334 objects degraded (11.043%); 10/2997167 unfound (0.000%)
>>                    1 active+recovering+undersized+degraded
>>                 3261 active+clean
>>                   34 active+recovery_wait+degraded
>>                  959 active+undersized+degraded
>>                    7 active+recovery_wait+undersized+degraded
>>                  150 down+peering
>>   client io 0 B/s rd, 13805 B/s wr, 4 op/s
>>
>> "ceph osd tree" shows:
>>
>> root@arh-ibstorage1-ib:~# ceph osd tree
>> # id   weight  type name                          up/down  reweight
>> -10    1.8     root uk-ssd
>> -9     1.8       datacenter ldex-ssd
>> -12    1.8         room ldex-dc3-ssd
>> -14    1.8           row row-a-ssd
>> -8     1.8             rack ldex-rack-a5-ssd
>> -6     0.9               host arh-ibstorage1-ib-ssd
>> -7     0.9               host arh-ibstorage2-ib-ssd
>> -1     46.41   root uk
>> -5     46.41     datacenter ldex
>> -11    46.41       room ldex-dc3
>> -13    46.41         row row-a
>> -4     46.41           rack ldex-rack-a5
>> -2     24.57             host arh-ibstorage1-ib
>> 1      2.73                osd.1                  up       1
>> 3      2.73                osd.3                  up       1
>> 5      2.73                osd.5                  up       1
>> 6      2.73                osd.6                  up       1
>> 7      2.73                osd.7                  up       1
>> 8      2.73                osd.8                  up       1
>> 4      2.73                osd.4                  up       1
>> 0      2.73                osd.0                  up       1
>> 2      2.73                osd.2                  up       1
>> -3     21.84             host arh-ibstorage2-ib
>> 9      2.73                osd.9                  up       1
>> 10     2.73                osd.10                 up       1
>> 11     2.73                osd.11                 up       1
>> 12     2.73                osd.12                 down     1
>> 13     2.73                osd.13                 up       1
>> 14     2.73                osd.14                 up       1
>> 15     2.73                osd.15                 up       1
>> 16     2.73                osd.16                 down     1
>>
>> From the ceph -s output it seems that some PGs were down and peering. There was no change in the ceph -s output for over an hour. The client IO activity ground to a halt, all vms became unresponsive and eventually crashed. The cluster was unusable until I manually stopped osd.12 and osd.16 and restarted them again. This solved the problem.
>>
>> Previously this happened about four days ago with osds 11 and 15, and it also happened about 10 days ago with different osds. So, having lost the storage IO three times in the last two weeks is a big deal.
>>
>> Looking at the ceph logs I've discovered that since performing the upgrade I have seen far more messages reporting osds being down. Different osds are reported as being down and they come back up after about 5-30 seconds, perform recovery, and the cluster is back to normal (apart from the three times which actually brought the cluster IO down).
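For a state like the one above (150 PGs down+peering and 203 blocked requests), something along the following lines is a quick way to see which PGs are stuck and which osds they are waiting on; the PG id in the last command is only a placeholder:

    # list the stuck PGs and the requests that are blocked
    ceph health detail
    # show the inactive (down/peering) PGs together with their acting osds
    ceph pg dump_stuck inactive
    # query one of the listed PGs (5.f2 is just an example id); the
    # recovery_state section shows what peering is waiting on
    ceph pg 5.f2 query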
>> Furthermore, about a few minutes before osd.12 and osd.16 went down, I've seen 2 PGs being deep scrubbed while osd.7 was reported down by 3 peers:
>>
>> 2014-11-13 20:55:36.608502 mon.0 192.168.168.13:6789/0 1210074 : cluster [INF] pgmap v39509396: 4412 pgs: 4410 active+clean, 2 active+clean+scrubbing+deep; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 10005 kB/s rd, 3486 kB/s wr, 310 op/s
>> 2014-11-13 20:56:10.056933 mon.0 192.168.168.13:6789/0 1210209 : cluster [INF] pgmap v39509409: 4412 pgs: 3380 active+clean, 2 active+clean+scrubbing+deep, 1030 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 112 kB/s rd, 122 kB/s wr, 37 op/s; 662204/5994308 objects degraded (11.047%)
>> 2014-11-13 20:56:10.073317 mon.0 192.168.168.13:6789/0 1210210 : cluster [INF] osd.7 192.168.168.200:6838/26321 boot
>> 2014-11-13 20:56:10.073461 mon.0 192.168.168.13:6789/0 1210211 : cluster [INF] osd.3 192.168.168.200:6818/26170 boot
>> 2014-11-13 20:56:10.074415 mon.0 192.168.168.13:6789/0 1210212 : cluster [INF] osdmap e64099: 17 osds: 17 up, 17 in
>> 2014-11-13 20:56:10.109650 mon.0 192.168.168.13:6789/0 1210213 : cluster [INF] pgmap v39509410: 4412 pgs: 3380 active+clean, 2 active+clean+scrubbing+deep, 1030 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 97137 B/s rd, 220 kB/s wr, 36 op/s; 662204/5994308 objects degraded (11.047%)
>> 2014-11-13 20:56:11.082817 mon.0 192.168.168.13:6789/0 1210214 : cluster [INF] osdmap e64100: 17 osds: 17 up, 17 in
>> 2014-11-13 20:56:11.741135 mon.1 192.168.168.200:6789/0 520573 : cluster [INF] mon.arh-ibstorage1-ib calling new monitor election
>> 2014-11-13 20:56:11.744641 mon.2 192.168.168.201:6789/0 521421 : cluster [INF] mon.arh-ibstorage2-ib calling new monitor election
>> 2014-11-13 20:56:11.745988 mon.0 192.168.168.13:6789/0 1210248 : cluster [INF] mon.arh-cloud13-ib calling new monitor election
>> 2014-11-13 20:56:11.754388 mon.0 192.168.168.13:6789/0 1210249 : cluster [INF] mon.arh-cloud13-ib@0 won leader election with quorum 0,1,2
>> 2014-11-13 20:56:11.763975 mon.0 192.168.168.13:6789/0 1210250 : cluster [INF] monmap e11: 3 mons at {arh-cloud13-ib=192.168.168.13:6789/0,arh-ibstorage1-ib=192.168.168.200:6789/0,arh-ibstorage2-ib=192.168.168.201:6789/0}
>> 2014-11-13 20:56:11.764069 mon.0 192.168.168.13:6789/0 1210251 : cluster [INF] pgmap v39509411: 4412 pgs: 3380 active+clean, 2 active+clean+scrubbing+deep, 1030 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 662204/5994308 objects degraded (11.047%)
>> 2014-11-13 20:56:11.764147 mon.0 192.168.168.13:6789/0 1210252 : cluster [INF] mdsmap e2705: 1/1/1 up {0=arh-ibstorage2-ib=up:active}, 1 up:standby
>> 2014-11-13 20:56:11.764283 mon.0 192.168.168.13:6789/0 1210253 : cluster [INF] osdmap e64100: 17 osds: 17 up, 17 in
>> 2014-11-13 20:56:11.764859 mon.0 192.168.168.13:6789/0 1210257 : cluster [INF] osd.7 192.168.168.200:6838/26321 failed (3 reports from 3 peers after 32.005619 >= grace 25.962628)
>> 2014-11-13 20:56:12.721171 mon.0 192.168.168.13:6789/0 1210262 : cluster [INF] osdmap e64101: 17 osds: 16 up, 17 in
>> 2014-11-13 20:56:13.294162 mon.0 192.168.168.13:6789/0 1210263 : cluster [INF] pgmap v39509413: 4412 pgs: 3380 active+clean, 2 active+clean+scrubbing+deep, 1030 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 1150 kB/s rd, 561 kB/s wr, 193 op/s; 662204/5994308 objects degraded (11.047%)
>> 2014-11-13 20:56:14.325109 mon.0 192.168.168.13:6789/0 1210264 : cluster [INF] osdmap e64102: 17 osds: 16 up, 17 in
>> 2014-11-13 20:56:14.357819 mon.0 192.168.168.13:6789/0 1210265 : cluster [INF] pgmap v39509414: 4412 pgs: 3380 active+clean, 2 active+clean+scrubbing+deep, 1030 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 662204/5994308 objects degraded (11.047%)
>> 2014-11-13 20:56:15.442751 mon.0 192.168.168.13:6789/0 1210266 : cluster [INF] pgmap v39509415: 4412 pgs: 3685 active+clean, 2 active+clean+scrubbing+deep, 18 active+recovery_wait+degraded, 707 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 28730 kB/s rd, 1087 kB/s wr, 590 op/s; 466784/5994308 objects degraded (7.787%); 499 MB/s, 3 keys/s, 237 objects/s recovering
>> 2014-11-13 20:56:16.364658 mon.0 192.168.168.13:6789/0 1210268 : cluster [INF] osd.7 192.168.168.200:6838/26321 boot
>> 2014-11-13 20:56:16.366310 mon.0 192.168.168.13:6789/0 1210269 : cluster [INF] osdmap e64103: 17 osds: 17 up, 17 in
>> 2014-11-13 20:56:16.380120 mon.0 192.168.168.13:6789/0 1210270 : cluster [INF] pgmap v39509416: 4412 pgs: 3685 active+clean, 2 active+clean+scrubbing+deep, 18 active+recovery_wait+degraded, 707 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 33165 kB/s rd, 1255 kB/s wr, 682 op/s; 466784/5994308 objects degraded (7.787%); 576 MB/s, 3 keys/s, 274 objects/s recovering
>> 2014-11-13 20:56:17.400859 mon.0 192.168.168.13:6789/0 1210271 : cluster [INF] osdmap e64104: 17 osds: 17 up, 17 in
>>
>> So, for the next 20 seconds it was scrubbing, until ceph reported that osd.12 and osd.16 had been reported down:
>>
>> 2014-11-13 20:56:36.712346 mon.0 192.168.168.13:6789/0 1210327 : cluster [INF] osd.12 192.168.168.201:6810/10278 failed (11 reports from 9 peers after 24.080479 >= grace 23.026948)
>> 2014-11-13 20:56:36.771124 mon.0 192.168.168.13:6789/0 1210330 : cluster [INF] osd.15 192.168.168.201:6801/4482 failed (10 reports from 8 peers after 24.118287 >= grace 23.285583)
>> 2014-11-13 20:56:37.920306 mon.0 192.168.168.13:6789/0 1210345 : cluster [INF] pgmap v39509434: 4412 pgs: 122 peering, 4085 active+clean, 2 active+clean+scrubbing+deep, 2 active+degraded, 16 active+recovery_wait+degraded, 185 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 13943 B/s rd, 506 kB/s wr, 86 op/s; 118552/5994302 objects degraded (1.978%); 666 kB/s, 0 objects/s recovering
>> 2014-11-13 20:56:37.963713 mon.0 192.168.168.13:6789/0 1210346 : cluster [INF] osdmap e64106: 17 osds: 15 up, 17 in
>> 2014-11-13 20:56:38.458852 mon.0 192.168.168.13:6789/0 1210347 : cluster [INF] pgmap v39509435: 4412 pgs: 377 stale+active+clean, 122 peering, 3708 active+clean, 119 stale+active+undersized+degraded, 2 active+clean+scrubbing+deep, 2 active+degraded, 16 active+recovery_wait+degraded, 66 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 20510 B/s rd, 430 kB/s wr, 76 op/s; 118552/5994302 objects degraded (1.978%)
>> 2014-11-13 20:56:39.195535 mon.0 192.168.168.13:6789/0 1210349 : cluster [INF] osdmap e64107: 17 osds: 15 up, 17 in
>> 2014-11-13 20:56:39.231384 mon.0 192.168.168.13:6789/0 1210350 : cluster [INF] pgmap v39509436: 4412 pgs: 377 stale+active+clean, 122 peering, 3708 active+clean, 119 stale+active+undersized+degraded, 2 active+clean+scrubbing+deep, 2 active+degraded, 16 active+recovery_wait+degraded, 66 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 118552/5994302 objects degraded (1.978%)
>> 2014-11-13 20:56:41.883606 mon.0 192.168.168.13:6789/0 1210352 : cluster [INF] pgmap v39509437: 4412 pgs: 377 stale+active+clean, 122 peering, 3708 active+clean, 119 stale+active+undersized+degraded, 2 active+clean+scrubbing+deep, 1 active+degraded, 17 active+recovery_wait+degraded, 66 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 14341 B/s rd, 144 kB/s wr, 5 op/s; 118552/5994302 objects degraded (1.978%)
>> 2014-11-13 20:56:43.617030 mon.0 192.168.168.13:6789/0 1210359 : cluster [INF] pgmap v39509438: 4412 pgs: 377 stale+active+clean, 122 peering, 3708 active+clean, 119 stale+active+undersized+degraded, 2 active+clean+scrubbing+deep, 1 active+degraded, 17 active+recovery_wait+degraded, 66 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 718 kB/s rd, 215 kB/s wr, 46 op/s; 118552/5994302 objects degraded (1.978%)
>> 2014-11-13 20:56:45.408475 mon.0 192.168.168.13:6789/0 1210365 : cluster [INF] pgmap v39509439: 4412 pgs: 46 peering, 3218 active+clean, 1 active+clean+scrubbing+deep, 1 active+degraded, 12 active+recovery_wait+degraded, 1007 active+undersized+degraded, 1 active+recovery_wait+undersized+degraded, 126 down+peering; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 1853 kB/s wr, 300 op/s; 696740/5994304 objects degraded (11.623%); 8/2997152 unfound (0.000%)
>> 2014-11-13 20:56:46.517294 mon.0 192.168.168.13:6789/0 1210366 : cluster [INF] pgmap v39509440: 4412 pgs: 46 peering, 3222 active+clean, 1 active+clean+scrubbing+deep, 9 active+recovery_wait+degraded, 1007 active+undersized+degraded, 1 active+recovery_wait+undersized+degraded, 126 down+peering; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 2242 kB/s wr, 358 op/s; 696737/5994304 objects degraded (11.623%); 8/2997152 unfound (0.000%)
>>
>> From this point on the client IO went down from an average of about 500 op/s to 10-20 op/s. All vms started having hung tasks shortly after that and eventually most of them crashed. A few more seconds passed and the status changed to:
>>
>> 2014-11-13 20:59:19.857378 mon.0 192.168.168.13:6789/0 1210590 : cluster [INF] pgmap v39509547: 4412 pgs: 1 active+recovering+undersized+degraded, 3268 active+clean, 19 active+degraded, 7 active+recovery_wait+degraded, 959 active+undersized+degraded, 7 active+recovery_wait+undersized+degraded, 150 down+peering, 1 active+recovering+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 0 B/s rd, 546 kB/s wr, 103 op/s; 661943/5994302 objects degraded (11.043%); 10/2997151 unfound (0.000%); 53620 kB/s, 1 keys/s, 13 objects/s recovering
>>
>> And it pretty much stayed like that for over an hour without changing much. The IO was still unavailable.
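When an osd is flagged down while its ceph-osd process is still running, the daemon's admin socket can show whether it is actually wedged on in-flight requests. A rough check, run on the node hosting the osd (osd.12 is used here purely as an example), might look like this:

    # requests currently stuck inside the daemon
    ceph daemon osd.12 dump_ops_in_flight
    # recently completed slow requests, with per-stage timings
    ceph daemon osd.12 dump_historic_ops
    # commit/apply latency for every osd, run from any node
    ceph osd perf

If ops sit in flight for minutes while the process stays alive, that would match the picture described here, where IO only returned once the two daemons were restarted.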
>> Looking at the logs of the osds that were shown down, I don't really see much apart from some slow requests prior to the osds being reported down:
>>
>> 2014-11-13 20:56:10.315696 7fd9645be700 0 log_channel(default) log [WRN] : slow request 31.043640 seconds old, received at 2014-11-13 20:55:39.271942: osd_op(client.46407343.0:3839273 rbd_data.1f0a392ae8944a.0000000000000621 [set-alloc-hint object_size 4194304 write_size 4194304,write 3526656~4096] 5.9fe0f0f2 ack+ondisk+write e64096) currently waiting for subops from 1
>> 2014-11-13 20:56:10.467380 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.4 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.467379)
>> 2014-11-13 20:56:10.467389 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.5 since back 2014-11-13 20:55:38.259184 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.467379)
>> 2014-11-13 20:56:10.467409 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.6 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.467379)
>> 2014-11-13 20:56:10.467427 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.8 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.467379)
>> 2014-11-13 20:56:10.967963 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.4 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.967962)
>> 2014-11-13 20:56:10.967971 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.5 since back 2014-11-13 20:55:38.259184 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.967962)
>> 2014-11-13 20:56:10.967978 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.6 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.967962)
>> 2014-11-13 20:56:10.967984 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.8 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.967962)
>> 2014-11-13 20:56:11.317046 7fd9645be700 -1 osd.12 64100 heartbeat_check: no reply from osd.4 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:51.317043)
>> 2014-11-13 20:56:11.317057 7fd9645be700 -1 osd.12 64100 heartbeat_check: no reply from osd.5 since back 2014-11-13 20:55:38.259184 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:51.317043)
>> 2014-11-13 20:56:11.317064 7fd9645be700 -1 osd.12 64100 heartbeat_check: no reply from osd.6 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:56:10.967590 (cutoff 2014-11-13 20:55:51.317043)
>> 2014-11-13 20:56:11.317070 7fd9645be700 -1 osd.12 64100 heartbeat_check: no reply from osd.8 since back 2014-11-13 20:55:38.259184 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:51.317043)
>> 2014-11-13 20:56:11.317167 7fd9645be700 0 log_channel(default) log [WRN] : 17 slow requests, 7 included below; oldest blocked for > 32.285501 secs
>> 2014-11-13 20:56:11.317173 7fd9645be700 0 log_channel(default) log [WRN] : slow request 32.285501 seconds old, received at 2014-11-13 20:55:39.031582: osd_op(client.46379104.0:182402 rbd_data.2c26ec1333ab105.000000000000015a [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 3956736~1024] 5.e9450c8d ack+ondisk+write e64096) currently waiting for subops from 4
>> 2014-11-13 20:56:11.317179 7fd9645be700 0 log_channel(default) log [WRN] : slow request 32.047721 seconds old, received at 2014-11-13 20:55:39.269362: osd_op(client.46407343.0:3839258 rbd_data.1f0a392ae8944a.0000000000000621 [set-alloc-hint object_size 4194304 write_size 4194304,write 1568768~8192] 5.9fe0f0f2 ack+ondisk+write e64096) currently waiting for subops from 1
>> 2014-11-13 20:56:11.317183 7fd9645be700 0 log_channel(default) log [WRN] : slow request 32.047396 seconds old, received at 2014-11-13 20:55:39.269687: osd_op(client.46407343.0:3839259 rbd_data.1f0a392ae8944a.0000000000000621 [set-alloc-hint object_size 4194304 write_size 4194304,write 1773568~4096] 5.9fe0f0f2 ack+ondisk+write e64096) currently waiting for subops from 1
>> 2014-11-13 20:56:11.317187 7fd9645be700 0 log_channel(default) log [WRN] : slow request 32.047229 seconds old, received at 2014-11-13 20:55:39.269854: osd_op(client.46407343.0:3839260 rbd_data.1f0a392ae8944a.0000000000000621 [set-alloc-hint object_size 4194304 write_size 4194304,write 1921024~4096] 5.9fe0f0f2 ack+ondisk+write e64096) currently waiting for subops from 1
>> 2014-11-13 20:56:11.317232 7fd9645be700 0 log_channel(default) log [WRN] : slow request 32.047069 seconds old, received at 2014-11-13 20:55:39.270014: osd_op(client.46407343.0:3839261 rbd_data.1f0a392ae8944a.0000000000000621 [set-alloc-hint object_size 4194304 write_size 4194304,write 1949696~4096] 5.9fe0f0f2 ack+ondisk+write e64096) currently waiting for subops from 1
>> 2014-11-13 20:56:37.954923 7fd92c055700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.200:6820/56026102 pipe(0x2b369b00 sd=381 :6812 s=2 pgs=205156 cs=1 l=0 c=0x1b29d420).fault with nothing to send, going to standby
>> 2014-11-13 20:56:37.955547 7fd92bb50700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.200:6810/72026170 pipe(0x2ad35580 sd=238 :6812 s=2 pgs=141844 cs=1 l=0 c=0xf33c420).fault with nothing to send, going to standby
>> 2014-11-13 20:56:37.956392 7fd935deb700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.200:6807/70026130 pipe(0x2e2268c0 sd=408 :6812 s=2 pgs=278536 cs=1 l=0 c=0x142f7420).fault with nothing to send, going to standby
>> 2014-11-13 20:56:37.956695 7fd93e86d700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.200:6849/1026289 pipe(0x29b4c840 sd=129 :41140 s=2 pgs=591 cs=1 l=0 c=0x29b58b00).fault with nothing to send, going to standby
>> 2014-11-13 20:56:37.958245 7fd93e56a700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.200:6803/1026306 pipe(0x29bb7600 sd=154 :43005 s=2 pgs=117 cs=1 l=0 c=0x29be4840).fault with nothing to send, going to standby
>> 2014-11-13 20:56:37.969506 7fd93e368700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.200:6824/26238 pipe(0x29c98600 sd=146 :60953 s=2 pgs=148 cs=1 l=0 c=0x29cd0840).fault with nothing to send, going to standby
>> 2014-11-13 20:56:37.969620 7fd92a439700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.201:6848/26784 pipe(0x2c5fcb00 sd=231 :42835 s=2 pgs=151539 cs=7 l=0 c=0x13c32580).fault with nothing to send, going to standby
>> 2014-11-13 20:56:37.971391 7fd925df3700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.200:6831/62026114 pipe(0x2d93edc0 sd=21 :6812 s=2 pgs=196755 cs=1 l=0 c=0x173c82c0).fault with nothing to send, going to standby
>> 2014-11-13 20:56:37.980660 7fd93e267700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.200:6821/18026333 pipe(0x2a13c2c0 sd=122 :41160 s=2 pgs=37077 cs=1 l=0 c=0x2a9ffdc0).fault with nothing to send, going to standby
>>
>> This was the last log entry for osd.12 until it was restarted hours later. The logs of osd.16 show a very similar, pretty much identical, picture.
>>
>> Does anyone have any idea what is causing these problems with the Giant release? How do I fix them, as it is currently very unstable for a production-ready release?
>>
> The heartbeat cutoffs are odd. Are you sure there are no firewalls in place there or any other TCP settings?
>
> Did you change any settings regarding the heartbeats inside Ceph or are they all at their defaults?
>
> Have you tried to run with debug_ms set to 10, for example? That might show you more.
>
> Wido
>
>> Many thanks for your help
>>
>> Andrei
>
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
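Since debug_ms = 10 on all 17 osds fills the log partition within minutes, one possible compromise (a sketch only, not tested on this cluster) is to raise the level just on the osds that keep getting marked down, and only until the next failure has been captured:

    # raise message-level debugging on the two suspect osds only
    ceph tell osd.12 injectargs '--debug_ms 10'
    ceph tell osd.16 injectargs '--debug_ms 10'

    # ...wait for the next heartbeat failure, then turn it back down...
    ceph tell osd.12 injectargs '--debug_ms 0/0'
    ceph tell osd.16 injectargs '--debug_ms 0/0'

    # entries worth searching for in the default log location
    grep -E 'heartbeat_check|wrongly marked me down|fault' /var/log/ceph/ceph-osd.12.log

The heartbeat_check lines already captured above (no reply on either the back or the front address for roughly 35 seconds) are the entries to correlate with the messenger "fault ... going to standby" lines once debug_ms is raised.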
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com