Hi Nick,

We saw the exact same problem yesterday after a network outage -- a few of our OSDs that got marked down stayed stuck down until we restarted their processes.
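In case it's useful, this is roughly how we spotted and kicked ours. Just a sketch, not a recipe; osd.43 below is only an example ID borrowed from your logs, substitute the affected OSD/host:

    # OSDs the cluster currently believes are down
    ceph osd tree down

    # On the affected host: the process is typically still alive
    # even though the cluster reports the OSD down
    systemctl status ceph-osd@43

    # Ask the daemon itself what it thinks it is doing (admin socket)
    ceph daemon osd.43 status

    # Restarting the process was the only thing that got ours to rejoin
    # (and yes, the restart took noticeably longer than usual for us too)
    systemctl restart ceph-osd@43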
-- Dan

On Wed, Jan 15, 2020 at 3:37 PM Nick Fisk <n...@fisk.me.uk> wrote:
> Hi All,
>
> Running 14.2.5, and currently experiencing some network blips isolated to
> a single rack, which is under investigation. However, it appears that
> following a network blip, random OSDs in unaffected racks sometimes do not
> recover from the incident and are left running in a zombie state. The OSDs
> appear to be running from a process perspective, but the cluster thinks
> they are down, and they will not rejoin the cluster until the OSD process
> is restarted, which incidentally takes a lot longer than usual (the
> systemctl command takes a couple of minutes to complete).
>
> If the OSD is left in this state, CPU and memory usage of the process
> appears to climb, but it never rejoins, at least for the several hours I
> have left them. Not exactly sure what the OSD is trying to do during this
> period. There's nothing in the logs during this hung state to indicate
> that anything is happening, but I will try to inject more verbose logging
> next time it occurs.
>
> Not sure if anybody has come across this before or has any ideas? In the
> past, as long as OSDs have been running they have always rejoined
> following any network issues.
>
> Nick
>
> Sample from OSD and cluster logs below. The blip happened at 12:06; I
> restarted the OSD at 12:26.
>
> OSD logs from the OSD that hung (note this OSD was not directly affected
> by the network outage):
> 2020-01-15 12:06:32.234 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6862 osd.52 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6875 osd.53 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::13]:6894 osd.54 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:33.194 7f41a1023700 -1 osd.43 2342991 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.034 7f419480a700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.43 down, but it is still running
> 2020-01-15 12:06:34.034 7f419480a700 0 log_channel(cluster) log [DBG] : map e2342992 wrongly marked me down at e2342992
> 2020-01-15 12:06:34.034 7f419480a700 1 osd.43 2342992 start_waiting_for_healthy
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6854 osd.49 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6834 osd.51 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6862 osd.52 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6875 osd.53 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::13]:6894 osd.54 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
> 2020-01-15 12:06:34.198 7f41a1023700 -1 osd.43 2342992 heartbeat_check: no reply from [*:*:*:5::14]:6838 osd.71 ever on either front or back, first ping sent 2020-01-15 12:06:11.411216 (oldest deadline 2020-01-15 12:06:31.411216)
>
> Cluster logs:
> 2020-01-15 12:06:09.740607 mon.mc-ceph-mon1 (mon.0) 531400 : cluster [DBG] osd.43 reported failed by osd.57
> 2020-01-15 12:06:09.945163 mon.mc-ceph-mon1 (mon.0) 531683 : cluster [DBG] osd.43 reported failed by osd.63
> 2020-01-15 12:06:09.945287 mon.mc-ceph-mon1 (mon.0) 531684 : cluster [INF] osd.43 failed (root=hdd,rack=8c-hdd,host=mc-8c-osd02-hdd) (2 reporters from different host after 21.006447 >= grace 20.118871)
> 2020-01-15 12:06:09.962867 mon.mc-ceph-mon1 (mon.0) 531775 : cluster [DBG] osd.43 reported failed by osd.49
> 2020-01-15 12:06:10.471837 mon.mc-ceph-mon1 (mon.0) 532231 : cluster [DBG] osd.43 reported failed by osd.190
> 2020-01-15 12:06:12.050928 mon.mc-ceph-mon1 (mon.0) 532421 : cluster [INF] osd.43 [v2:[*:*:*:5::12]:6808/1969300,v1:[*:*:*:5::12]:6809/1969300] boot
> 2020-01-15 12:06:11.192756 osd.43 (osd.43) 1675 : cluster [WRN] Monitor daemon marked osd.43 down, but it is still running
> 2020-01-15 12:06:11.192761 osd.43 (osd.43) 1676 : cluster [DBG] map e2342983 wrongly marked me down at e2342983
> 2020-01-15 12:06:32.240850 mon.mc-ceph-mon1 (mon.0) 533397 : cluster [DBG] osd.49 reported failed by osd.43
> 2020-01-15 12:06:32.241117 mon.mc-ceph-mon1 (mon.0) 533398 : cluster [DBG] osd.51 reported failed by osd.43
> 2020-01-15 12:06:32.241247 mon.mc-ceph-mon1 (mon.0) 533399 : cluster [DBG] osd.52 reported failed by osd.43
> 2020-01-15 12:06:32.241378 mon.mc-ceph-mon1 (mon.0) 533400 : cluster [DBG] osd.53 reported failed by osd.43
> 2020-01-15 12:06:32.241498 mon.mc-ceph-mon1 (mon.0) 533401 : cluster [DBG] osd.54 reported failed by osd.43
> 2020-01-15 12:06:32.241680 mon.mc-ceph-mon1 (mon.0) 533402 : cluster [DBG] osd.71 reported failed by osd.43
> 2020-01-15 12:06:33.374171 mon.mc-ceph-mon1 (mon.0) 533762 : cluster [DBG] osd.43 reported failed by osd.15
> 2020-01-15 12:06:33.713135 mon.mc-ceph-mon1 (mon.0) 534029 : cluster [DBG] osd.43 reported failed by osd.191
> 2020-01-15 12:06:33.713227 mon.mc-ceph-mon1 (mon.0) 534030 : cluster [INF] osd.43 failed (root=hdd,rack=8c-hdd,host=mc-8c-osd02-hdd) (2 reporters from different host after 20.002634 >= grace 20.001226)
> 2020-01-15 12:16:34.202137 mon.mc-ceph-mon1 (mon.0) 537464 : cluster [INF] Marking osd.43 out (has been down for 600 seconds)
> 2020-01-15 12:26:37.655911 mon.mc-ceph-mon1 (mon.0) 538134 : cluster [INF] osd.43 [v2:[*:*:*:5::12]:6802/1286742,v1:[*:*:*:5::12]:6808/1286742] boot
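PS, regarding injecting more verbose logging next time it happens: since the process itself is still up, the admin socket on the OSD host should still answer even while the cluster has the daemon marked down. Something along these lines ought to work (again just a sketch, with osd.43 as the example ID from your logs):

    # Turn up logging on the hung daemon without restarting it
    ceph daemon osd.43 config set debug_osd 20
    ceph daemon osd.43 config set debug_ms 1

    # See what the daemon thinks it is doing and whether anything is in flight
    ceph daemon osd.43 status
    ceph daemon osd.43 dump_ops_in_flight

    # Drop back to the defaults once you have captured enough
    ceph daemon osd.43 config set debug_osd 1/5
    ceph daemon osd.43 config set debug_ms 0/5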
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com