I really not sure why monitor mark the OSD to down state

 "Monitor daemon marked osd.9 down, but it is still running"

2018-01-30 16:07 GMT+08:00 blackpiglet J. <blackpigletbr...@gmail.com>:

> Guys,
>
> We had set up a five nodes Ceph cluster. Four are OSD servers and the
> other one is MON and MGR.
> Recently, during RGW stability test, RGW default
> pool: default.rgw.buckets.data is accidentally written full. As a result,
> RGW is stuck. We don't know the exact steps to recover, then we deleted all
> default RGW pools directly. After RGW restarted, all pools are back.
>
> Then I found 4 OSD processes are down. I am not whether this is related to
> our RGW pool delete operation.
>
> This a the log I think may be useful. The full version is in the
> attachment.
>
> Any help is appreciated. Thanks in advance.
>
>
> 2018-01-29 20:17:48.896544 7fc111122700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fc0fb912700' had timed out after 15
> 2018-01-29 20:17:52.360069 7fc110183700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fc0fb912700' had timed out after 15
> 2018-01-29 20:17:57.360192 7fc110183700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fc0fb912700' had timed out after 15
> 2018-01-29 20:18:02.360315 7fc110183700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fc0fb912700' had timed out after 15
> 2018-01-29 20:18:06.029867 7fc0fb912700  1 heartbeat_map reset_timeout
> 'OSD::osd_op_tp thread 0x7fc0fb912700' had timed out after 15
> 2018-01-29 20:18:06.030576 7fc0fc914700  1 osd.9 pg_epoch: 2481 pg[6.18( v
> 2276'10 (0'0,2276'10] local-lis/les=2479/2480 n=3 ec=704/704 lis/c
> 2479/2470 les/c/f 2480/2471/2312 2481/2481/2476) [9,91,51] r=0 lpr=2481
> pi=[2470,2481)/2 luod=
> 0'0 crt=2276'10 lcod 2151'9 mlcod 0'0 active] start_peering_interval up
> [9,51] -> [9,91,51], acting [9,51] -> [9,91,51], acting_primary 9 -> 9,
> up_primary 9 -> 9, role 0 -> 0, features acting 2305244844532236283
> upacting 230524484453223
> 6283
> 2018-01-29 20:18:06.030774 7fc0fc914700  1 osd.9 pg_epoch: 2481 pg[6.18( v
> 2276'10 (0'0,2276'10] local-lis/les=2479/2480 n=3 ec=704/704 lis/c
> 2479/2470 les/c/f 2480/2471/2312 2481/2481/2476) [9,91,51] r=0 lpr=2481
> pi=[2470,2481)/2 crt=2
> 276'10 lcod 2151'9 mlcod 0'0 unknown] state<Start>: transitioning to
> Primary
> 2018-01-29 20:18:06.031236 7fc0fc914700  1 osd.9 pg_epoch: 2481 pg[6.1d( v
> 2316'568 (0'0,2316'568] local-lis/les=2479/2480 n=2 ec=704/704 lis/c
> 2479/2470 les/c/f 2480/2471/2312 2481/2481/2481) [91,79,9] r=2 lpr=2481
> pi=[2470,2481)/2 luo
> d=0'0 crt=2316'568 lcod 2312'567 active] start_peering_interval up [79,9]
> -> [91,79,9], acting [79,9] -> [91,79,9], acting_primary 79 -> 91,
> up_primary 79 -> 91, role 1 -> 2, features acting 2305244844532236283
> upacting 2305244844532236
> 283
> 2018-01-29 20:18:06.031324 7fc0fc914700  1 osd.9 pg_epoch: 2481 pg[6.1d( v
> 2316'568 (0'0,2316'568] local-lis/les=2479/2480 n=2 ec=704/704 lis/c
> 2479/2470 les/c/f 2480/2471/2312 2481/2481/2481) [91,79,9] r=2 lpr=2481
> pi=[2470,2481)/2 crt
> =2316'568 lcod 2312'567 unknown NOTIFY] state<Start>: transitioning to
> Stray
> 2018-01-29 20:18:06.032167 7fc112124700  0 -- 10.1.248.1:6850/6003656 >>
> 10.1.248.4:6825/1003558 conn(0x556089bf4800 :6850 
> s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH
> pgs=5245 cs=1 l=0).handle_connect_reply connect got RESETSESSION
> 2018-01-29 20:18:06.032184 7fc111923700  0 -- 10.1.248.1:6850/6003656 >>
> 10.1.248.3:6827/3726 conn(0x5560c0c06000 :-1 
> s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH
> pgs=6734 cs=1 l=0).handle_connect_reply connect got RESETSESSION
> 2018-01-29 20:18:06.053899 7fc103121700  0 log_channel(cluster) log [WRN]
> : Monitor daemon marked osd.9 down, but it is still running
> 2018-01-29 20:18:06.053917 7fc103121700  0 log_channel(cluster) log [DBG]
> : map e2485 wrongly marked me down at e2483
> 2018-01-29 20:18:06.053922 7fc103121700  0 osd.9 2485 _committed_osd_maps
> marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds,
> shutting down
> 2018-01-29 20:18:06.053927 7fc103121700  1 osd.9 2485
> start_waiting_for_healthy
> 2018-01-29 20:18:06.064250 7fc0fc914700  1 osd.9 pg_epoch: 2483 pg[10.53(
> empty local-lis/les=2476/2477 n=0 ec=1815/1815 lis/c 2476/2476 les/c/f
> 2477/2477/2312 2483/2483/2483) [] r=-1 lpr=2483 pi=[2476,2483)/1 crt=0'0
> active] start_peer
> ing_interval up [9] -> [], acting [9] -> [], acting_primary 9 -> -1,
> up_primary 9 -> -1, role 0 -> -1, features acting 2305244844532236283
> upacting 2305244844532236283
> 2018-01-29 20:18:06.064258 7fc0fd115700  1 osd.9 pg_epoch: 2483 pg[3.16( v
> 2284'372 (0'0,2284'372] local-lis/les=2476/2477 n=2 ec=1758/130 lis/c
> 2476/2476 les/c/f 2477/2477/2312 2483/2483/1760) [33,85] r=-1 lpr=2483
> pi=[2476,2483)/1 luo
> d=0'0 crt=2284'372 lcod 2284'371 active] start_peering_interval up
> [33,9,85] -> [33,85], acting [33,9,85] -> [33,85], acting_primary 33 -> 33,
> up_primary 33 -> 33, role 1 -> -1, features acting 2305244844532236283
> upacting 2305244844532
> 236283
> 2018-01-29 20:18:06.064493 7fc0fc914700  1 osd.9 pg_epoch: 2485 pg[10.53(
> empty local-lis/les=2476/2477 n=0 ec=1815/1815 lis/c 2476/2476 les/c/f
> 2477/2477/2312 2483/2483/2483) [] r=-1 lpr=2483 pi=[2476,2483)/1 crt=0'0
> unknown NOTIFY] st
> ate<Start>: transitioning to Stray
> 2018-01-29 20:18:06.064545 7fc0fd115700  1 osd.9 pg_epoch: 2485 pg[3.16( v
> 2284'372 (0'0,2284'372] local-lis/les=2476/2477 n=2 ec=1758/130 lis/c
> 2476/2476 les/c/f 2477/2477/2312 2483/2483/1760) [33,85] r=-1 lpr=2483
> pi=[2476,2483)/1 crt
> =2284'372 lcod 2284'371 unknown NOTIFY] state<Start>: transitioning to
> Stray
> 2018-01-29 20:18:06.064886 7fc0fc914700  1 osd.9 pg_epoch: 2483 pg[10.3f(
> empty local-lis/les=2476/2477 n=0 ec=1815/1815 lis/c 2476/2476 les/c/f
> 2477/2477/2312 2483/2483/2483) [] r=-1 lpr=2483 pi=[2476,2483)/1 crt=0'0
> active] start_peer
> ing_interval up [9] -> [], acting [9] -> [], acting_primary 9 -> -1,
> up_primary 9 -> -1, role 0 -> -1, features acting 2305244844532236283
> upacting 2305244844532236283
> 2018-01-29 20:18:06.064972 7fc0fd115700  1 osd.9 pg_epoch: 2483 pg[6.42( v
> 2298'367 (0'0,2298'367] local-lis/les=2476/2477 n=1 ec=1753/704 lis/c
> 2476/2476 les/c/f 2477/2477/2312 2483/2483/2367) [24,84] r=-1 lpr=2483
> pi=[2476,2483)/1 luo
> d=0'0 crt=2298'367 lcod 2298'366 active] start_peering_interval up
> [24,84,9] -> [24,84], acting [24,84,9] -> [24,84], acting_primary 24 -> 24,
> up_primary 24 -> 24, role 2 -> -1, features acting 2305244844532236283
> upacting 2305244844532
> 236283
> 2018-01-29 20:18:06.065091 7fc0fc914700  1 osd.9 pg_epoch: 2485 pg[10.3f(
> empty local-lis/les=2476/2477 n=0 ec=1815/1815 lis/c 2476/2476 les/c/f
> 2477/2477/2312 2483/2483/2483) [] r=-1 lpr=2483 pi=[2476,2483)/1 crt=0'0
> unknown NOTIFY] st
> ate<Start>: transitioning to Stray
>
> 2018-01-29 20:18:06.072202 7fc0fc914700  1 osd.9 pg_epoch: 2483 pg[5.3c( v
> 2312'24 (0'0,2312'24] local-lis/les=2476/2477 n=1 ec=1748/700 lis/c
> 2476/2476 les/c/f 2477/2477/2312 2483/2483/2483) [32,66] r=-1 lpr=2483
> pi=[2476,2483)/1 luod=
> 0'0 crt=2312'24 lcod 2312'23 active] start_peering_interval up [9,32,66]
> -> [32,66], acting [9,32,66] -> [32,66], acting_primary 9 -> 32, up_primary
> 9 -> 32, role 0 -> -1, features acting 2305244844532236283 upacting
> 2305244844532236283
> 2018-01-29 20:18:06.072324 7fc0fc914700  1 osd.9 pg_epoch: 2485 pg[5.3c( v
> 2312'24 (0'0,2312'24] local-lis/les=2476/2477 n=1 ec=1748/700 lis/c
> 2476/2476 les/c/f 2477/2477/2312 2483/2483/2483) [32,66] r=-1 lpr=2483
> pi=[2476,2483)/1 crt=2
> 312'24 lcod 2312'23 unknown NOTIFY] state<Start>: transitioning to Stray
> 2018-01-29 20:18:06.072463 7fc103121700  0 osd.9 2485 _committed_osd_maps
> shutdown OSD via async signal
> 2018-01-29 20:18:06.072570 7fc0f4904700 -1 Fail to open '/proc/0/cmdline'
> error = (2) No such file or directory
> 2018-01-29 20:18:06.072597 7fc0f4904700 -1 received  signal: Interrupt
> from  PID: 0 task name: <unknown> UID: 0
> 2018-01-29 20:18:06.072603 7fc0f4904700 -1 osd.9 2485 *** Got signal
> Interrupt ***
> 2018-01-29 20:18:06.072609 7fc0f4904700  0 osd.9 2485 prepare_to_stop
> starting shutdown
> 2018-01-29 20:18:06.072614 7fc0f4904700 -1 osd.9 2485 shutdown
> 2018-01-29 20:18:06.429238 7fc10d936700  0 log_channel(cluster) log [WRN]
> : 2 slow requests, 2 included below; oldest blocked for > 55.682037 secs
> 2018-01-29 20:18:06.429251 7fc10d936700  0 log_channel(cluster) log [WRN]
> : slow request 55.682037 seconds old, received at 2018-01-29
> 20:17:10.747125: pg_notify((query:2482 sent:2482 6.18( v 2276'10
> (0'0,2276'10] local-lis/les=2473/247
> 4 n=3 ec=704/704 lis/c 2473/2470 les/c/f 2474/2471/2312
> 2481/2481/2476))=([2470,2480] intervals=([2473,2475] acting
> 51,91),([2479,2480] acting 9,51)) epoch 2482) currently wait for new map
> 2018-01-29 20:18:06.429265 7fc10d936700  0 log_channel(cluster) log [WRN]
> : slow request 55.130508 seconds old, received at 2018-01-29
> 20:17:11.298655: pg_notify((query:2482 sent:2482 6.18( v 2276'10
> (0'0,2276'10] local-lis/les=2479/248
> 0 n=3 ec=704/704 lis/c 2479/2470 les/c/f 2480/2471/2312
> 2481/2481/2476))=([2470,2480] intervals=([2473,2475] acting
> 51,91),([2479,2480] acting 9,51)) epoch 2482) currently wait for new map
> 2018-01-29 20:18:08.117736 7fc0f4904700  1 bluestore(/var/lib/ceph/osd/ceph-9)
> umount
> 2018-01-29 20:18:08.317882 7fc0f4904700  1 stupidalloc shutdown
> 2018-01-29 20:18:08.322660 7fc0f4904700  1 freelist shutdown
> 2018-01-29 20:18:08.322718 7fc0f4904700  4 rocksdb:
> [/build/ceph-12.2.2/src/rocksdb/db/db_impl.cc:217] Shutdown: canceling
> all background work
> 2018-01-29 20:18:08.339096 7fc0f4904700  4 rocksdb:
> [/build/ceph-12.2.2/src/rocksdb/db/db_impl.cc:343] Shutdown complete
> 2018-01-29 20:18:08.508354 7fc0f4904700  1 bluefs umount
> 2018-01-29 20:18:08.508555 7fc0f4904700  1 stupidalloc shutdown
> 2018-01-29 20:18:08.508571 7fc0f4904700  1 stupidalloc shutdown
> 2018-01-29 20:18:08.508573 7fc0f4904700  1 stupidalloc shutdown
> 2018-01-29 20:18:08.508718 7fc0f4904700  1 bdev(0x5560701c7440
> /dev/nvme0n1p19) close
> 2018-01-29 20:18:08.786803 7fc0f4904700  1 bdev(0x5560701c6fc0
> /dev/nvme0n1p20) close
> 2018-01-29 20:18:08.874747 7fc0f4904700  1 bdev(0x5560701c7200
> /var/lib/ceph/osd/ceph-9/block) close
> 2018-01-29 20:18:09.030750 7fc0f4904700  1 bdev(0x5560701c6d80
> /var/lib/ceph/osd/ceph-9/block) close
>
> BR,
> Bruce J.
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to