Hi,

We are running VM images on Ceph using RBD. We are seeing a problem where
one of our VMs hangs because its IO never completes. iostat inside the VM
shows IO stuck in the queue, and disk utilisation for the Ceph-based
vdisks sits at 100%.
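
For completeness, this is roughly how the stall looks from inside the guest
(a minimal sketch; the device name vdb is just an example and the exact
iostat column names depend on the sysstat version):

# inside the affected VM, watch the ceph-backed vdisk
iostat -x /dev/vdb 1
# during the freeze %util stays at 100 and the queue size stays non-zero,
# while r/s and w/s drop to (near) zero because no IO completes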


Upon investigation the problem appears to be in the messenger worker
(msgr-worker-0) thread of a single OSD. Restarting that OSD process fixes
it: the pending IO completes, the VM unfreezes, and life continues without
a problem until it happens again. Recurrence varies between 30 minutes and
more than 24 hours. The affected VMs are always in the same pool, but it is
not always the same VM, and the problem occurs with different VMs on
different hypervisors, on different Ceph nodes and with different OSDs.
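
In case it is useful, this is roughly how we spot the busy thread and clear
the condition (PID 3730 and OSD id 23 are taken from the example further
down; the systemd unit name assumes a standard packaged install):

# per-thread CPU usage of the suspect OSD process
top -H -p 3730            # or: pidstat -t -p 3730 1
# restarting just that OSD clears the hang
systemctl restart ceph-osd@23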


When the problem occurs, we see a sudden jump on the network from
<100 Mbit/s to >4 Gbit/s of continuous traffic. This traffic flows between
the hypervisor and a single OSD, in these cases always one of our HDD OSDs.
The traffic is not visible inside the VM, only on the hypervisor.
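
We observe this on the hypervisor with something like the following (the
interface name is just an example):

# on the hypervisor: per-connection bandwidth, numeric hosts and ports
iftop -i eth0 -nNP
# the top talker is the connection between the hypervisor and the OSD host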


If the client is rebooted, the problem is gone. If the OSD is restarted,
the problem is gone.

This started happening several times per day after we made several changes
at the same time:

   - added physical RAM to the Ceph nodes
   - moved from fixed 'bluestore cache size hdd|ssd' and 'bluestore cache kv
   max' to 'bluestore cache autotune = 1' and 'osd memory target =
   20401094656' (see the verification sketch just below this list)
   - updated Ceph on the nodes from 12.2.8 to 12.2.11
   - updated the clients from 12.2.8 to 12.2.11
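
For reference, this is roughly how we verify what a running OSD actually
uses after that change (assuming the default admin socket location on the
Ceph node; osd.23 is the example OSD discussed below):

ceph daemon osd.23 config get osd_memory_target
ceph daemon osd.23 config get bluestore_cache_autotune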

We have since upgraded the Ceph nodes to 12.2.12, but that did not fix the
problem.


My request is that someone take a look at our findings below and give some
insight into whether this is a bug or a misconfiguration, or perhaps give
some idea of where to take a closer look.


Our setup is:

8 identical nodes, each with 4 HDDs (8 TB, 7k rpm) and 6 SSDs (4 TB). There
are a number of pools that, via crush rules, map to either the HDDs or the
SSDs. The pool that always has this problem is called 'prod_slow' and maps
to the HDDs.
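
The pool-to-device-class mapping can be confirmed like this:

ceph osd pool get prod_slow crush_rule
ceph osd crush rule dump     # shows which root/device class each rule selects
ceph osd crush tree          # shows the hdd/ssd class of every OSD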


I tracked down the OSD by looking at the port the client receives most of
its traffic from (all ~4 Gbit/s is read traffic: outgoing from Ceph,
incoming to the client).


root@ceph-03:~# netstat -tlpn|grep 6804
tcp        0      0 10.60.8.11:6804         0.0.0.0:*               LISTEN      3741/ceph-osd
tcp        0      0 10.60.6.11:6804         0.0.0.0:*               LISTEN      3730/ceph-osd

root@ceph-03:~# ps uafx|grep 3730
ceph        3730 44.3  6.3 20214604 16848524 ?   Ssl  Jun05 524:14 /usr/bin/ceph-osd -f --cluster ceph --id 23 --setuser ceph --setgroup ceph

root@ceph-03:~# ps -L -p3730
    PID     LWP TTY          TIME CMD
   3730    3730 ?        00:00:05 ceph-osd
   3730    3778 ?        00:00:00 log
   3730    3791 ?        05:19:49 msgr-worker-0
   3730    3802 ?        00:01:18 msgr-worker-1
   3730    3810 ?        00:01:25 msgr-worker-2
   3730    3842 ?        00:00:00 service
   3730    3845 ?        00:00:00 admin_socket
   3730    4015 ?        00:00:00 ceph-osd
   3730    4017 ?        00:00:00 safe_timer
   3730    4018 ?        00:00:03 safe_timer
   3730    4019 ?        00:00:00 safe_timer
   3730    4020 ?        00:00:00 safe_timer
   3730    4021 ?        00:00:14 bstore_aio
   3730    4023 ?        00:00:05 bstore_aio
   3730    4280 ?        00:00:32 rocksdb:bg0
   3730    4634 ?        00:00:00 dfin
   3730    4635 ?        00:00:12 finisher
   3730    4636 ?        00:00:51 bstore_kv_sync
   3730    4637 ?        00:00:12 bstore_kv_final
   3730    4638 ?        00:00:27 bstore_mempool
   3730    5803 ?        00:03:08 ms_dispatch
   3730    5804 ?        00:00:00 ms_local
   3730    5805 ?        00:00:00 ms_dispatch
   3730    5806 ?        00:00:00 ms_local
   3730    5807 ?        00:00:00 ms_dispatch
   3730    5808 ?        00:00:00 ms_local
   3730    5809 ?        00:00:00 ms_dispatch
   3730    5810 ?        00:00:00 ms_local
   3730    5811 ?        00:00:00 ms_dispatch
   3730    5812 ?        00:00:00 ms_local
   3730    5813 ?        00:00:00 ms_dispatch
   3730    5814 ?        00:00:00 ms_local
   3730    5815 ?        00:00:00 ms_dispatch
   3730    5816 ?        00:00:00 ms_local
   3730    5817 ?        00:00:00 safe_timer
   3730    5818 ?        00:00:00 fn_anonymous
   3730    5819 ?        00:00:02 safe_timer
   3730    5820 ?        00:00:00 tp_peering
   3730    5821 ?        00:00:00 tp_peering
   3730    5822 ?        00:00:00 fn_anonymous
   3730    5823 ?        00:00:00 fn_anonymous
   3730    5824 ?        00:00:00 safe_timer
   3730    5825 ?        00:00:00 safe_timer
   3730    5826 ?        00:00:00 safe_timer
   3730    5827 ?        00:00:00 safe_timer
   3730    5828 ?        00:00:00 osd_srv_agent
   3730    5829 ?        00:01:15 tp_osd_tp
   3730    5830 ?        00:01:27 tp_osd_tp
   3730    5831 ?        00:01:40 tp_osd_tp
   3730    5832 ?        00:00:49 tp_osd_tp
   3730    5833 ?        03:09:49 tp_osd_tp
   3730    5834 ?        00:00:00 tp_osd_remove
   3730    5835 ?        00:00:00 tp_osd_recovery
   3730    5836 ?        00:00:00 tp_osd_cmd
   3730    5837 ?        00:00:06 osd_srv_heartbt
   3730    5950 ?        00:00:00 signal_handler
   3730  140250 ?        00:00:02 rocksdb:bg0
root@ceph-03:~#
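
As a cross-check, the same OSD can also be identified from the cluster side
(the address and OSD id are the ones from the netstat/ps output above):

ceph osd dump | grep '10.60.6.11:6804'   # maps the listening address to an osd id
ceph osd find 23                         # shows the host / crush location of osd.23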


Stracing those threads shows a lot of detail, which I'll link at the bottom
of this post.

After increasing the debug logging on the OSD (and finally shutting the OSD
down), the log shows:

2019-06-06 09:25:43.647703 7fde9b1dc700  2 osd.23 12939 ms_handle_reset con 0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.649637 7fde9b1dc700  2 osd.23 12939 ms_handle_reset con 0x55dc9eb08800 session 0x55dbc1918a00
2019-06-06 09:25:43.651376 7fde9b1dc700  2 osd.23 12939 ms_handle_reset con 0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.653280 7fde9b1dc700  2 osd.23 12939 ms_handle_reset con 0x55dc9eb08800 session 0x55dbc1918a00
2019-06-06 09:25:43.655156 7fde9b1dc700  2 osd.23 12939 ms_handle_reset con 0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.656798 7fde9b1dc700  2 osd.23 12939 ms_handle_reset con 0x55dc9eb08800 session 0x55dbc1918a00
2019-06-06 09:25:43.658183 7fde9b1dc700  2 osd.23 12939 ms_handle_reset con 0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.659937 7fde9b1dc700  2 osd.23 12939 ms_handle_reset con 0x55dc9eb08800 session 0x55dbc1918a00
2019-06-06 09:25:43.661915 7fde9b1dc700  2 osd.23 12939 ms_handle_reset con 0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.664783 7fde9b1dc700  2 osd.23 12939 ms_handle_reset con 0x55dc9eb08800 session 0x55dbc1918a00


The 'ms_handle_reset con' messages appear about 500 times per second. This
continues until we restart the OSD. Once the OSD is up again, it starts
recovering the degraded PGs and all is fine.
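
For completeness, this is roughly how the debug level can be raised on the
running daemon (we used something along these lines; the exact subsystems
and levels may differ):

ceph daemon osd.23 config set debug_ms 1     # via the admin socket on the node
ceph daemon osd.23 config set debug_osd 10
# or remotely:
ceph tell osd.23 injectargs '--debug_ms 1 --debug_osd 10'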

I can see that the traffic is not actually coming from the hard disk
itself: when checking on the Ceph node, the HDD is doing very little. I
think this traffic is either served from the cache, or it is just messaging
between the client and Ceph, with the client retrying to get the data over
and over.
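
We checked the disk with something like the following (the device name is
just an example for the HDD behind osd.23); inspecting the in-flight ops on
the OSD admin socket might also show whether client ops are actually stuck
and being retried:

iostat -x /dev/sdh 1                     # on the ceph node: the HDD is near idle
ceph daemon osd.23 dump_ops_in_flight    # client ops currently stuck in the OSD
ceph daemon osd.23 dump_historic_ops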


A strace of the process during problem state is here:
https://gist.github.com/mvernimmen/5fe39d4722aa8dcff8294285460db021

ceph.conf is here:
https://gist.github.com/mvernimmen/91f21b5ab8274920017fa2de743a5743


This is where my knowledge of debugging starts to run thin. I'd love to
learn how to continue, but I need some help here. Thank you for your time!



Best regards,



-- 
Max Vernimmen
Senior DevOps Engineer
Textkernel

------------------------------------------------------------------------------
Textkernel BV, Nieuwendammerkade 26/a5, 1022 AB, Amsterdam, NL
------------------------------------------------------------------------------
