Hi,
We are running VM images on Ceph using RBD. We are seeing a problem where
one of our VMs hangs because IO never completes. iostat inside the VM
shows IO stuck in the queue, and disk utilisation for the Ceph-backed
vdisks is at 100%.
Upon investigation, the problem seems to be with the message worker
(msgr-worker-0) thread of one OSD. Restarting that OSD process fixes the
problem: the IO completes and the VM unfreezes, and life continues without
a problem until it happens again. Recurrence varies from 30 minutes to
more than 24 hours. The affected VMs are always in the same pool, but it
is not always the same VM, and the problem occurs with different VMs on
different hypervisors, and on different Ceph nodes with different OSDs.
When the problem occurs, we see on the network a sudden jump from
<100 Mbit/s to >4 Gbit/s of continuous traffic. This traffic flows between
the hypervisor and one OSD, in these cases always one of our HDD OSDs. The
traffic is not visible inside the VM, only on the hypervisor.
If the client is rebooted, the problem is gone. If the OSD is restarted,
the problem is gone.
This started happening several times per day after we made several changes
at the same time:
- added physical RAM to the Ceph nodes
- moved from fixed 'bluestore cache size hdd|ssd' and 'bluestore cache kv
max' to 'bluestore cache autotune = 1' and 'osd memory target =
20401094656'
- updated Ceph on the nodes from 12.2.8 to 12.2.11
- updated the clients from 12.2.8 to 12.2.11
We have since upgraded the ceph nodes to 12.2.12 but it did not help to fix
this problem.
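For reference, the cache configuration change amounts to roughly the
following in ceph.conf (a sketch using the values quoted above; the
placeholders stand in for our old fixed sizes, and the full config is in
the gist linked below):

```ini
[osd]
# Before: fixed BlueStore cache sizes
#bluestore cache size hdd = <fixed value>
#bluestore cache size ssd = <fixed value>
#bluestore cache kv max = <fixed value>

# After: let BlueStore autotune its caches toward the memory target
bluestore cache autotune = 1
osd memory target = 20401094656   # ~19 GiB per OSD daemon
```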
My request is that someone takes a look at our findings below and gives
some insight into whether this is a bug or a misconfiguration, or perhaps
an idea of where to take a closer look.
Our setup is:
8 identical nodes, each with 4 HDDs (8 TB, 7k rpm) and 6 SSDs (4 TB). There
are a number of pools that, via CRUSH rules, map to either the HDDs or
the SSDs. The pool that always has this problem is called 'prod_slow' and
maps to the HDDs.
I tracked down the OSD by looking at the port from which the client
receives most of its traffic (all 4 Gbit/s is read traffic, outgoing from
Ceph, incoming to the client).
root@ceph-03:~# netstat -tlpn|grep 6804
tcp 0 0 10.60.8.11:6804 0.0.0.0:* LISTEN
3741/ceph-osd
tcp 0 0 10.60.6.11:6804 0.0.0.0:* LISTEN
3730/ceph-osd
root@ceph-03:~# ps uafx|grep 3730
ceph 3730 44.3 6.3 20214604 16848524 ? Ssl Jun05 524:14
/usr/bin/ceph-osd -f --cluster ceph --id 23 --setuser ceph --setgroup ceph
root@ceph-03:~# ps -L -p3730
PID LWP TTY TIME CMD
3730 3730 ? 00:00:05 ceph-osd
3730 3778 ? 00:00:00 log
3730 3791 ? 05:19:49 msgr-worker-0
3730 3802 ? 00:01:18 msgr-worker-1
3730 3810 ? 00:01:25 msgr-worker-2
3730 3842 ? 00:00:00 service
3730 3845 ? 00:00:00 admin_socket
3730 4015 ? 00:00:00 ceph-osd
3730 4017 ? 00:00:00 safe_timer
3730 4018 ? 00:00:03 safe_timer
3730 4019 ? 00:00:00 safe_timer
3730 4020 ? 00:00:00 safe_timer
3730 4021 ? 00:00:14 bstore_aio
3730 4023 ? 00:00:05 bstore_aio
3730 4280 ? 00:00:32 rocksdb:bg0
3730 4634 ? 00:00:00 dfin
3730 4635 ? 00:00:12 finisher
3730 4636 ? 00:00:51 bstore_kv_sync
3730 4637 ? 00:00:12 bstore_kv_final
3730 4638 ? 00:00:27 bstore_mempool
3730 5803 ? 00:03:08 ms_dispatch
3730 5804 ? 00:00:00 ms_local
3730 5805 ? 00:00:00 ms_dispatch
3730 5806 ? 00:00:00 ms_local
3730 5807 ? 00:00:00 ms_dispatch
3730 5808 ? 00:00:00 ms_local
3730 5809 ? 00:00:00 ms_dispatch
3730 5810 ? 00:00:00 ms_local
3730 5811 ? 00:00:00 ms_dispatch
3730 5812 ? 00:00:00 ms_local
3730 5813 ? 00:00:00 ms_dispatch
3730 5814 ? 00:00:00 ms_local
3730 5815 ? 00:00:00 ms_dispatch
3730 5816 ? 00:00:00 ms_local
3730 5817 ? 00:00:00 safe_timer
3730 5818 ? 00:00:00 fn_anonymous
3730 5819 ? 00:00:02 safe_timer
3730 5820 ? 00:00:00 tp_peering
3730 5821 ? 00:00:00 tp_peering
3730 5822 ? 00:00:00 fn_anonymous
3730 5823 ? 00:00:00 fn_anonymous
3730 5824 ? 00:00:00 safe_timer
3730 5825 ? 00:00:00 safe_timer
3730 5826 ? 00:00:00 safe_timer
3730 5827 ? 00:00:00 safe_timer
3730 5828 ? 00:00:00 osd_srv_agent
3730 5829 ? 00:01:15 tp_osd_tp
3730 5830 ? 00:01:27 tp_osd_tp
3730 5831 ? 00:01:40 tp_osd_tp
3730 5832 ? 00:00:49 tp_osd_tp
3730 5833 ? 03:09:49 tp_osd_tp
3730 5834 ? 00:00:00 tp_osd_remove
3730 5835 ? 00:00:00 tp_osd_recovery
3730 5836 ? 00:00:00 tp_osd_cmd
3730 5837 ? 00:00:06 osd_srv_heartbt
3730 5950 ? 00:00:00 signal_handler
3730 140250 ? 00:00:02 rocksdb:bg0
root@ceph-03:~#
Stracing those threads shows a lot of detail, which I link at the bottom of this post.
After increasing the debug logging on the OSD (and finally shutting the
OSD down), the log shows:
2019-06-06 09:25:43.647703 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con
0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.649637 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con
0x55dc9eb08800 session 0x55dbc1918a00
2019-06-06 09:25:43.651376 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con
0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.653280 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con
0x55dc9eb08800 session 0x55dbc1918a00
2019-06-06 09:25:43.655156 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con
0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.656798 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con
0x55dc9eb08800 session 0x55dbc1918a00
2019-06-06 09:25:43.658183 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con
0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.659937 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con
0x55dc9eb08800 session 0x55dbc1918a00
2019-06-06 09:25:43.661915 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con
0x55dd1ba1c800 session 0x55dbc1918780
2019-06-06 09:25:43.664783 7fde9b1dc700 2 osd.23 12939 ms_handle_reset con
0x55dc9eb08800 session 0x55dbc1918a00
The 'ms_handle_reset con' messages appear roughly 500 times per second.
This continues until we restart the OSD. Once the OSD is back up, it
starts to recover the degraded PGs and all is fine.
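As a rough sanity check on that rate, the timestamps in the log excerpt above can be parsed to estimate the message frequency (a small sketch; the timestamps below are copied from the excerpt):

```python
from datetime import datetime

# Timestamps of the ten 'ms_handle_reset' messages in the log excerpt above
stamps = [
    "2019-06-06 09:25:43.647703",
    "2019-06-06 09:25:43.649637",
    "2019-06-06 09:25:43.651376",
    "2019-06-06 09:25:43.653280",
    "2019-06-06 09:25:43.655156",
    "2019-06-06 09:25:43.656798",
    "2019-06-06 09:25:43.658183",
    "2019-06-06 09:25:43.659937",
    "2019-06-06 09:25:43.661915",
    "2019-06-06 09:25:43.664783",
]

times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f") for s in stamps]
span = (times[-1] - times[0]).total_seconds()
# Nine intervals between ten messages
rate = (len(times) - 1) / span
print(f"{rate:.0f} resets/second")  # prints: 527 resets/second
```

which is in the same ballpark as the ~500/s we observed over longer stretches of the log.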
The traffic isn't actually coming from the hard disk itself: checking on
the Ceph node, the HDD is doing very little. I think this traffic is
either served from cache, or it is just messaging between the client and
Ceph, with the client retrying the read over and over.
A strace of the process during problem state is here:
https://gist.github.com/mvernimmen/5fe39d4722aa8dcff8294285460db021
ceph.conf is here:
https://gist.github.com/mvernimmen/91f21b5ab8274920017fa2de743a5743
This is where my knowledge of debugging starts to run thin. I'd love to
learn how to continue, but I'm in need of some help here. Thank you for
your time!
Best regards,
--
Max Vernimmen
Senior DevOps Engineer
Textkernel
------------------------------------------------------------------------------
Textkernel BV, Nieuwendammerkade 26/a5, 1022 AB, Amsterdam, NL
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com