Do you have some idea how I can diagnose this problem?
I'd look at the ceph -s output while these processes are stuck to see
if there's any unusual activity (scrub/deep-scrub/recovery/backfills/...).
Is it correlated in any way with rbd removal (i.e., write blocking
doesn't appear unless you removed at least one rbd in, say, the hour
before the write performance problems)?
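
As a rough aid for the check suggested above, here is a minimal Python
sketch that scans saved ceph -s text for the kinds of activity
mentioned (scrub, recovery, backfill). The function name and the
keyword list are illustrative assumptions, not part of any Ceph
tooling, and the match is a plain substring search on each line.

```python
# Substrings to look for in `ceph -s` output. These are assumptions chosen
# to cover the activity named above: "scrub" also matches "scrubbing+deep",
# "recover" matches "recovering"/"recovery_wait", and "backfill" matches
# "backfilling"/"backfill_wait".
KEYWORDS = ("scrub", "recover", "backfill")


def unusual_activity(status_text):
    """Return the stripped lines of status_text that mention any keyword."""
    hits = []
    for line in status_text.splitlines():
        lowered = line.lower()
        if any(keyword in lowered for keyword in KEYWORDS):
            hits.append(line.strip())
    return hits
```

One way to use it is to capture ceph -s to a file every few seconds
while a process is stuck, then pass each capture's text to
unusual_activity() and compare against captures from a quiet period.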
I'm not familiar with Amazon VMs. If you map the rbds to local block
devices using the kernel driver, do you have control over the kernel
you run? (I've seen reports of various problems with older kernels, and
you probably want the latest possible.)
ceph status shows nothing unusual. However, on the problematic node, we
typically see entries in ps like this:
1468 12329 root D 0.0 mkfs.ext4 wait_on_page_bit
1468 12332 root D 0.0 mkfs.ext4 wait_on_buffer
Notice the "D" blocking state. Here, mkfs is stopped on some wait
functions for long periods of time. (Also, we are formatting the RBDs as
ext4 even though the OSDs are xfs; I assume this shouldn't be a problem?)
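
To catch these blocked processes as they happen, a small Python sketch
along the following lines may help. It assumes the text comes from
`ps -eo pid,stat,wchan:32,comm` (a different column layout from the
listing above), and the function name is hypothetical.

```python
def blocked_processes(ps_text):
    """Parse `ps -eo pid,stat,wchan:32,comm` output.

    Returns a list of (pid, wchan, command) tuples for every process whose
    state starts with "D" (uninterruptible sleep).
    """
    blocked = []
    for line in ps_text.splitlines():
        fields = line.split(None, 3)
        # Skip the header row and anything that is not a full process line.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        pid, stat, wchan, comm = fields
        if stat.startswith("D"):
            blocked.append((int(pid), wchan, comm))
    return blocked
```

Running this periodically and logging the results would show how long
each mkfs stays in wait_on_page_bit or wait_on_buffer. On kernels that
expose it, /proc/<pid>/stack (read as root) should give the fuller
kernel stack for a blocked pid.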
We're on kernel 3.18.4pl2, which is pretty recent. Still, an outdated
kernel driver isn't out of the question; if anyone has any concrete
information, I'd be grateful.
Jeff
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com