Hello Mike, see my inline comments.
On 14.08.19 at 02:09, Mike Christie wrote:
>>> -----
>>> Previous tests crashed in a reproducible manner with "-P 1" (single io
>>> gzip/gunzip) after a few minutes up to 45 minutes.
>>>
>>> Overview of my tests:
>>>
>>> - SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system,
>>>   120s device timeout
>>>   -> 18 hour test run was successful, no dmesg output
>>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system,
>>>   120s device timeout
>>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io
>>>      errors, map/mount can be re-created without reboot
>>>   -> parallel krbd device usage with 99% io usage worked without a
>>>      problem while running the test
>>> - FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system,
>>>   120s device timeout
>>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io
>>>      errors, map/mount can be re-created
>>>   -> parallel krbd device usage with 99% io usage worked without a
>>>      problem while running the test
>>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no
>>>   timeout
>>>   -> failed after < 10 minutes
>>>   -> system runs into very high system load, system is almost unusable,
>>>      unable to shut down the system, hard reset of the vm necessary,
>>>      manual exclusive lock removal is necessary before remapping the
>>>      device

There is something new compared to yesterday: three days ago I downgraded a
production system to client version 12.2.5. This night that machine crashed
as well. So it seems that rbd-nbd is broken in general, also with release
12.2.5 and potentially before.

The new (updated) list:

- FAILED: kernel 4.15, ceph 12.2.5, 2TB ec-volume, ext4 file system, 120s
  device timeout
  -> crashed in production while snapshot trimming was running on that pool
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s
  device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io
     errors, map/mount can be re-created without reboot
  -> parallel krbd device usage with 99% io usage worked without a problem
     while running the test
- FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s
  device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io
     errors, map/mount can be re-created
  -> parallel krbd device usage with 99% io usage worked without a problem
     while running the test
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no
  timeout
  -> failed after < 10 minutes
  -> system runs into very high system load, system is almost unusable,
     unable to shut down the system, hard reset of the vm necessary, manual
     exclusive lock removal is necessary before remapping the device (see
     the lock removal sketch at the end of this mail)
- FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system,
  120s device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io
     errors, map/mount can be re-created
- FAILED: kernel 5.0, ceph 12.2.12, 2TB ec-volume, ext4 file system, 120s
  device timeout
  -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io
     errors, map/mount can be re-created

>>> - FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file
>>>   system, 120s device timeout
>>>   -> failed after < 1 hour, rbd-nbd map/device is gone, mount throws io
>>>      errors, map/mount can be re-created

>> How many CPUs and how much memory does the VM have?
Characteristics of the crashed VM:

* Ubuntu 18.04 with kernel 4.15, Ceph client 12.2.5
* Services: NFS kernel server, nothing else
* Crash behavior:
  o the daily task for snapshot creation/deletion started at 19:00
  o a daily database backup started at 19:00, which created
    + 120 IOPS write and 1 IOPS read
    + 22K sectors per second write, 0 sectors per second read
    + 97 Mbit/s inbound and 97 Mbit/s outbound network usage (nfs server)
  o we had slow requests at the time of the crash
  o the rbd-nbd process terminated 25 min later without a segfault
  o the nfs usage created a 5-minute load of 10 right from the start and
    5K context switches/sec
  o memory usage (kernel + userspace) was 10% of the system
  o no swap usage
* ceph.conf:
  [client]
  rbd cache = true
  rbd cache size = 67108864
  rbd cache max dirty = 33554432
  rbd cache target dirty = 25165824
  rbd cache max dirty age = 3
  rbd readahead max bytes = 4194304
  admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
* 4 CPUs
* 6 GB RAM
* non-default sysctl settings:
  vm.swappiness = 1
  fs.aio-max-nr = 262144
  fs.file-max = 1000000
  kernel.pid_max = 4194303
  vm.zone_reclaim_mode = 0
  kernel.randomize_va_space = 0
  kernel.panic = 0
  kernel.panic_on_oops = 0

>> I'm not sure which test it covers above, but for
>> test-with-timeout/ceph-client.archiv.log and dmesg-crash it looks like
>> the command that probably triggered the timeout got stuck in safe_write
>> or write_fd, because we see:
>>
>> // Command completed and right after this log message we try to write
>> the reply and data to the nbd.ko module.
>>
>> 2019-07-29 21:55:21.148118 7fffbf7fe700 20 rbd-nbd: writer_entry: got:
>> [4500000000000000 READ 24043755000~20000 0]
>>
>> // We got stuck and 2 minutes go by and so the timeout fires. That kills
>> the socket, so we get an error here and after that rbd-nbd is going to exit.
>>
>> 2019-07-29 21:57:21.785111 7fffbf7fe700 -1 rbd-nbd: [4500000000000000
>> READ 24043755000~20000 0]: failed to write replay data: (32) Broken pipe
>>
>> We could hit this in a couple of ways:
>>
>> 1. The block layer sends a command that is larger than the socket's send
>> buffer limits. These are those values you sometimes set in sysctl.conf like:
>>
>> net.core.rmem_max
>> net.core.wmem_max
>> net.core.rmem_default
>> net.core.wmem_default
>> net.core.optmem_max

See the attached sysctl_settings.txt.gz for our current values.

>> There does not seem to be any checks/code to make sure there is some
>> alignment with limits. I will send a patch but that will not help you
>> right now. The max io size for nbd is 128k, so make sure your net values
>> are large enough. Increase the values in sysctl.conf and retry if they
>> were too small.

> Not sure what I was thinking. Just checked the logs and we have done IO
> of the same size that got stuck and it was fine, so the socket sizes
> should be ok.
>
> We still need to add code to make sure IO sizes and the af_unix socket
> size limits match up.

>> 2. If memory is low on the system, we could be stuck trying to allocate
>> memory in the kernel in that code path too.

Memory was definitely not low; we only had 10% memory usage at the time of
the crash.

>> rbd-nbd just uses more memory per device, so it could be why we do not
>> see a problem with krbd.
>>
>> 3. I wonder if we are hitting a bug with PF_MEMALLOC Ilya hit with krbd.
>> He removed that code from the krbd. I will ping him on that.

Interesting. I activated coredumps for those processes - probably we can
find something interesting there...

Regards
Marc
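P.S.: a few sketches below in case they help others following this thread.

To compare the socket buffer limits Mike listed against the 128k max io
size of nbd, something like the following can be used (the 262144 shown
when raising the limits is only an illustrative value, not a recommendation
from this thread):

  # print the current limits (values are in bytes)
  sysctl net.core.rmem_max net.core.wmem_max \
         net.core.rmem_default net.core.wmem_default net.core.optmem_max

  # raise the hard limits temporarily if they are smaller than the largest
  # nbd io (128k) plus some overhead; persist them in /etc/sysctl.conf if
  # they turn out to matter
  sysctl -w net.core.rmem_max=262144
  sysctl -w net.core.wmem_max=262144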
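Enabling the coredumps mentioned above boils down to something like this
(the core file path is just an example, adjust to your setup):

  # allow unlimited core file size in the shell that starts rbd-nbd
  ulimit -c unlimited

  # write cores to a known location (%e = executable name, %p = pid)
  echo '/var/crash/core.%e.%p' > /proc/sys/kernel/core_pattern

  # remap the image so the new limit applies to the rbd-nbd process
  rbd-nbd map <pool>/<image>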
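And the manual exclusive lock removal mentioned for the "no timeout" test
case is roughly the following (pool, image, device and mountpoint are
placeholders):

  # show the stale lock left behind by the dead rbd-nbd process
  rbd lock ls <pool>/<image>

  # remove it; <lock-id> and <locker> are taken from the output above
  rbd lock rm <pool>/<image> <lock-id> <locker>

  # remap and remount
  rbd-nbd map <pool>/<image>
  mount /dev/nbd0 /mnt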
[Attachment: sysctl_settings.txt.gz (application/gzip)]