Hello Jason,

It seems that there is something wrong in the rbd-nbd implementation.
(I added this information to https://tracker.ceph.com/issues/40822 as well.)

The problem does not seem to be related to kernel releases, filesystem types,
or the Ceph and network setup.
Release 12.2.5 seems to work properly, while at least releases >= 12.2.10 show
the described problem.

Last night an 18-hour test run with the following procedure was successful:
-----
#!/bin/bash
set -x
while true; do
   date
   # compress 50 files at a time, 10 gzip processes in parallel
   # (note: "head -z" is needed here, since a plain "head -n 50" counts
   # newlines and does not limit the NUL-delimited find output)
   find /srv_ec -type f -name "*.MYD" -print0 | head -z -n 50 | xargs -0 -P 10 -n 2 gzip -v
   date
   find /srv_ec -type f -name "*.MYD.gz" -print0 | head -z -n 50 | xargs -0 -P 10 -n 2 gunzip -v
done
-----
Previous tests crashed reproducibly with "-P 1" (single-stream gzip/gunzip io)
after anywhere from a few minutes up to 45 minutes; see the variant below.
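
For reference, the single-stream variant that triggered the crashes is simply
the same pipeline with the parallelism reduced to one process:
-----
find /srv_ec -type f -name "*.MYD" -print0 | head -z -n 50 | xargs -0 -P 1 -n 2 gzip -v
-----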

Overview of my tests:

- SUCCESSFUL: kernel 4.15, ceph 12.2.5, 1TB ec-volume, ext4 file system, 120s device timeout
  -> the 18-hour test run was successful, no dmesg output
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour; the rbd-nbd map/device is gone, the mount throws io errors, map/mount can be re-created without a reboot
  -> parallel krbd device usage at 99% io utilization worked without problems while the test was running
- FAILED: kernel 4.15, ceph 12.2.11, 2TB ec-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour; the rbd-nbd map/device is gone, the mount throws io errors, map/mount can be re-created
  -> parallel krbd device usage at 99% io utilization worked without problems while the test was running
- FAILED: kernel 4.4, ceph 12.2.11, 2TB ec-volume, xfs file system, no timeout
  -> failed after < 10 minutes
  -> the system ran into very high load and became almost unusable; a shutdown was impossible, a hard reset of the VM was necessary, and a manual exclusive lock removal was required before remapping the device (see the lock-removal sketch after this list)
- FAILED: kernel 4.4, ceph 12.2.11, 2TB 3-replica-volume, xfs file system, 120s device timeout
  -> failed after < 1 hour; the rbd-nbd map/device is gone, the mount throws io errors, map/mount can be re-created
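
For completeness, the manual lock cleanup mentioned above looks roughly like
this (the pool/image name is a placeholder; lock id and locker are taken from
the list output):
-----
# Show the stale exclusive lock left behind by the dead rbd-nbd process ...
rbd lock list rbd_hdd/test-image
# ... and remove it, using the lock id and locker printed above, so the
# image can be mapped again.
rbd lock remove rbd_hdd/test-image "<lock-id>" "<locker>"
-----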

All device timeouts were set separately with the nbd_set_ioctl tool, because
the luminous rbd-nbd does not provide a way to define timeouts.
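
For illustration, this is roughly how the timeout was applied (the
nbd_set_ioctl argument order is an assumption about that tool's CLI; it just
issues the kernel's NBD_SET_TIMEOUT ioctl on the mapped device):
-----
# Map with the luminous rbd-nbd, then set the 120s timeout on the device.
# The nbd_set_ioctl invocation (device + timeout in seconds) is an
# assumption; newer rbd-nbd releases expose this directly as a map option
# ("rbd-nbd map --timeout").
dev=$(rbd-nbd map rbd_hdd/test-image)   # placeholder pool/image
nbd_set_ioctl "$dev" 120
-----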

What's next? Is it a good idea to do a binary search between 12.2.5 and 12.2.12, as sketched below?
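
A rough sketch of how such a bisection could look (the package version suffix
and the pool/image name are placeholders that need to be adapted; a real
binary search would of course pick the midpoint release first instead of
iterating):
-----
#!/bin/bash
# Hypothetical bisection over the luminous point releases between the
# last known-good (12.2.5) and the first known-bad (>= 12.2.10) version.
for v in 12.2.6 12.2.7 12.2.8 12.2.9; do
    apt-get install -y --allow-downgrades "rbd-nbd=${v}-1xenial" "librbd1=${v}-1xenial"
    dev=$(rbd-nbd map rbd_hdd/test-image)
    mount "$dev" /srv_ec
    # ... run the gzip/gunzip stress test for some hours, record pass/fail ...
    umount /srv_ec
    rbd-nbd unmap "$dev"
done
-----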

From my point of view (without in-depth knowledge of rbd-nbd/librbd), my
assumption is that this problem is caused by the rbd-nbd code and not by
librbd.
The probability that a bug like this would survive undiscovered in librbd for
such a long time seems low to me :-)

Regards
Marc

On 29.07.19 at 22:25, Marc Schöchlin wrote:
> Hello Jason,
>
> I updated the ticket https://tracker.ceph.com/issues/40822
>
> On 24.07.19 at 19:20, Jason Dillaman wrote:
>> On Wed, Jul 24, 2019 at 12:47 PM Marc Schöchlin <m...@256bit.org> wrote:
>>> Testing with the 12.2.5 librbd/rbd-nbd is currently not that easy for me,
>>> because the ceph apt source does not contain that version.
>>> Do you know a package source?
>> All the upstream packages should be available here [1], including 12.2.5.
> Ah okay, I will test this tomorrow.
>> Did you pull the OSD blocked ops stats to figure out what is going on
>> with the OSDs?
> Yes, see referenced data in the ticket 
> https://tracker.ceph.com/issues/40822#note-15
>
> Regards
> Marc
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
