Le jeudi 12 octobre 2017 à 09:12 +0200, Ilya Dryomov a écrit :
> It's a crash in memcpy() in skb_copy_ubufs().  It's not in ceph, but
> ceph-induced, it looks like.  I don't remember seeing anything
> similar
> in the context of krbd.
> 
> This is a Xen dom0 kernel, right?  What did the workload look like?
> Can you provide dmesg before the crash?

Hi,

yes it's a Xen dom0 kernel. Linux 4.13.3, Xen 4.8.2, with an old
0.94.10 Ceph (so, Hammer).

Before this error, I add this in logs :

Oct 11 16:00:41 lorunde kernel: [310548.899082] libceph: read_partial_message 
ffff88021a910200 data crc 2306836368 != exp. 2215155875
Oct 11 16:00:41 lorunde kernel: [310548.899841] libceph: osd117 10.0.0.31:6804 
bad crc/signature
Oct 11 16:02:25 lorunde kernel: [310652.695015] libceph: read_partial_message 
ffff880220b10100 data crc 842840543 != exp. 2657161714
Oct 11 16:02:25 lorunde kernel: [310652.695731] libceph: osd3 10.0.0.26:6804 
bad crc/signature
Oct 11 16:07:24 lorunde kernel: [310952.485202] libceph: read_partial_message 
ffff88025d1aa400 data crc 938978341 != exp. 4154366769
Oct 11 16:07:24 lorunde kernel: [310952.485870] libceph: osd117 10.0.0.31:6804 
bad crc/signature
Oct 11 16:10:44 lorunde kernel: [311151.841812] libceph: read_partial_message 
ffff880260300400 data crc 2988747958 != exp. 319958859
Oct 11 16:10:44 lorunde kernel: [311151.842672] libceph: osd9 10.0.0.51:6802 
bad crc/signature
Oct 11 16:10:57 lorunde kernel: [311165.211412] libceph: read_partial_message 
ffff8802208b8300 data crc 369498361 != exp. 906022772
Oct 11 16:10:57 lorunde kernel: [311165.212135] libceph: osd87 10.0.0.5:6800 
bad crc/signature
Oct 11 16:12:27 lorunde kernel: [311254.635767] libceph: read_partial_message 
ffff880236f9a000 data crc 2586662963 != exp. 2886241494
Oct 11 16:12:27 lorunde kernel: [311254.636493] libceph: osd90 10.0.0.5:6814 
bad crc/signature
Oct 11 16:14:31 lorunde kernel: [311378.808191] libceph: read_partial_message 
ffff88027e633c00 data crc 1102363051 != exp. 679243837
Oct 11 16:14:31 lorunde kernel: [311378.808889] libceph: osd13 10.0.0.21:6804 
bad crc/signature
Oct 11 16:15:01 lorunde kernel: [311409.431034] libceph: read_partial_message 
ffff88024ce0a800 data crc 2467415342 != exp. 1753860323
Oct 11 16:15:01 lorunde kernel: [311409.431718] libceph: osd111 10.0.0.30:6804 
bad crc/signature
Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault: 0000 
[#1] SMP


We had to switch to TCP Cubic (instead of badly configured TCP BBR, without 
FQ), to reduce the data crc errors.
But since we still had some errors, last night we rebooted all the OSD nodes in 
Linux 4.4.91, instead of Linux 4.9.47 & 4.9.53.

Since the last 7 hours, we haven't got any data crc errors from OSD, but we had 
one from a MON. Without hang/crash.

About the workload, the Xen VMs are mainly LAMP servers : http traffic, handle 
by nginx or apache, php, and MySQL databases.

Thanks,

Olivier
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to