On Thu, Oct 5, 2017 at 12:01 PM, Olivier Bonvalet <ceph.l...@daevel.fr> wrote:
> On Thursday, 5 October 2017 at 11:47 +0200, Ilya Dryomov wrote:
>> The stable pages bug manifests as multiple sporadic connection resets,
>> because in that case CRCs computed by the kernel don't always match the
>> data that gets sent out.  When the mismatch is detected on the OSD
>> side, OSDs reset the connection and you'd see messages like
>>
>>   libceph: osd1 1.2.3.4:6800 socket closed (con state OPEN)
>>   libceph: osd2 1.2.3.4:6804 socket error on write
>>
>> This is a different issue.  Josy, Adrian, Olivier, do you also see
>> messages of the "libceph: read_partial_message ..." type or is it just
>> "libceph: ... bad crc/signature" errors?
>
> I have "read_partial_message" too, for example:
>
> Oct  5 09:00:47 lorunde kernel: [65575.969322] libceph: read_partial_message ffff88027c231500 data crc 181941039 != exp. 115232978
> Oct  5 09:00:47 lorunde kernel: [65575.969953] libceph: osd122 10.0.0.31:6800 bad crc/signature
> Oct  5 09:04:30 lorunde kernel: [65798.958344] libceph: read_partial_message ffff880254a25c00 data crc 443114996 != exp. 2014723213
> Oct  5 09:04:30 lorunde kernel: [65798.959044] libceph: osd18 10.0.0.22:6802 bad crc/signature
> Oct  5 09:14:28 lorunde kernel: [66396.788272] libceph: read_partial_message ffff880238636200 data crc 1797729588 != exp. 2550563968
> Oct  5 09:14:28 lorunde kernel: [66396.788984] libceph: osd43 10.0.0.9:6804 bad crc/signature
> Oct  5 10:09:36 lorunde kernel: [69704.211672] libceph: read_partial_message ffff8802712dff00 data crc 2241944833 != exp. 762990605
> Oct  5 10:09:36 lorunde kernel: [69704.212422] libceph: osd103 10.0.0.28:6804 bad crc/signature
> Oct  5 10:25:41 lorunde kernel: [70669.203596] libceph: read_partial_message ffff880257521400 data crc 3655331946 != exp. 2796991675
> Oct  5 10:25:41 lorunde kernel: [70669.204462] libceph: osd16 10.0.0.21:6806 bad crc/signature
> Oct  5 10:25:52 lorunde kernel: [70680.255943] libceph: read_partial_message ffff880245e3d600 data crc 3787567693 != exp. 725251636
> Oct  5 10:25:52 lorunde kernel: [70680.257066] libceph: osd60 10.0.0.23:6800 bad crc/signature

OK, so both your case and Josy's are actually the reverse: here the
kernel client detects the mismatch on incoming data, so it's definitely
not related to stable pages.

When did you start seeing these errors?  Can you correlate that to
a Ceph or kernel upgrade?  If not, and if you don't see other issues,
I'd write it off as faulty hardware.
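
One way to narrow down a hardware suspect is to check whether the errors
cluster on a particular OSD or OSD host.  The following is only a minimal
sketch, assuming the log lines have the "libceph: osdN ADDR:PORT bad
crc/signature" shape shown in the excerpt above; the function name
tally_bad_crc and the SAMPLE list are hypothetical and exist only for
illustration.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
#   libceph: osd122 10.0.0.31:6800 bad crc/signature
# The pattern is an assumption based on the excerpt quoted in this thread.
BAD_CRC = re.compile(
    r"libceph: (osd\d+) (\d+\.\d+\.\d+\.\d+):\d+ bad crc/signature"
)


def tally_bad_crc(lines):
    """Count 'bad crc/signature' events per OSD and per OSD address."""
    per_osd, per_host = Counter(), Counter()
    for line in lines:
        match = BAD_CRC.search(line)
        if match:
            per_osd[match.group(1)] += 1
            per_host[match.group(2)] += 1
    return per_osd, per_host


# The six "bad crc/signature" lines from the excerpt, with wrapping undone.
SAMPLE = [
    "Oct  5 09:00:47 lorunde kernel: [65575.969953] libceph: osd122 10.0.0.31:6800 bad crc/signature",
    "Oct  5 09:04:30 lorunde kernel: [65798.959044] libceph: osd18 10.0.0.22:6802 bad crc/signature",
    "Oct  5 09:14:28 lorunde kernel: [66396.788984] libceph: osd43 10.0.0.9:6804 bad crc/signature",
    "Oct  5 10:09:36 lorunde kernel: [69704.212422] libceph: osd103 10.0.0.28:6804 bad crc/signature",
    "Oct  5 10:25:41 lorunde kernel: [70669.204462] libceph: osd16 10.0.0.21:6806 bad crc/signature",
    "Oct  5 10:25:52 lorunde kernel: [70680.257066] libceph: osd60 10.0.0.23:6800 bad crc/signature",
]

if __name__ == "__main__":
    per_osd, per_host = tally_bad_crc(SAMPLE)
    for host, count in per_host.most_common():
        print(host, count)
```

In this short sample every event involves a different OSD and a different
address, so the errors do not cluster on one OSD host.  That may hint at
something common to all connections, such as the client machine or its
network path, but six lines are far too few to conclude anything; running
the same tally over the full kernel log would give a better picture.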

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com