Hi,
On 11/25/2015 06:41 PM, Robert LeBlanc wrote:
> Since the one that is different is not your primary for the pg, then
> pg repair is safe.
OK, that's clear, thanks.
I think we managed to identify the root cause of the scrubbing errors
that occur even though the files are identical.
It seems to be a hardware issue (a faulty RAM module), which is really
hard to detect, even if you have an ECC-capable module.
The glitch happens here:
node2:~# while true; do sha1sum /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1; sleep 0.1; done
acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
...
acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
acd62deb72530e22b7ebdce3e2e47e0480af533b  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
4163ca9a76ed7b0b9f0e69ab5a1793cd1cf7d1c4  /var/lib/ceph/osd/ceph-12/current/1.14e_head/DIR_E/DIR_4/DIR_D/rb.0.532aa.238e1f29.000000016ce4__head_DE208D4E__1
....
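For longer runs, the same repeated-hash check could be sketched in Python so that only the deviating rounds are recorded instead of scrolling by. This is a minimal sketch, not what we actually ran: the names sha1_of and watch_digest and the rounds/delay parameters are made up for illustration.

```python
import hashlib
import time


def sha1_of(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of the file at `path`, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def watch_digest(path, rounds=100, delay=0.1, digest=sha1_of):
    """Hash `path` `rounds` times in total and collect every deviation.

    The first digest is taken as the baseline. Returns the baseline and a
    list of (round_number, digest) pairs for rounds whose digest differed.
    `digest` is injectable (hypothetical hook) so the loop can be tested.
    """
    baseline = digest(path)
    mismatches = []
    for i in range(1, rounds):
        time.sleep(delay)
        current = digest(path)
        if current != baseline:
            mismatches.append((i, current))
    return baseline, mismatches
```

Calling watch_digest on the object file with a large rounds value should return an empty mismatch list on healthy hardware; any entry in the list means the same file hashed differently between two reads.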
So, sometimes it calculates different values. We managed to copy this
file several times to find the difference:
# diff 48.bin 49.bin
40095c40095
< hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtPvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC
---
> hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtTvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC
So the two copies differ in a single bit ('P' vs 'T', i.e. 0x50 vs 0x54).
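To double-check that it really is one flipped bit, the two differing lines can be XORed byte by byte. The following is only an illustrative sketch; bit_differences is a made-up helper name, and the two byte strings are the lines from the diff output.

```python
def bit_differences(a: bytes, b: bytes):
    """Compare two equal-length byte strings.

    Returns a list of (offset, byte_a, byte_b, flipped_bits) tuples, one per
    differing byte, where flipped_bits lists the bit positions (0 = least
    significant) that differ.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    result = []
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            diff = x ^ y  # set bits mark the positions that differ
            bits = [i for i in range(8) if (diff >> i) & 1]
            result.append((offset, x, y, bits))
    return result


# The two lines reported by diff
line_a = b"hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtPvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC"
line_b = b"hMxs8+iPzA5BRi/Nq4iuovTkR/Q9RXV15qgHTiGO6jtTvjT5bdQFZQH8BuCP65E4JDmn8yC7/laC"

for offset, x, y, bits in bit_differences(line_a, line_b):
    # 0x50 ^ 0x54 == 0x04, so only bit 2 differs
    print(f"offset {offset}: 0x{x:02x} vs 0x{y:02x}, flipped bits {bits}")
```

A single entry with a single bit position is what you would expect from a memory bit flip, as opposed to a torn or truncated write, which would usually touch many bytes.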
I think this presentation on silent data corruption could be very
useful:
https://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf
We will test all of our RAM modules now (it should have been done
earlier, of course...), but it seems you have to be very careful with
cheap commodity hardware.
Regards,
Csaba
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com