From https://forum.proxmox.com/threads/hyperconverged-cluster-logging-seemingly-random-crc-errors
We have 4 nodes (dual Xeon CPUs, 256 GB RAM, 4 NVMe SSDs, 4 HDDs and
dual Mellanox 25 Gb/s SFPs) in a cluster. I have started noticing
apparently random CRC errors in the OSD logs.
Node B, osd.6
2025-10-23T10:32:59.808+0200 7f22a75bf700 0 bad crc in data 3330350463
!= exp 677417498 from v1:192.168.131.4:0/3121668685
192.168.131.4 is node D
Node B, osd.7
2025-10-23T09:35:12.995+0200 7fbcdbcd7700 0 bad crc in data 3922083958
!= exp 3479198006 from v1:192.168.131.2:0/2732728486
192.168.131.2 is node B, which is the node osd.7 is on.
There are others like these on other nodes and OSDs. From what I
understand, this means that data sent from some other OSD to the one
logging the error fails the CRC check. However, as a test I have taken
one of these SSDs out of the cluster and it tests just fine. After
putting it back, no CRC errors are logged for it.
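A quick way to collect and correlate these events per node would be
something like the following (assuming the logs still go to the default
/var/log/ceph/ location; adjust if they only end up in the journal):

  # count bad-crc events per sending address, across all OSD logs on this node
  grep -h "bad crc in data" /var/log/ceph/ceph-osd.*.log \
    | grep -oE 'from v1:[0-9.]+' \
    | sort | uniq -c | sort -rn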
Question: Can something else be causing this? A network connector? It
seems pretty random to me, so how can I trace the source of this?
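To rule out a flaky connector or cable, the link-level error counters
on each node should show something, along these lines (ens1f0 is only a
placeholder for the Mellanox port name, which ip -br link will show):

  # kernel rx/tx error and drop counters for the 25G port
  ip -s link show ens1f0
  # driver/firmware counters (names vary by driver), filtered broadly
  ethtool -S ens1f0 | grep -iE 'err|drop|crc|discard'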
The vendor's tech support suggested turning offloading off on the NIC:
ethtool -K vmbr0 rx off tx off tso off gso off gro off lro off.
That doesn't seem to make a difference, however.
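One thing I am not sure about: vmbr0 is the Linux bridge, so the
offloads may need to be disabled on the physical Mellanox ports
underneath it rather than (or in addition to) the bridge itself.
Roughly like this, with ens1f0/ens1f1 again standing in for the real
port names:

  # disable hardware offloads on both 25G ports
  for nic in ens1f0 ens1f1; do
      ethtool -K "$nic" rx off tx off tso off gso off gro off lro off
  done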
The next step would be to check for an MTU/jumbo frame mismatch.
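The idea would be to compare the configured MTU on every node and then
ping across the cluster network with the don't-fragment bit set. A
payload of 8972 bytes assumes a 9000-byte MTU (9000 minus 20 bytes IP
and 8 bytes ICMP header); with a standard 1500 MTU it would be 1472:

  # confirm the MTU actually in use on the bridge and the physical port
  ip link show vmbr0
  ip link show ens1f0
  # from node B, ping node D without fragmentation
  ping -M do -s 8972 -c 4 192.168.131.4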
Some details: ceph version 17.2.7
# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME           STATUS  REWEIGHT  PRI-AFF
 -1         62.29883  root default
 -3         15.17824      host FT1-NodeA
  2    hdd   1.86029          osd.2           up   1.00000  1.00000
  3    hdd   1.86029          osd.3           up   1.00000  1.00000
  4    hdd   1.86029          osd.4           up   1.00000  1.00000
  5    hdd   1.86589          osd.5           up   1.00000  1.00000
  0    ssd   0.93149          osd.0           up   1.00000  1.00000
 28    ssd   3.30690          osd.28          up   1.00000  1.00000
 29    ssd   3.49309          osd.29          up   1.00000  1.00000
 -7         16.10413      host FT1-NodeB
 10    hdd   1.86029          osd.10          up   1.00000  1.00000
 11    hdd   1.86029          osd.11          up   1.00000  1.00000
 26    hdd   1.86029          osd.26          up   1.00000  1.00000
 27    hdd   1.86029          osd.27          up   1.00000  1.00000
  6    ssd   0.93149          osd.6           up   1.00000  1.00000
  7    ssd   0.93149          osd.7           up   1.00000  1.00000
 25    ssd   3.49309          osd.25          up   1.00000  1.00000
 41    ssd   3.30690          osd.41          up   1.00000  1.00000
-10         15.84383      host FT1-NodeC
 14    hdd   1.59999          osd.14          up   1.00000  1.00000
 15    hdd   1.86029          osd.15          up   1.00000  1.00000
 16    hdd   1.86029          osd.16          up   1.00000  1.00000
 17    hdd   1.86029          osd.17          up   1.00000  1.00000
  8    ssd   0.93149          osd.8           up   1.00000  1.00000
  9    ssd   0.93149          osd.9           up   1.00000  1.00000
 24    ssd   3.49309          osd.24          up   1.00000  1.00000
 43    ssd   3.30690          osd.43          up   1.00000  1.00000
-13         15.17264      host FT1-NodeD
 20    hdd   1.86029          osd.20          up   1.00000  1.00000
 21    hdd   1.86029          osd.21          up   1.00000  1.00000
 22    hdd   1.86029          osd.22          up   1.00000  1.00000
 23    hdd   1.86029          osd.23          up   1.00000  1.00000
 12    ssd   3.30690          osd.12          up   1.00000  1.00000
 13    ssd   0.93149          osd.13          up   1.00000  1.00000
 19    ssd   3.49309          osd.19          up   1.00000  1.00000
The spinners have their RocksDB on the NVMes for extra performance,
but I had not noticed any CRC errors prior to installing the 4TB
Samsungs in each node.
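For completeness, this is how the DB placement can be double-checked
per OSD (osd.6 as an example; the exact metadata field names may vary
slightly between Ceph releases):

  # list the LVs/devices behind the OSDs on this node
  ceph-volume lvm list
  # or query the cluster for a single OSD's metadata
  ceph osd metadata 6 | grep -E 'devices|bluefs'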
Suggestions are welcome.
thanks
Roland