From https://forum.proxmox.com/threads/hyperconverged-cluster-logging-seemingly-random-crc-errors

We have 4 nodes (dual Xeon CPUs, 256 GB RAM, 4 NVMe SSDs, 4 HDDs and dual Mellanox 25Gb/s SFPs) in a cluster. I have started noticing apparently random CRC errors in the OSD logs.

Node B, osd.6
2025-10-23T10:32:59.808+0200 7f22a75bf700  0 bad crc in data 3330350463 != exp 677417498 from v1:192.168.131.4:0/3121668685
192.168.131.4 is node D

Node B, osd.7
2025-10-23T09:35:12.995+0200 7fbcdbcd7700  0 bad crc in data 3922083958 != exp 3479198006 from v1:192.168.131.2:0/2732728486
192.168.131.2 is node B, which is the node osd.7 is on.

There are others like this on other nodes and OSDs. From what I understand, this means that data copied from some other OSD to the one logging the error fails the CRC check. However, as a test I have taken one of these SSDs out of the cluster and it tests just fine. After putting it back, no CRC errors are logged for it.
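To see whether the errors correlate with a particular peer, one way would be to count the source addresses across the OSD logs on each node, something like this (assuming the default log location and the v1 address format shown above):

# grep -h "bad crc in data" /var/log/ceph/ceph-osd.*.log | awk '{print $NF}' | cut -d: -f2 | sort | uniq -c | sort -rn

If one IP dominates, that would point at a specific host or its network path rather than a single disk.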

Question: Could something else be causing this? A network connector? It seems pretty random to me, so how can I trace the source of this?
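The only thing I can think of checking myself is the NIC/driver error counters on each node, to see whether a port or cable is flaking out (the interface name below is just a placeholder for the 25Gb/s ports):

# ethtool -S enp65s0f0np0 | grep -iE 'err|drop|crc'
# ip -s link show enp65s0f0np0

Non-zero physical/CRC error counters or steadily climbing drops would point at a cable or transceiver rather than the SSDs.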

The vendor's tech support suggested turning offloading off on the NIC: ethtool -K vmbr0 rx off tx off tso off gso off gro off lro off. That doesn't seem to make a difference, however.
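(I assume the resulting settings can be verified with ethtool -k, lowercase; the second interface name below is just an example, since presumably the underlying physical ports matter as much as the bridge:

# ethtool -k vmbr0 | grep -E 'offload|checksum|segmentation'
# ethtool -k enp65s0f0np0 | grep -E 'offload|checksum|segmentation'
)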

The next step would be to check for an MTU/jumbo frame mismatch.
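Assuming jumbo frames (MTU 9000) on the cluster network, a quick test for a mismatch between nodes would be a don't-fragment ping with a payload just under the MTU (9000 minus 28 bytes of IP+ICMP headers; for a standard 1500 MTU the payload would be 1472), e.g. from node B to node D:

# ping -M do -s 8972 -c 4 192.168.131.4

If that fails while a normal ping works, some hop (NIC, bridge or switch port) is not carrying the full frame size. The configured MTUs per interface can be compared with:

# ip -o link show | awk '{print $2, $4, $5}'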


Some details: ceph version 17.2.7

# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME  STATUS  REWEIGHT  PRI-AFF
 -1         62.29883  root default
 -3         15.17824      host FT1-NodeA
  2    hdd   1.86029          osd.2           up   1.00000 1.00000
  3    hdd   1.86029          osd.3           up   1.00000 1.00000
  4    hdd   1.86029          osd.4           up   1.00000 1.00000
  5    hdd   1.86589          osd.5           up   1.00000 1.00000
  0    ssd   0.93149          osd.0           up   1.00000 1.00000
 28    ssd   3.30690          osd.28          up   1.00000 1.00000
 29    ssd   3.49309          osd.29          up   1.00000 1.00000
 -7         16.10413      host FT1-NodeB
 10    hdd   1.86029          osd.10          up   1.00000 1.00000
 11    hdd   1.86029          osd.11          up   1.00000 1.00000
 26    hdd   1.86029          osd.26          up   1.00000 1.00000
 27    hdd   1.86029          osd.27          up   1.00000 1.00000
  6    ssd   0.93149          osd.6           up   1.00000 1.00000
  7    ssd   0.93149          osd.7           up   1.00000 1.00000
 25    ssd   3.49309          osd.25          up   1.00000 1.00000
 41    ssd   3.30690          osd.41          up   1.00000 1.00000
-10         15.84383      host FT1-NodeC
 14    hdd   1.59999          osd.14          up   1.00000 1.00000
 15    hdd   1.86029          osd.15          up   1.00000 1.00000
 16    hdd   1.86029          osd.16          up   1.00000 1.00000
 17    hdd   1.86029          osd.17          up   1.00000 1.00000
  8    ssd   0.93149          osd.8           up   1.00000 1.00000
  9    ssd   0.93149          osd.9           up   1.00000 1.00000
 24    ssd   3.49309          osd.24          up   1.00000 1.00000
 43    ssd   3.30690          osd.43          up   1.00000 1.00000
-13         15.17264      host FT1-NodeD
 20    hdd   1.86029          osd.20          up   1.00000 1.00000
 21    hdd   1.86029          osd.21          up   1.00000 1.00000
 22    hdd   1.86029          osd.22          up   1.00000 1.00000
 23    hdd   1.86029          osd.23          up   1.00000 1.00000
 12    ssd   3.30690          osd.12          up   1.00000 1.00000
 13    ssd   0.93149          osd.13          up   1.00000 1.00000
 19    ssd   3.49309          osd.19          up   1.00000 1.00000

The spinners have their RocksDB on the NVMe drives for extra performance, but I had not noticed any CRC errors prior to installing the 4TB Samsungs in each node.
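Even though the drive I pulled tested fine, it is probably also worth looking at the SMART/media error counters on all the new 4TB Samsungs (device names below are just examples, adjust per node):

# smartctl -a /dev/nvme1n1
# nvme smart-log /dev/nvme1

Zero media errors there would make the network path the more likely suspect.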

Suggestions are welcome.

thanks

Roland