From https://forum.proxmox.com/threads/hyperconverged-cluster-logging-seemingly-random-crc-errors

We have 4 nodes (dual Xeon CPUs, 256 GB RAM, 4 NVMe SSDs, 4 HDDs and dual Mellanox 25Gb/s SFPs) in a cluster. I have started noticing apparently random CRC errors in the OSD logs.

Node B, osd.6
2025-10-23T10:32:59.808+0200 7f22a75bf700  0 bad crc in data 3330350463 != exp 677417498 from v1:192.168.131.4:0/3121668685
192.168.131.4 is node D

Node B, osd.7
2025-10-23T09:35:12.995+0200 7fbcdbcd7700  0 bad crc in data 3922083958 != exp 3479198006 from v1:192.168.131.2:0/2732728486
192.168.131.2 is node B, which is the node osd.7 is on.

There are others like this on other nodes and OSDs. From what I understand, this means that data copied from some other OSD to the one logging the error fails the CRC check. However, as a test I have taken one of these SSDs out of the cluster and it tests just fine. After putting it back, no CRC errors are logged for it.
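To see whether the errors correlate with a particular peer, one way would be to count the source addresses across the OSD logs on each node, something like this (assuming the default log location and the v1 address format shown above):

# grep -h "bad crc in data" /var/log/ceph/ceph-osd.*.log | awk '{print $NF}' | cut -d: -f2 | sort | uniq -c | sort -rn

If one IP dominates, that would point at a specific host or its network path rather than a single disk.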

Question: Could something else be causing this? A network connector? It seems pretty random to me, so how can I trace the source of this?
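The only thing I can think of checking myself is the NIC/driver error counters on each node, to see whether a port or cable is flaking out (the interface name below is just a placeholder for the 25Gb/s ports):

# ethtool -S enp65s0f0np0 | grep -iE 'err|drop|crc'
# ip -s link show enp65s0f0np0

Non-zero physical/CRC error counters or steadily climbing drops would point at a cable or transceiver rather than the SSDs.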

The vendor's tech support suggested turning offloading off on the NIC: ethtool -K vmbr0 rx off tx off tso off gso off gro off lro off. That doesn't seem to make a difference, however.
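(I assume the resulting settings can be verified with ethtool -k, lowercase; the second interface name below is just an example, since presumably the underlying physical ports matter as much as the bridge:

# ethtool -k vmbr0 | grep -E 'offload|checksum|segmentation'
# ethtool -k enp65s0f0np0 | grep -E 'offload|checksum|segmentation'
)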

The next step would be to check for an MTU/jumbo frame mismatch.
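Assuming jumbo frames (MTU 9000) on the cluster network, a quick test for a mismatch between nodes would be a don't-fragment ping with a payload just under the MTU (9000 minus 28 bytes of IP+ICMP headers; for a standard 1500 MTU the payload would be 1472), e.g. from node B to node D:

# ping -M do -s 8972 -c 4 192.168.131.4

If that fails while a normal ping works, some hop (NIC, bridge or switch port) is not carrying the full frame size. The configured MTUs per interface can be compared with:

# ip -o link show | awk '{print $2, $4, $5}'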


Some details: ceph version 17.2.7

# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME  STATUS  REWEIGHT  PRI-AFF
 -1         62.29883  root default
 -3         15.17824      host FT1-NodeA
  2    hdd   1.86029          osd.2           up   1.00000 1.00000
  3    hdd   1.86029          osd.3           up   1.00000 1.00000
  4    hdd   1.86029          osd.4           up   1.00000 1.00000
  5    hdd   1.86589          osd.5           up   1.00000 1.00000
  0    ssd   0.93149          osd.0           up   1.00000 1.00000
 28    ssd   3.30690          osd.28          up   1.00000 1.00000
 29    ssd   3.49309          osd.29          up   1.00000 1.00000
 -7         16.10413      host FT1-NodeB
 10    hdd   1.86029          osd.10          up   1.00000 1.00000
 11    hdd   1.86029          osd.11          up   1.00000 1.00000
 26    hdd   1.86029          osd.26          up   1.00000 1.00000
 27    hdd   1.86029          osd.27          up   1.00000 1.00000
  6    ssd   0.93149          osd.6           up   1.00000 1.00000
  7    ssd   0.93149          osd.7           up   1.00000 1.00000
 25    ssd   3.49309          osd.25          up   1.00000 1.00000
 41    ssd   3.30690          osd.41          up   1.00000 1.00000
-10         15.84383      host FT1-NodeC
 14    hdd   1.59999          osd.14          up   1.00000 1.00000
 15    hdd   1.86029          osd.15          up   1.00000 1.00000
 16    hdd   1.86029          osd.16          up   1.00000 1.00000
 17    hdd   1.86029          osd.17          up   1.00000 1.00000
  8    ssd   0.93149          osd.8           up   1.00000 1.00000
  9    ssd   0.93149          osd.9           up   1.00000 1.00000
 24    ssd   3.49309          osd.24          up   1.00000 1.00000
 43    ssd   3.30690          osd.43          up   1.00000 1.00000
-13         15.17264      host FT1-NodeD
 20    hdd   1.86029          osd.20          up   1.00000 1.00000
 21    hdd   1.86029          osd.21          up   1.00000 1.00000
 22    hdd   1.86029          osd.22          up   1.00000 1.00000
 23    hdd   1.86029          osd.23          up   1.00000 1.00000
 12    ssd   3.30690          osd.12          up   1.00000 1.00000
 13    ssd   0.93149          osd.13          up   1.00000 1.00000
 19    ssd   3.49309          osd.19          up   1.00000 1.00000

The spinners have their RocksDB on the NVMe drives for extra performance, but I had not noticed any CRC errors prior to installing the 4TB Samsungs in each node.
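Even though the drive I pulled tested fine, it is probably also worth looking at the SMART/media error counters on all the new 4TB Samsungs (device names below are just examples, adjust per node):

# smartctl -a /dev/nvme1n1
# nvme smart-log /dev/nvme1

Zero media errors there would make the network path the more likely suspect.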

Suggestions are welcome.

thanks

Roland