Hi Paul,

*Thanks, I've run the following on all four interfaces:*

sudo ethtool -K <interface> rx off tx off sg off tso off ufo off gso off
gro off lro off rxvlan off txvlan off ntuple off rxhash off
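(To cover all four ports I wrapped that in a small loop, roughly like the
sketch below. The interface names are just examples for my hosts; adjust
per node. Note ethtool settings don't survive a reboot, so I'd hook this
into a post-up script or a systemd unit to make it stick.)

#!/bin/bash
# Rough sketch: disable offload features on each NIC port.
# Interface names are examples; features the NIC/driver doesn't support
# will just report "Cannot change ..." and be skipped.
for IF in enp3s0f0 enp3s0f1 enp4s0f0 enp4s0f1; do
    for FEAT in rx tx sg tso ufo gso gro lro rxvlan txvlan ntuple rxhash; do
        sudo ethtool -K "$IF" "$FEAT" off
    done
done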
*The following do not seem to be available to change:*
Cannot change udp-fragmentation-offload
Cannot change large-receive-offload
Cannot change ntuple-filters

Holding thumbs this helps; however, I still don't understand why the issue
only occurs on ceph-osd nodes. ceph-mon and ceph-mds nodes, and even a
ceph client with the same adapters, do not have these issues.

Kind regards
Geoffrey Rhodes


On Thu, 18 Jul 2019 at 18:35, Paul Emmerich <paul.emmer...@croit.io> wrote:

> Hi,
>
> Intel 82576 is... bad. I've seen quite a few problems with these older
> igb family NICs, but losing the PCIe link is a new one.
> I usually see them getting stuck with a message like "tx queue X hung,
> resetting device..."
>
> Try to disable offloading features using ethtool; that sometimes helps
> with the problems that I've seen. Maybe that's just a variant of the
> stuck problem?
>
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
> On Thu, Jul 18, 2019 at 12:47 PM Geoffrey Rhodes <geoff...@rhodes.org.za>
> wrote:
>
>> Hi Cephers,
>>
>> I've been having an issue since upgrading my cluster to Mimic 6 months
>> ago (previously installed with Luminous 12.2.1).
>> All nodes that have the same PCIe network card seem to lose network
>> connectivity randomly (frequency ranges from a few days to weeks per
>> host node).
>> The affected nodes only have the Intel 82576 LAN card in common:
>> different motherboards, installed drives, RAM and even PSUs.
>> Nodes that have the Intel I350 cards are not affected by the Mimic
>> upgrade.
>> Each host node has the recommended RAM installed and has between 4 and
>> 6 OSDs / SATA hard drives installed.
>> The cluster operated for over a year (Luminous) without a single issue;
>> only after the Mimic upgrade did the issues begin with these nodes.
>> The cluster is only used for CephFS (file storage, low-intensity usage)
>> and makes use of an erasure-coded data pool (K=4, M=2).
>>
>> I've tested many things: different kernel versions, different Ubuntu
>> LTS releases, re-installation and even CentOS 7, different releases of
>> Mimic, and different igb drivers.
>> If I stop the ceph-osd daemons the issue does not occur. If I swap out
>> the Intel 82576 card with the Intel I350, the issue is resolved.
>> I haven't any more ideas other than replacing the cards, but I feel the
>> issue is linked to the ceph-osd daemon and a change in the Mimic
>> release.
>> Below are the various software versions and drivers I've tried, and a
>> log extract from a node that lost network connectivity. Any help or
>> suggestions would be greatly appreciated.
>>
>> *OS:* Ubuntu 16.04 / 18.04 and recently CentOS 7
>> *Ceph Version:* Mimic (currently 13.2.6)
>> *Network card:* 4-port 1GB Intel 82576 LAN card (AOC-SG-I4)
>> *Driver:* igb
>> *Driver Versions:* 5.3.0-k / 5.3.5.22s / 5.4.0-k
>> *Network Config:* 2 x bonded (LACP) 1GB NIC for public net, 2 x
>> bonded (LACP) 1GB NIC for private net
>> *Log errors:*
>> Jun 27 12:10:28 cephnode5 kernel: [497346.638608] igb 0000:03:00.0
>> enp3s0f0: PCIe link lost, device now detached
>> Jun 27 12:10:28 cephnode5 kernel: [497346.686752] igb 0000:04:00.1
>> enp4s0f1: PCIe link lost, device now detached
>> Jun 27 12:10:29 cephnode5 kernel: [497347.550473] igb 0000:03:00.1
>> enp3s0f1: PCIe link lost, device now detached
>> Jun 27 12:10:29 cephnode5 kernel: [497347.646785] igb 0000:04:00.0
>> enp4s0f0: PCIe link lost, device now detached
>> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
>> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
>> 10.100.4.1:6809 osd.16 since back 2019-06-27 12:10:27.438961 front
>> 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
>> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
>> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
>> 10.100.6.1:6804 osd.20 since back 2019-06-27 12:10:27.438961 front
>> 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
>> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
>> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
>> 10.100.7.1:6803 osd.25 since back 2019-06-27 12:10:23.338012 front
>> 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
>> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
>> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
>> 10.100.8.1:6803 osd.30 since back 2019-06-27 12:10:27.438961 front
>> 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
>> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
>> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
>> 10.100.9.1:6808 osd.43 since back 2019-06-27 12:10:23.338012 front
>> 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
>>
>>
>> Kind regards
>> Geoffrey Rhodes
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
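P.S. In case it helps anyone else chasing this: a small watcher along the
lines of the sketch below could capture state the moment the kernel logs
the link loss. (The PCI address, interface name and log path are just
examples based on my node5; run it as root.)

#!/bin/bash
# Sketch: follow the kernel log and dump diagnostics when the igb
# "PCIe link lost" message appears. Run as root.
journalctl -kf | while read -r LINE; do
    if echo "$LINE" | grep -q "PCIe link lost"; then
        TS=$(date +%Y%m%d-%H%M%S)
        {
            echo "$LINE"
            lspci -vv -s 03:00.0   # example: PCI address of one 82576 port
            ethtool -S enp3s0f0    # example interface; may fail once detached
            ip -s link
        } > "/var/log/pcie-loss-$TS.log" 2>&1
    fi
done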
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com