Hi,

the Intel 82576 is... bad. I've seen quite a few problems with these older igb-family NICs, but losing the PCIe link is a new one. I usually see them getting stuck with a message like "tx queue X hung, resetting device..."
Try disabling the offloading features with ethtool; that sometimes helps with the problems I've seen. Maybe this is just a variant of the stuck-queue problem?

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Jul 18, 2019 at 12:47 PM Geoffrey Rhodes <geoff...@rhodes.org.za> wrote:
> Hi Cephers,
>
> I've been having an issue since upgrading my cluster to Mimic 6 months ago
> (previously installed with Luminous 12.2.1).
> All nodes that have the same PCIe network card seem to lose network
> connectivity randomly. (The frequency ranges from a few days to weeks per
> host node.)
> The affected nodes only have the Intel 82576 LAN card in common; they have
> different motherboards, installed drives, RAM and even PSUs.
> Nodes that have the Intel I350 cards are not affected by the Mimic upgrade.
> Each host node has the recommended RAM installed and has between 4 and 6
> OSDs / SATA hard drives installed.
> The cluster operated for over a year (Luminous) without a single issue;
> only after the Mimic upgrade did the issues begin with these nodes.
> The cluster is only used for CephFS (file storage, low intensity usage)
> and makes use of an erasure-coded data pool (K=4, M=2).
>
> I've tested many things: different kernel versions, different Ubuntu LTS
> releases, re-installation and even CENTOS 7, different releases of Mimic,
> different igb drivers.
> If I stop the ceph-osd daemons the issue does not occur. If I swap out
> the Intel 82576 card with the Intel I350 the issue is resolved.
> I haven't any more ideas other than replacing the cards, but I feel the
> issue is linked to the ceph-osd daemon and a change in the Mimic release.
> Below are the various software versions and drivers I've tried and a log
> extract from a node that lost network connectivity. Any help or
> suggestions would be greatly appreciated.
>
> *OS:* Ubuntu 16.04 / 18.04 and recently CENTOS 7
> *Ceph Version:* Mimic (currently 13.2.6)
> *Network card:* 4-PORT 1GB INTEL 82576 LAN CARD (AOC-SG-I4)
> *Driver:* igb
> *Driver Versions:* 5.3.0-k / 5.3.5.22s / 5.4.0-k
> *Network Config:* 2 x bonded (LACP) 1GB nic for public net, 2 x bonded (LACP) 1GB nic for private net
> *Log errors:*
> Jun 27 12:10:28 cephnode5 kernel: [497346.638608] igb 0000:03:00.0 enp3s0f0: PCIe link lost, device now detached
> Jun 27 12:10:28 cephnode5 kernel: [497346.686752] igb 0000:04:00.1 enp4s0f1: PCIe link lost, device now detached
> Jun 27 12:10:29 cephnode5 kernel: [497347.550473] igb 0000:03:00.1 enp3s0f1: PCIe link lost, device now detached
> Jun 27 12:10:29 cephnode5 kernel: [497347.646785] igb 0000:04:00.0 enp4s0f0: PCIe link lost, device now detached
> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.4.1:6809 osd.16 since back 2019-06-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.6.1:6804 osd.20 since back 2019-06-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.7.1:6803 osd.25 since back 2019-06-27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.8.1:6803 osd.30 since back 2019-06-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.9.1:6808 osd.43 since back 2019-06-27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
>
> Kind regards
> Geoffrey Rhodes
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
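
To make the ethtool suggestion in the reply concrete, here is a rough sketch of how the offloads could be switched off on the four igb ports named in the quoted log. It is only an illustration under assumptions: the helper names (`build_offload_command`, `disable_offloads`) are made up for this example, the feature list is a guess at which offloads are worth trying (the reply does not name any), and the commands need root. By default it only prints what it would run.

```python
import subprocess

# Interface names taken from the quoted kernel log; adjust for your host.
IGB_INTERFACES = ["enp3s0f0", "enp3s0f1", "enp4s0f0", "enp4s0f1"]

# Standard `ethtool -K` feature keywords. Which of these matter for the
# 82576 problem is an assumption, not something stated in the thread.
OFFLOADS = ["tso", "gso", "gro", "sg", "rx", "tx"]


def build_offload_command(interface, features=OFFLOADS):
    """Return the argv for `ethtool -K <interface> <feature> off ...`."""
    cmd = ["ethtool", "-K", interface]
    for feature in features:
        cmd += [feature, "off"]
    return cmd


def disable_offloads(interfaces=IGB_INTERFACES, dry_run=True):
    """Print each ethtool command and, unless dry_run is set, execute it."""
    commands = [build_offload_command(iface) for iface in interfaces]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=False: ethtool can exit non-zero when nothing changed,
            # so report the return code instead of raising.
            result = subprocess.run(cmd, check=False)
            print("  exit code:", result.returncode)
    return commands


if __name__ == "__main__":
    disable_offloads(dry_run=True)
```

The current state can be compared before and after with `ethtool -k <interface>` (lowercase k). Settings applied this way do not survive a reboot, so they would have to be reapplied from the network configuration if they turn out to help.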