Hi Paul,

*Thanks, I've run the following on all four interfaces:*
sudo ethtool -K <interface> rx off tx off sg off tso off ufo off gso off \
    gro off lro off rxvlan off txvlan off ntuple off rxhash off
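
For reference, something like the following loop is one way to apply the
same settings to all four ports in one go (just a sketch; the interface
names are taken from the kernel log further down and will differ per node):

  # Apply the offload settings to each of the four igb ports.
  for iface in enp3s0f0 enp3s0f1 enp4s0f0 enp4s0f1; do
      sudo ethtool -K "$iface" rx off tx off sg off tso off ufo off \
          gso off gro off lro off rxvlan off txvlan off ntuple off rxhash off
  done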

*The following do not seem to be available to change:*
Cannot change udp-fragmentation-offload
Cannot change large-receive-offload
Cannot change ntuple-filters
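
Those three presumably can't be toggled on this driver/hardware; ethtool
lists such features with a [fixed] marker, so they can be confirmed with
something like (same assumed interface names as above):

  # Show which offload features the driver reports as unchangeable.
  for iface in enp3s0f0 enp3s0f1 enp4s0f0 enp4s0f1; do
      echo "== $iface =="
      ethtool -k "$iface" | grep fixed
  done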

Holding thumbs this helps, however I still don't understand why the issue
only occurs on ceph-osd nodes.
The ceph-mon and ceph-mds nodes, and even a ceph client with the same
adapters, do not have these issues.
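
In the meantime I'll keep watching the kernel log on the OSD nodes so the
next occurrence is caught straight away, along these lines (assuming
journald; the match string comes from the log extract in my earlier mail):

  # Follow kernel messages and flag any igb port dropping off the PCIe bus.
  journalctl -k -f | grep --line-buffered 'PCIe link lost'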

Kind regards
Geoffrey Rhodes


On Thu, 18 Jul 2019 at 18:35, Paul Emmerich <paul.emmer...@croit.io> wrote:

> Hi,
>
> Intel 82576 is... bad. I've seen quite a few problems with these older
> igb family NICs, but losing the PCIe link is a new one.
> I usually see them getting stuck with a message like "tx queue X hung,
> resetting device..."
>
> Try disabling offloading features using ethtool; that sometimes helps
> with the problems that I've seen. Maybe that's just a variant of the
> stuck-queue problem?
>
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
> On Thu, Jul 18, 2019 at 12:47 PM Geoffrey Rhodes <geoff...@rhodes.org.za>
> wrote:
>
>> Hi Cephers,
>>
>> I've been having an issue since upgrading my cluster to Mimic 6 months
>> ago (previously installed with Luminous 12.2.1).
>> All nodes that have the same PCIe network card seem to lose network
>> connectivity randomly (the frequency ranges from a few days to a few
>> weeks per host node).
>> The affected nodes only have the Intel 82576 LAN card in common; they
>> have different motherboards, installed drives, RAM and even PSUs.
>> Nodes that have the Intel I350 cards are not affected by the Mimic
>> upgrade.
>> Each host node has the recommended amount of RAM installed and has
>> between 4 and 6 OSDs / SATA hard drives.
>> The cluster operated for over a year (Luminous) without a single issue,
>> only after the Mimic upgrade did the issues begin with these nodes.
>> The cluster is only used for CephFS (file storage, low-intensity usage)
>> and makes use of an erasure-coded data pool (K=4, M=2).
>>
>> I've tested many things: different kernel versions, different Ubuntu LTS
>> releases, re-installation and even CentOS 7, different releases of Mimic,
>> and different igb drivers.
>> If I stop the ceph-osd daemons the issue does not occur.  If I swap out
>> the Intel 82576 card for the Intel I350 the issue is resolved.
>> I don't have any more ideas other than replacing the cards, but I feel
>> the issue is linked to the ceph-osd daemon and a change in the Mimic
>> release.
>> Below are the various software versions and drivers I've tried and a log
>> extract from a node that lost network connectivity. Any help or
>> suggestions would be greatly appreciated.
>>
>> *OS:*                        Ubuntu 16.04 / 18.04 and recently CentOS 7
>> *Ceph version:*        Mimic (currently 13.2.6)
>> *Network card:*        4-port 1GbE Intel 82576 LAN card (AOC-SG-I4)
>> *Driver:*                    igb
>> *Driver versions:*     5.3.0-k / 5.3.5.22s / 5.4.0-k
>> *Network config:*     2 x bonded (LACP) 1GbE NICs for public net,
>>                        2 x bonded (LACP) 1GbE NICs for private net
>> *Log errors:*
>> Jun 27 12:10:28 cephnode5 kernel: [497346.638608] igb 0000:03:00.0
>> enp3s0f0: PCIe link lost, device now detached
>> Jun 27 12:10:28 cephnode5 kernel: [497346.686752] igb 0000:04:00.1
>> enp4s0f1: PCIe link lost, device now detached
>> Jun 27 12:10:29 cephnode5 kernel: [497347.550473] igb 0000:03:00.1
>> enp3s0f1: PCIe link lost, device now detached
>> Jun 27 12:10:29 cephnode5 kernel: [497347.646785] igb 0000:04:00.0
>> enp4s0f0: PCIe link lost, device now detached
>> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
>> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
>> 10.100.4.1:6809 osd.16 since back 2019-06
>> -27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
>> 12:10:23.796726)
>> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
>> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
>> 10.100.6.1:6804 osd.20 since back 2019-06
>> -27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
>> 12:10:23.796726)
>> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
>> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
>> 10.100.7.1:6803 osd.25 since back 2019-06
>> -27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
>> 12:10:23.796726)
>> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
>> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
>> 10.100.8.1:6803 osd.30 since back 2019-06
>> -27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
>> 12:10:23.796726)
>> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
>> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
>> 10.100.9.1:6808 osd.43 since back 2019-06
>> -27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
>> 12:10:23.796726)
>>
>>
>> Kind regards
>> Geoffrey Rhodes
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
