Hello,
I have a FreeBSD machine used as an office router for an organisation. It
was installed years ago and its configuration is:
- IBM system x3250m2
- 4 gigs RAM
- Intel Xeon E3110
- two WDC WD5000AADS-0 in a zfs mirror.
- bge(4) NetXtreme BCM5722 Gigabit Ethernet PCI Express, initially present
- em(4) 82572EI Gigabit Ethernet Controller (Copper) adapter, added later
This server carries both WAN and LAN traffic over a single bge(4) link to
a Cisco Catalyst 2960, using several VLANs.
After several years of running, starting from 10.x, and at a point when
12.2 had already been installed for quite some time, I started seeing a
huge number of input errors on the interface. They were incrementing
dev.bge.0 counters such as
dev.bge.0.stats.InputDiscards
The input error rate varied from 0 (most of the time) to 6K-80K per
second. The observed interface input rate was around 300 Mbps at its
peak.
Sample of netstat -I bge0 1 output, showing a burst of errors and the
amount of traffic:
            input          bge0           output
 packets  errs idrops     bytes packets errs     bytes colls
   20695   701      0  19244062   18182    0  17059035     0
     929 61003      0    938482     438    0    118494     0
    1383 44667      0   1094321     537    0    196633     0
   11726     1      0   8904828   11560    0   9012153     0
    6116     0      0   3991680    6051    0   4001106     0
    4772     0      0   3210074    4769    0   3224114     0
    9679     0      0   8507153    9622    0   8719630     0
   12355     0      0  10212352   12251    0  10288762     0
    2975     0      0   1457118    2946    0   1466755     0
    4397     0      0   3051610    4377    0   3056513     0
    4782     0      0   3405659    4806    0   3501414     0
    9202     0      0   7891629    9204    0   8080658     0
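To put numbers on the burst in that sample, here is a minimal Python sketch that parses the first four rows and computes, per one-second interval, the share of frames counted as input errors and the input rate in Mbit/s. The field names and helper functions are my own labels, not anything netstat defines, and the share calculation assumes that errored frames are not also counted in the packets column.

```python
# Rows copied from the netstat sample above (first four intervals).
SAMPLE = """\
20695 701 0 19244062 18182 0 17059035 0
929 61003 0 938482 438 0 118494 0
1383 44667 0 1094321 537 0 196633 0
11726 1 0 8904828 11560 0 9012153 0
"""

# Hypothetical field labels for the eight netstat columns.
FIELDS = ("ipkts", "ierrs", "idrops", "ibytes",
          "opkts", "oerrs", "obytes", "colls")


def parse_rows(text):
    """Turn whitespace-separated netstat rows into dicts of ints."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip header lines and anything that is not eight integers.
        if len(parts) != len(FIELDS) or not all(p.isdigit() for p in parts):
            continue
        rows.append(dict(zip(FIELDS, map(int, parts))))
    return rows


def error_share(row):
    """Fraction of frames in the interval that were counted as input errors.

    Assumption: errored frames are NOT included in the packets column.
    """
    seen = row["ipkts"] + row["ierrs"]
    return row["ierrs"] / seen if seen else 0.0


def input_mbps(row):
    """Input rate for a one-second interval, in Mbit/s."""
    return row["ibytes"] * 8 / 1e6


rows = parse_rows(SAMPLE)
for r in rows:
    print(f'{r["ipkts"]:>6} pkts {r["ierrs"]:>6} errs '
          f'{error_share(r):6.1%} errored  {input_mbps(r):7.1f} Mbit/s in')
```

Under that assumption, the two bad seconds lose nearly all incoming frames, and the accepted input rate collapses at the same time, which looks more like the receive path stalling than like steady overload.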
The Catalyst shows no output errors on the port that the FreeBSD machine
is connected to while this is happening.
Recovery measures that I attempted and that failed to resolve the
situation (one step at a time):
- replaced the patch cable to the Catalyst
- changed the onboard port from 1 to 0
- started to suspect the onboard Ethernet controller, so I added an Intel
Pro/1000 MT external adapter via the riser card; the errors migrated to
the dev.em.0.mac_stats.missed_packets counter, sometimes also
incrementing dev.em.0.mac_stats.recv_no_buff:
dev.em.0.mac_stats.recv_no_buff: 9424
dev.em.0.mac_stats.missed_packets: 1853592
- added the iflib/netmap tuning:
net.isr.numthreads="2"
net.isr.maxthreads="2"
dev.em.0.iflib.rx_budget="65535"
dev.em.0.iflib.override_nrxds="4096"
dev.em.0.iflib.override_ntxds="4096"
dev.em.0.iflib.disable_msix="0"
- added interrupt moderation:
dev.em.0.rx_int_delay="200"
dev.em.0.tx_int_delay="200"
dev.em.0.rx_abs_int_delay="4000"
dev.em.0.tx_abs_int_delay="4000"
- tried changing kern.eventtimer:
kern.eventtimer.periodic="1"
- moved the em(4) driver out of the kernel into a dynamically loadable
module
- replaced the stock module with the one from the net/intel-em-kmod
port, with netmap compiled out. At this point the errors stopped for
almost a day, and I was quick enough to report this as a regression in
the FreeBSD bug tracker (I closed the bug as misdiagnosed after realizing
this didn't help).
- upgraded the system to FreeBSD 13.0
I also noticed that there's no correlation between reboots and the onset
of errors: sometimes the error counter would start increasing right
after boot, sometimes several hours would pass first.
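To pin down exactly when the counters start climbing, so the timestamps can be compared with other events on the box, something like the following Python sketch could be left running. It polls the two em(4) counters mentioned above through sysctl(8) and logs only the intervals in which one of them grew. The function names, the one-second default interval and the injectable `read`/`sleep` parameters are my own choices for illustration, not part of any FreeBSD tool.

```python
import subprocess
import time

# Counters named earlier in this message.
OIDS = (
    "dev.em.0.mac_stats.missed_packets",
    "dev.em.0.mac_stats.recv_no_buff",
)


def read_sysctl(oid):
    """Read one numeric sysctl value by running `sysctl -n <oid>`."""
    out = subprocess.run(["sysctl", "-n", oid], check=True,
                         capture_output=True, text=True).stdout
    return int(out.strip())


def deltas(prev, cur):
    """Per-counter increase between two snapshots (dicts of oid -> value)."""
    return {oid: cur[oid] - prev[oid] for oid in cur}


def watch(read=read_sysctl, interval=1.0, rounds=None, sleep=time.sleep):
    """Log a timestamped line for every interval in which a counter grew.

    rounds=None polls forever; a number limits the polling rounds.
    Returns the list of (timestamp, deltas) bursts that were seen.
    """
    prev = {oid: read(oid) for oid in OIDS}
    bursts = []
    done = 0
    while rounds is None or done < rounds:
        sleep(interval)
        cur = {oid: read(oid) for oid in OIDS}
        d = deltas(prev, cur)
        if any(v > 0 for v in d.values()):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            bursts.append((stamp, d))
            print(stamp, d)
        prev = cur
        done += 1
    return bursts
```

On the router itself this would be started with `watch()`. The `read` and `sleep` arguments exist only so the loop can be exercised on a machine without these sysctls.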
After realizing there were no options left, we moved the router role to
the neighboring server running FreeBSD 12.1-STABLE (I don't suspect the
version that much, that is just what it was running), and since then the
errors have stopped (however, the network adapter there is igb(4)). After
removing the load from the x3250 we ran a full memtest scan, which
reported no errors over several passes (I didn't suspect the memory to be
the root cause anyway, since the old machine was able to build world
successfully, which is almost impossible with memory issues).
So the obvious question is: what can be the cause of such errors?
Lack of system memory (the only thing that comes to mind)?
The memory distribution (when idle) looks like this:
Mem: 38M Active, 751M Inact, 2122M Wired, 988M Free
ARC: 1064M Total, 312M MFU, 426M MRU, 575K Anon, 32M Header, 275M Other
559M Compressed, 1750M Uncompressed, 3,13:1 Ratio
Swap: 2048M Total, 2048M Free
But under load there's almost no free memory. However, I've checked
netstat -m, and it reports that no mbuf requests were denied. The CPU is
barely loaded during the peak input rate, or at the moments when the
errors start to pile up.
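For completeness, this is roughly how the "denied" check could be automated instead of read by eye: a small Python sketch that pulls the denied-request counters out of netstat -m text. The sample text below is illustrative only. It follows the line format FreeBSD prints, with all-zero values matching what I saw, but it is not a capture from this machine, and the function names are made up for the example.

```python
import re

# Illustrative sample in the format of `netstat -m`; NOT captured output.
ILLUSTRATIVE = """\
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
"""

# Matches lines such as "0/0/0 requests for mbufs denied (...)".
DENIED = re.compile(r"^\s*(\d+)/(\d+)/(\d+) requests for (.+?) denied", re.M)


def denied_counts(text):
    """Map each resource kind to its three denied-request counters."""
    return {
        m.group(4): tuple(int(m.group(i)) for i in (1, 2, 3))
        for m in DENIED.finditer(text)
    }


def any_denied(text):
    """True if any denied-request counter is non-zero."""
    return any(sum(v) > 0 for v in denied_counts(text).values())


print(denied_counts(ILLUSTRATIVE))
print("allocation failures seen:", any_denied(ILLUSTRATIVE))
```

On the real system the input would come from running netstat -m rather than from the embedded sample.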
Thanks.
Eugene.