Benjamin Rosenblum wrote:

The em driver itself is extremely buggy. Many people, myself included, are hitting major problems with this driver that are causing serious issues. I can't transfer any large files to my server because the em driver panics and drops the connection for 15-20 seconds. It's a real pain when this happens, too, because this is my primary network storage server. I have had to resort to the backup systems lately because of this problem. I think the entire em network driver needs to be reworked and all these bugs really need to be taken care of, since this is one of the top three or so network cards used in the field today for gigabit transfer.

Does anyone have the programming data for the chipsets so the driver could be taken further? I've been unable to obtain them from Intel despite repeated attempts.

Pete

Gleb Smirnoff wrote:

 Colleagues,

 For the last month we have been experiencing a nasty problem with the
em(4) driver. Several times a day the receive path of the driver wedges
for a minute or two. During the wedge the transmit path works with
no problems. The latter fact makes this problem very nasty, because
the affected router can't be backed up with the help of CARP.

Some details: during the wedge all incoming packets are lost and
counted as "Missed packets". I've checked this using
`sysctl dev.em.0.stats=1`. The `dmesg` output is the following:

em0: Excessive collisions = 0
em0: Symbol errors = 0
em0: Sequence errors = 0
em0: Defer count = 0
em0: Missed Packets = 1266
em0: Receive No Buffers = 220
em0: Receive length errors = 0
em0: Receive errors = 0
em0: Crc errors = 0
em0: Alignment errors = 0
em0: Carrier extension errors = 0
em0: XON Rcvd = 0
em0: XON Xmtd = 0
em0: XOFF Rcvd = 0
em0: XOFF Xmtd = 0
em0: Good Packets Rcvd = 28347789
em0: Good Packets Xmtd = 30911959

There is clear evidence that the command `sysctl dev.em.0.stats=1` itself
can trigger the wedge. It is important to note that the stats are printed
to a 9600 baud serial console, and this takes about a second. I have
a suspicion that the wedge happens when the kernel doesn't service NIC
interrupts for some period of time. Yes, some packets should be lost in
this case, but the wedge must not continue for minutes!

The box is serving 8 - 15 kpps, 70 - 100 Mbit/s. It runs a stateful pf(4)
firewall, with 50k - 80k states. IP fastforwarding is enabled. The
average state insert/removal rate is 300 states per second, however
sometimes several thousand states can be removed in one pass. The
state removal locks the network code for quite a long time, so I guess
that the wedge happens exactly when a lot of states are removed. The NIC
interrupts aren't serviced for some time and it wedges.
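
A rough back-of-the-envelope sketch of why an unserviced interrupt period loses packets at these rates. The function names are hypothetical, the 256-descriptor receive ring used in the example is an assumption (the ring size is not stated in this post), and the serial estimate assumes 8N1 framing, i.e. 10 bits on the wire per byte. It models only ordinary ring overflow, not the wedge itself.

```c
#include <stdint.h>

/* Packets that arrive while RX interrupts go unserviced for stall_ms. */
uint64_t packets_during_stall(uint64_t pps, uint64_t stall_ms)
{
    return pps * stall_ms / 1000;
}

/*
 * Packets that cannot be buffered during the stall, assuming the ring
 * starts empty and holds rx_descriptors packets (assumed value, not
 * taken from the post).
 */
uint64_t packets_missed(uint64_t pps, uint64_t stall_ms,
                        uint64_t rx_descriptors)
{
    uint64_t arrived = packets_during_stall(pps, stall_ms);

    return arrived > rx_descriptors ? arrived - rx_descriptors : 0;
}

/*
 * Milliseconds needed to push nbytes through a serial console,
 * assuming 8N1 framing (10 bits per byte).
 */
uint64_t serial_print_ms(uint64_t nbytes, uint64_t baud)
{
    return nbytes * 10 * 1000 / baud;
}
```

With these assumptions, a one-second stall at 8 kpps gives packets_missed(8000, 1000, 256) = 7744, while a 10 ms stall at the same rate fits in the ring and loses nothing. That matches the expectation that some loss during the stall is normal; it does not explain why reception stays dead for minutes afterwards.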

The hardware is a Supermicro server, with two onboard NICs:
dev.em.0.%pnpinfo: vendor=0x8086 device=0x1075 subvendor=0x8086 subdevice=0x1075 class=0x020000
dev.em.1.%pnpinfo: vendor=0x8086 device=0x1076 subvendor=0x8086 subdevice=0x1076 class=0x020000

The NIC is plugged into a Cisco Catalyst 6509 gigabit Ethernet port. No
errors are counted on the switch port.

To work around the problem, I have made the following patch:

@@ -1650,12 +1651,18 @@
       struct ifnet   *ifp;
       struct adapter * adapter = arg;
       ifp = adapter->ifp;
+       uint64_t        ompc;

       EM_LOCK(adapter);

       em_check_for_link(&adapter->hw);
       em_print_link_status(adapter);
-       em_update_stats_counters(adapter);
+       ompc = adapter->stats.mpc;
+       em_update_stats_counters(adapter);
+       if (adapter->stats.mpc > ompc) {
+               printf("em watchdog: mpc %lld->%lld\n", ompc, adapter->stats.mpc);
+               em_init_locked(adapter);
+       }
       if (em_display_debug_stats && ifp->if_drv_flags & IFF_DRV_RUNNING) {
               em_print_hw_stats(adapter);
       }

It helps to reduce downtime from a few minutes to 2 seconds, but this
is a very dirty approach to the problem. Sample output at runtime
with the patch:

em watchdog: mpc 1767->2739
em watchdog: mpc 2739->4724
em watchdog: mpc 4724->7794
em watchdog: mpc 7794->10729

Every time this is printed, the network wedges for 2 seconds and then
revives.
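
The check the patch adds to the periodic timer can be sketched in isolation as follows. This is a user-space model only: `struct fake_adapter`, `fake_update_stats()` and `fake_reinit()` are hypothetical stand-ins for the driver's adapter structure, `em_update_stats_counters()` and `em_init_locked()`, not the real driver code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's per-device state. */
struct fake_adapter {
    uint64_t mpc;           /* accumulated missed packet count */
    uint64_t hw_mpc_delta;  /* misses reported since the last read */
    unsigned reinit_count;  /* number of reinitialisations so far */
};

/* Stand-in for em_update_stats_counters(): fold the delta into the total. */
void fake_update_stats(struct fake_adapter *a)
{
    a->mpc += a->hw_mpc_delta;
    a->hw_mpc_delta = 0;
}

/* Stand-in for em_init_locked(): here it only counts the resets. */
void fake_reinit(struct fake_adapter *a)
{
    a->reinit_count++;
}

/*
 * One timer tick: remember the old missed packet count, refresh the
 * statistics, and reinitialise if the count grew since the last tick.
 * Returns true when a reinitialisation was triggered.
 */
bool watchdog_tick(struct fake_adapter *a)
{
    uint64_t ompc = a->mpc;

    fake_update_stats(a);
    if (a->mpc > ompc) {
        fake_reinit(a);
        return true;
    }
    return false;
}
```

Starting from mpc = 1767 with 972 new misses pending, one tick moves the counter to 2739 and triggers a reset, as in the first sample line above; a following tick with no new misses does nothing. Note that in this form even a single missed packet between ticks forces a full reinitialisation.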

I am asking the developers who work at Intel to pay attention to this problem.

For my part, I can offer any help in testing and debugging.




_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "[EMAIL PROTECTED]"

