Hiya,

I've been having trouble with two different machines (FreeBSD 8.0-p3 & FreeBSD 7.0-p5) 
using the nfe(4) network adapter.  The machines are, respectively, a Sun X2200 
(AMD64) and a Sun X2100 M2 (AMD64), and both are running the amd64 kernel.

Basically, what appears to happen is that traffic stops flowing through the 
interface, 'No buffer space available' errors are returned when trying to send 
ICMP packets (which I gather usually means the interface's send queue is full, 
i.e. the driver has stopped transmitting), and all established connections 
appear to hang.

The machines are running as packet routers, with nfe0 acting as the 'LAN' 
side.  PF is being used for filtering, NAT, BINAT and RDR.  The same PF 
configuration works correctly on two other servers using different network 
adapters; one of them is configured with pfsync & CARP, the other isn't.
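
In case it helps, the relevant parts of pf.conf look roughly like this 
(addresses, ports and interface names other than nfe0 are anonymised 
placeholders, so treat it as a sketch of the rule types rather than the 
exact ruleset):

    ext_if = "bge0"
    int_if = "nfe0"          # the interface that hangs

    # translation (placeholder addresses)
    nat on $ext_if from $int_if:network to any -> ($ext_if)
    binat on $ext_if from 172.31.3.10 to any -> 203.0.113.10
    rdr on $ext_if proto tcp from any to ($ext_if) port 80 -> 172.31.3.20

    # filtering
    block in log all
    pass quick on { lo0, pfsync0 }
    pass in on $int_if from $int_if:network to any keep state
    pass out on $ext_if keep state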

The problem seems to happen under a fairly light session load (< 100 active 
states in PF), though the more states there are, the quicker it occurs.  It may 
be related to packet rate, as adding high-bandwidth clients seems to trigger 
the problem very quickly (within several minutes).  This is reinforced by the 
fact that the problem first manifested when we upgraded one of the leased 
lines.

Executing ifconfig nfe0 down && ifconfig nfe0 up will restart traffic flow.  
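
As a stopgap I could automate that from cron with something along these lines 
(untested sketch; the ping target is just the LAN-side peer used in the ping 
test further down):

    #!/bin/sh
    # crude watchdog sketch: bounce nfe0 if the LAN peer stops answering
    if ! ping -c 3 -t 5 172.31.3.129 > /dev/null 2>&1; then
            logger "nfe0 watchdog: no reply from 172.31.3.129, cycling interface"
            ifconfig nfe0 down && ifconfig nfe0 up
    fi

Obviously I'd much rather find the underlying cause than keep kicking the 
interface.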

Neither box is very highly loaded, generally around 1.5 Mb/s.  This doesn't 
appear to be related to the amount of traffic, as I have tried re-routing 95% 
of the traffic around the server without any improvement.  The traffic profile 
is fairly random - a mix of TCP and UDP, mostly flowing OUT of nfe0.  It is all 
L3 and there are fewer than 5 hosts on the segment attached to the nfe 
interface.

The two boxes are in different locations and are connected to different models 
of Cisco switch.  Both appear to autonegotiate correctly and the switch ports 
show no status changes.

pfsync, CARP & a GRE tunnel appear to work correctly over the nfe interface 
for long periods of time (weeks or more), and it seems to be something about 
adding other traffic to the mix that results in the interface 'hanging'.
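
For completeness, that side of things is configured roughly like this in 
rc.conf (addresses, vhid and password are placeholders, so again only a 
sketch):

    pf_enable="YES"
    pfsync_enable="YES"
    pfsync_syncdev="nfe0"
    cloned_interfaces="carp0 gre0"
    ifconfig_carp0="vhid 1 pass XXXXXXXX 172.31.3.1/24"
    ifconfig_gre0="inet 10.255.0.1 10.255.0.2 tunnel 192.0.2.1 192.0.2.2 up"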

If I move the traffic from nfe to the other bge interface (the one shared with 
the LOM), everything is stable and works correctly.  I have not been able to 
reproduce this using test loads, and the interface worked correctly with iperf 
testing prior to deployment.  Unfortunately, for legal reasons, I can't provide 
a traffic trace leading up to the time it occurs, though everything in it looks 
normal to me.

The FreeBSD 7 X2100 lists the following from pciconf:
n...@pci0:0:8:0:        class=0x068000 card=0x534c108e chip=0x037310de rev=0xa3 
hdr=0x00
   vendor     = 'Nvidia Corp'
   device     = 'MCP55 Ethernet'
   class      = bridge
n...@pci0:0:9:0:        class=0x068000 card=0x534c108e chip=0x037310de rev=0xa3 
hdr=0x00
   vendor     = 'Nvidia Corp'
   device     = 'MCP55 Ethernet'
   class      = bridge

The FreeBSD 8 X2200 lists essentially the same thing (only the card ID differs):
n...@pci0:0:8:0:        class=0x068000 card=0x534b108e chip=0x037310de rev=0xa3 
hdr=0x00
   vendor     = 'Nvidia Corp'
   device     = 'MCP55 Ethernet'
   class      = bridge
n...@pci0:0:9:0:        class=0x068000 card=0x534b108e chip=0x037310de rev=0xa3 
hdr=0x00
   vendor     = 'Nvidia Corp'
   device     = 'MCP55 Ethernet'
   class      = bridge


Here are the two obvious tests (both from the FreeBSD 7 box); the ICMP 
response and the mbuf stats are very much the same on both boxes.

ping 172.31.3.129
PING 172.31.3.129 (172.31.3.129): 56 data bytes
ping: sendto: No buffer space available
ping: sendto: No buffer space available
^C

--- 172.31.3.129 ping statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss

netstat -m
852/678/1530 mbufs in use (current/cache/total)
818/448/1266/25600 mbuf clusters in use (current/cache/total/max)
817/317 mbuf+clusters out of packet secondary zone in use (current/cache)
0/362/362/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
1879K/2513K/4392K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

From the other machine, after the problem has occurred and an ifconfig down/up 
cycle has been done (i.e. when the interface is working again):
vmstat -z (mbuf zones only; the columns are size, limit, used, free, requests, failures)
mbuf_packet:              256,        0,     1033,     1783, 330792410,        0
mbuf:                     256,        0,        5,     1664, 395145472,        0
mbuf_cluster:            2048,    25600,     2818,     1690, 13234653,        0
mbuf_jumbo_page:         4096,    12800,        0,      336,   297749,        0
mbuf_jumbo_9k:           9216,     6400,        0,        0,        0,        0
mbuf_jumbo_16k:         16384,     3200,        0,        0,        0,        0
mbuf_ext_refcnt:            4,        0,        0,        0,        0,        0


Although I failed to keep a copy of the output, I don't believe there is a kmem problem.

I'm at a complete loss as to what to try next :(  

All suggestions very gratefully received!!!  The 7.0 box is live so it can't 
really be played with, but I can occasionally run tests on the other box.

Thank you :)
Mel

