On Tue, Aug 17, 2010 at 10:54:04AM -0700, Mark Morley wrote:
> On Sun, 15 Aug 2010 23:35:50 -0700 Jeremy Chadwick <free...@jdc.parodius.com>
> wrote:
> >On Thu, Aug 12, 2010 at 10:35:49AM -0700, Mark Morley wrote:
> >> I have five front end web servers that all mount their content from
> >> the same server via NFS. If I stress the link on any one of the
> >> machines (eg: copy a large directory with a lot of files to/from the
> >> mounted file system) the client will pause. That is, all processes
> >> trying to access that mount will freeze. The log files with hundreds
> >> or thousands of nfs server not responding / is alive again messages.
> >> After 60 seconds it returns to normal, unless the load is still there
> >> in which case it continues to pause.
> >>
> >> This has only started happening since I upgraded the client machines
> >> to 8.1-STABLE (previously four of them were 8.0 and one was 7.3). The
> >> server is 7.1-RELEASE-p11. No other changes have taken place in terms
> >> of hardware or software or mount options, etc.
> >>
> >> All nics involved are gigabit em cards, and they are on a private
> >> network (web access to the boxes is via an external interface).
> >
> >Are there any indications in dmesg that the NIC is responsible, e.g.
> >interface down/up, etc.?
>
> No, nothing like that.
>
> >Does switching to UDP-based NFS solve the problem for you?
>
> Trying that now for the past 24 hours or so. Four of the machine seem ok so
> far, but the fifth one has started dropping the mount entirely. Access to it
> gives an "Input / output error" message. Forcing a dismount and remounting
> brings it back.
>
> >What OS version (uname -a) and NIC are used on the NFS server?
>
> FreeBSD xxx 7.1-RELEASE-p11 FreeBSD 7.1-RELEASE-p11 #0: Wed May 26 03:20:59
> PDT 2010
> r...@xxx:/usr/obj/usr/src/sys/CUSTOM i386
>
> NICs are em
>
> >Can you please provide the following output from one of the client
> >machines running 8.1-STABLE with gigE em(4)?
> >You can X-out machine
> >names, MAC addresses, and IP addresses/netblocks if need be.
> >
> >* uname -a
>
> FreeBSD xxx 8.1-STABLE FreeBSD 8.1-STABLE #0: Tue Jul 27 16:27:44 PDT 2010
> r...@xxx:/usr/obj/usr/src/sys/CUSTOM amd64
>
> >* ifconfig emX (where X is the interface number which would be
> >  used for NFS)
>
> em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>     options=209b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,WOL_MAGIC>
>     ether 00:0e:0c:85:d5:0d
>     inet 192.168.1.30 netmask 0xffffff00 broadcast 192.168.1.255
>     media: Ethernet 1000baseT <full-duplex>
>     status: active
>
> >* netstat -idn -I emX
>
> Name  Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs Coll Drop
> em0  1500 <Link#1>      00:0e:0c:85:d5:0d 39913814     2     0 39949943     0    0    0
> em0  1500 192.168.1.0/2 192.168.1.30      39944016     -     - 39949664     -    -    -
>
> >* pciconf -lvc (provide only the data for emX please)
>
> e...@pci0:1:6:0: class=0x020000 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = 'Gigabit Ethernet Controller (Copper) rev 5 (82541PI)'
>     class      = network
>     subclass   = ethernet
>     cap 01[dc] = powerspec 2 supports D0 D3 current D0
>     cap 07[e4] = PCI-X supports 2048 burst read, 1 split transaction
>
> >* vmstat -i
>
> interrupt                          total       rate
> irq1: atkbd0                         239          0
> irq16: em0                      36746591        883
> irq18: em1                      12658607        304
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I'm ignoring em1 because em0 is the one which has the NFS traffic, and
em1 could in fact be a different model of Intel NIC (it's very common
for server vendors to include two different models of NIC on the same
board; sure, both em(4), but different models), so I'm staying focused
on em0.

The interrupt rate here looks quite high for a system that may not be
doing anything (I don't know for sure).

Can you provide output from "netstat -I em0 -n -b 1" and let it run for
about 60 seconds? This should be done both when NFS is UDP-only, as
well as when NFS is TCP-only. I'm curious what kind of network
throughput you're seeing (in an attempt to correlate it with high
interrupt rates). If network I/O is very low yet the interrupt rate is
very high, the problem may be a driver bug or something with PCI
configuration/initialisation.

I'm also CC'ing Jack Vogel of Intel, who may have some insight into
what's going on here.

> irq21: ohci0                           2          0
> irq22: ehci0                      528002         12
> irq23: atapci1                   2334936         56
> cpu0: timer                     83207296       2000
> cpu1: timer                     83207289       2000
> Total                          218682962       5256
>
> >* sysctl hw.pci
>
> hw.pci.usb_early_takeover: 1
> hw.pci.honor_msi_blacklist: 1
> hw.pci.enable_msix: 1
> hw.pci.enable_msi: 1
> hw.pci.do_power_resume: 1
> hw.pci.do_power_nodriver: 0
> hw.pci.enable_io_modes: 1
> hw.pci.default_vgapci_unit: -1
> hw.pci.host_mem_start: 2147483648
> hw.pci.mcfg: 1
>
> >* As root, run "sysctl dev.em.X.stats=1" then do "dmesg" and
> >  provide the output for NIC statistics (will start with "emX:")
>
> em0: Excessive collisions = 0
> em0: Sequence errors = 0
> em0: Defer count = 52
> em0: Missed Packets = 0
> em0: Receive No Buffers = 0
> em0: Receive Length Errors = 0
> em0: Receive errors = 1
> em0: Crc errors = 1
> em0: Alignment errors = 0
> em0: Collision/Carrier extension errors = 0
> em0: RX overruns = 0
> em0: watchdog timeouts = 0
> em0: RX MSIX IRQ = 0 TX MSIX IRQ = 0 LINK MSIX IRQ = 0
> em0: XON Rcvd = 54
> em0: XON Xmtd = 0
> em0: XOFF Rcvd = 54
> em0: XOFF Xmtd = 0
> em0: Good Packets Rcvd = 39915088
> em0: Good Packets Xmtd = 39951839

-- 
| Jeremy Chadwick                                  j...@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
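
A rough cross-check of the em0 figures quoted above, as a minimal sketch only. It divides the cumulative packet counters from "netstat -idn -I emX" by the cumulative interrupt total from "vmstat -i" to estimate how many packets were handled per interrupt. The function and variable names are hypothetical. It assumes both counters cover the same uptime, and the vmstat rate is an average since boot, so this does not replace the per-second "netstat -I em0 -n -b 1" samples requested above, and it says nothing by itself about what ratio is healthy for this NIC.

```python
def packets_per_interrupt(ipkts, opkts, interrupts):
    """Average number of packets (in + out) handled per interrupt."""
    if interrupts <= 0:
        raise ValueError("interrupt count must be positive")
    return (ipkts + opkts) / interrupts


# Cumulative em0 counters as quoted in this thread.
ipkts = 39913814       # netstat: Ipkts on the <Link#1> row
opkts = 39949943       # netstat: Opkts on the <Link#1> row
irq_total = 36746591   # vmstat -i: "irq16: em0" total
irq_rate = 883         # vmstat -i: "irq16: em0" rate (average since boot)

ratio = packets_per_interrupt(ipkts, opkts, irq_total)

# Implied average packet rate over the same period, assuming the two
# tools were run close enough together for the counters to be comparable.
avg_pps = ratio * irq_rate

print(f"packets per interrupt: {ratio:.2f}")
print(f"implied average packets/sec since boot: {avg_pps:.0f}")
```

The same function can be fed the deltas between two consecutive samples taken under load instead of the since-boot totals, which is closer to the throughput-versus-interrupt-rate comparison asked for above.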