On Thu, Jul 1, 2010 at 1:18 PM, alan bryan <alan.br...@yahoo.com> wrote:
>
> --- On Thu, 7/1/10, Garrett Cooper <yanef...@gmail.com> wrote:
>
>> From: Garrett Cooper <yanef...@gmail.com>
>> Subject: Re: NFS 75 second stall
>> To: "alan bryan" <alan.br...@yahoo.com>
>> Cc: freebsd-stable@freebsd.org
>> Date: Thursday, July 1, 2010, 12:23 PM
>>
>> On Thu, Jul 1, 2010 at 11:51 AM, alan bryan <alan.br...@yahoo.com> wrote:
>> >
>> > --- On Thu, 7/1/10, Garrett Cooper <yanef...@gmail.com> wrote:
>> >
>> >> From: Garrett Cooper <yanef...@gmail.com>
>> >> Subject: Re: NFS 75 second stall
>> >> To: "alan bryan" <alan.br...@yahoo.com>
>> >> Cc: freebsd-stable@freebsd.org
>> >> Date: Thursday, July 1, 2010, 11:13 AM
>> >>
>> >> On Thu, Jul 1, 2010 at 11:01 AM, alan bryan <alan.br...@yahoo.com> wrote:
>> >> > Setup:
>> >> >
>> >> > server - FreeBSD 8-stable from today. 2 UFS dirs exported via NFS.
>> >> > client - FreeBSD 8.0-Release. Running a test php script that copies around various files to/from 2 separate NFS mounts.
>> >> >
>> >> > Situation:
>> >> >
>> >> > script is started (forked to do 20 simultaneous runs) and 20 1GB files are copied to the NFS dir which works fine. When it then switches to reading those files back and simultaneously writing to the other NFS mount I see a hang of 75 seconds. If I do an "ls -l" on the NFS mount it hangs too. After 75 seconds the client has reported:
>> >> >
>> >> > nfs server 192.168.10.133:/usr/local/export1: not responding
>> >> > nfs server 192.168.10.133:/usr/local/export1: is alive again
>> >> > nfs server 192.168.10.133:/usr/local/export1: not responding
>> >> > nfs server 192.168.10.133:/usr/local/export1: is alive again
>> >> >
>> >> > and then things start working again. The server was originally FreeBSD 8.0-Release also but was upgraded to the latest stable to see if this issue could be avoided.
>> >> >
>> >> > # nfsstat -s -W -w 1
>> >> >  GtAttr  Lookup  Rdlink    Read   Write  Rename  Access   Rddir
>> >> >       0       0       0     222     257       0       0       0
>> >> >       0       0       0     178     135       0       0       0
>> >> >       0       0       0      85     127       0       0       0
>> >> >       0       0       0       0       0       0       0       0
>> >> >       0       0       0       0       0       0       0       0
>> >> >       0       0       0       0       0       0       0       0
>> >> >       0       0       0       0       0       0       0       0
>> >> >       0       0       0       0       0       0       0       0
>> >> >
>> >> > ... for 75 rows of all zeros
>> >> >
>> >> >       0       0       0     272     266       0       0       0
>> >> >       0       0       0     167     165       0       0       0
>> >> >
>> >> > I also tried runs with 15 simultaneous processes and 25. 15 processes gave only about a 5 second stall but 25 gave again the same 75 second stall.
>> >> >
>> >> > Further, I tested with 2 mounts to the same server but from ZFS filesytems with the exact same stall/timeout periods. So, it doesn't appear to matter what the underlying filesystem is - it's something in NFS or networking code.
>> >> >
>> >> > Any ideas on what's going on here? What's causing the complete stall period of zero NFS activity? Any flaws with my testing methods?
>> >> >
>> >> > Thanks for any and all help/ideas.
>> >>
>> >> What network driver are you using? Have you tried tcpdumping the packets?
>> >> -Garrett
>> >
>> > I'm using igb currently but have also used em. I have not tried tcpdumping the packets yet on this test. Any suggestions on things to look out for (I'm not that familiar with that whole process).
>> >
>> > Which brings up another point - I'm using TCP connections for NFS, not UDP.
>>
>> Is the net.inet.tcp.tso sysctl enabled or not? What about rxcsum and txcsum?
>> Thanks,
>> -Garrett
>
> I haven't intentionally/explicitly set any of this so it's "default":
>
> # sysctl net.inet.tcp.tso
> net.inet.tcp.tso: 1
>
> igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>         options=13b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,TSO4>
>         ether 00:30:48:c3:26:94
>         inet 192.168.10.133 netmask 0xffffff00 broadcast 192.168.10.255
>         media: Ethernet autoselect (1000baseT <full-duplex>)
>         status: active

Work out all of the combinations you need to test. There are three on/off settings (TXCSUM, RXCSUM, TSO), so 2^3 = 8 combinations; you've already tested one (the defaults, with everything on), which leaves 7. Example:

    TXCSUM=off, RXCSUM=on, TSO=on
    TXCSUM=on, RXCSUM=off, TSO=on
    TXCSUM=on, RXCSUM=off, TSO=off
    ...

Try the combinations on the client first while keeping the server constant, then keep the client constant and vary the server, and finally vary both the server and the client. Be sure to take measurements for each combination so you can check that the results make functional sense.

The reason I'm suggesting this is that there were issues with em(4) (and, I think, igb(4) too, since it uses common code) involving various hardware offload bits on 8.0-RELEASE. IIRC disabling txcsum did the trick, but you may have to do more than that to get things to work.

Here's a similar thread with a different driver, just to illustrate the thought process used to determine the source of the failure:

http://lists.freebsd.org/pipermail/freebsd-current/2009-June/008264.html

Thanks,
-Garrett
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
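
For what it's worth, here is a rough sketch of how the offload combinations discussed above could be enumerated. It is a dry run: it only prints ifconfig(8) commands and changes nothing. The function name list_offload_combos and the IFACE variable are made up for illustration, and igb0 is assumed only because that is the interface shown in the thread. It also assumes TSO is toggled per interface with ifconfig's tso/-tso rather than through the global net.inet.tcp.tso sysctl.

```sh
#!/bin/sh
# Dry run: print (do not execute) one ifconfig command for every on/off
# combination of TXCSUM, RXCSUM and TSO on a single interface.
# list_offload_combos and IFACE are hypothetical names for this sketch.

list_offload_combos() {
    iface="$1"
    for tx in txcsum -txcsum; do
        for rx in rxcsum -rxcsum; do
            for tso in tso -tso; do
                # In ifconfig(8) a leading "-" disables the capability.
                echo "ifconfig $iface $tx $rx $tso"
            done
        done
    done
}

# Defaults to igb0 unless IFACE is set in the environment.
list_offload_combos "${IFACE:-igb0}"
```

The first line printed is the all-on case, which matches the defaults already tested; the other seven are the ones still to try. Each printed command would have to be run as root on the machine under test, followed by a rerun of the copy test and a note of the stall time.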