Re: if_run in hostap mode: issue with stations in the power save mode

2011-02-08 Thread Alexander Zagrebin
Hi!

On 07.02.2011 09:11:02 +0100, Bernhard Schmidt wrote:

> For example, if you call 'ifconfig wlan0 ssid ' the new SSID is
> passed over using an IOCTL. It would be interesting to know which functions
> in net80211 are called regarding beacon updates and which of those call
> into the run driver. Ultimately it's about figuring out whether special
> handling for such cases is required and, if so, how to do it.

I've added debug output on allocation, modification and deallocation of a
beacon to if_run.c and tried to change the SSID while net.wlan.0.debug is -1.
Here are the log contents:

kernel: wlan0: ieee80211_init
kernel: wlan0: start running, 1 vaps running
kernel: wlan0: ieee80211_new_state_locked: RUN -> SCAN (nrunning 0 nscanning 0)
kernel: wlan0: ieee80211_newstate_cb: RUN -> INIT arg 0
kernel: wlan0: hostap_newstate: RUN -> INIT (0)
kernel: wlan0: node_reclaim: remove 0xff8003bd7000<00:14:d1:a8:66:1d> from station table, refcnt 1
kernel: wlan0: ieee80211_alloc_node 0xff8004eae000<00:14:d1:a8:66:1d> in station table
kernel: wlan0: [00:14:d1:a8:66:1d] ieee80211_alloc_node: inact_reload 2
kernel: wlan0: ieee80211_newstate_cb: INIT -> SCAN arg 0
kernel: wlan0: hostap_newstate: INIT -> SCAN (0)
kernel: wlan0: ieee80211_create_ibss: creating HOSTAP on channel 6
kernel: wlan0: ieee80211_alloc_node 0xff8003bd7000<00:14:d1:a8:66:1d> in station table
kernel:
kernel: wlan0: [00:14:d1:a8:66:1d] ieee80211_alloc_node: inact_reload 2
kernel: wlan0: set WME_AC_BE (chan) [acm 0 aifsn 3 logcwmin 4 logcwmax 6 txop 0]
kernel: wlan0: set WME_AC_BE (bss ) [acm 0 aifsn 3 logcwmin 4 logcwmax 10 txop 0]
kernel: wlan0: set WME_AC_BK (chan) [acm 0 aifsn 7 logcwmin 4 logcwmax 10 txop 0]
kernel: wlan0: set WME_AC_BK (bss ) [acm 0 aifsn 7 logcwmin 4 logcwmax 10 txop 0]
kernel: wlan0: set WME_AC_VI (chan) [acm 0 aifsn 1 logcwmin 3 logcwmax 4 txop 94]
kernel: wlan0: set WME_AC_VI (bss ) [acm 0 aifsn 2 logcwmin 3 logcwmax 4 txop 94]
kernel: wlan0: set WME_AC_VO (chan) [acm 0 aifsn 1 logcwmin 2 logcwmax 3 txop 47]
kernel: wlan0: set WME_AC_VO (bss ) [acm 0 aifsn 2 logcwmin 2 logcwmax 3 txop 47]
kernel: wlan0: ieee80211_wme_updateparams_locked: WME params updated, cap_info 0x6
kernel: wlan0: ieee80211_new_state_locked: SCAN -> RUN (nrunning 0 nscanning 0)
kernel: wlan0: ieee80211_newstate_cb: SCAN -> RUN arg -1
kernel: run0: run_update_beacon_cb: updating beacon
kernel: wlan0: ieee80211_beacon_update: traffic 0, enable aggressive mode
kernel: wlan0: update WME_AC_BE (chan+bss) [acm 0 aifsn 2 logcwmin 4 logcwmax 10 txop 0]
kernel: wlan0: update WME_AC_BE (chan+bss) logcwmin 3
kernel: wlan0: ieee80211_wme_updateparams_locked: WME params updated, cap_info 0x7
kernel: wlan0: hostap_newstate: SCAN -> RUN (-1)
kernel: wlan0: synchronized with 00:14:d1:a8:66:1d ssid "test" channel 6 start 0Mb
kernel: wlan0: [00:14:d1:a8:66:1d] ieee80211_node_authorize: inact_reload 20

As you can see, run_update_beacon_cb() is invoked, but by that time the
beacon is already allocated. Because the beacon is allocated,
run_update_beacon_cb() calls ieee80211_beacon_update(). As we know,
ieee80211_beacon_update() doesn't update the SSID, so the SSID in the beacon
remains unchanged.
Nevertheless, changing or hiding/unhiding the SSID seems to work.
This can be explained: the station uses an active scan.
ieee80211_send_proberesp()/ieee80211_alloc_proberesp() returns a frame
containing the updated SSID, but the AP continues to broadcast a beacon with
the outdated data.
A possible solution is to deallocate the beacon on a state change.
I've decided to deallocate the beacon on the transition to the RUN state.
The additional patch is attached.
I'll do additional tests later today...
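
To make the stale-beacon effect easier to see, here is a small userland sketch of the caching pattern described above. It is only a model under stated assumptions: struct beacon_cache, beacon_build(), beacon_get() and beacon_invalidate() are hypothetical names invented for illustration, a plain string stands in for the beacon mbuf, and none of this is code from if_run.c or net80211.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-vap state; not the real struct run_vap. */
struct beacon_cache {
	char *frame;	/* cached beacon contents, NULL until built */
};

/* Build a fresh frame from the current SSID (plays the role of a beacon allocation). */
static char *
beacon_build(const char *ssid)
{
	size_t len = strlen(ssid) + 1;
	char *frame = malloc(len);

	if (frame != NULL)
		memcpy(frame, ssid, len);
	return (frame);
}

/* Update path: allocates only when nothing is cached, so a cached frame keeps its old SSID. */
static const char *
beacon_get(struct beacon_cache *bc, const char *ssid)
{
	assert(bc != NULL && ssid != NULL);
	if (bc->frame == NULL)
		bc->frame = beacon_build(ssid);
	return (bc->frame);
}

/* State-change path: drop the cached frame so the next update rebuilds it. */
static void
beacon_invalidate(struct beacon_cache *bc)
{
	assert(bc != NULL);
	free(bc->frame);	/* free(NULL) is a no-op */
	bc->frame = NULL;
}
```

In this model, calling beacon_get() with a new SSID keeps returning the old contents until beacon_invalidate() has run, which mirrors why freeing the beacon on the transition to RUN forces a rebuild with the new SSID.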

-- 
Alexander Zagrebin
--- /sys/dev/usb/wlan/if_run.c.orig	2011-02-08 09:52:18.994743647 +0300
+++ /sys/dev/usb/wlan/if_run.c		2011-02-08 11:04:17.114484851 +0300
@@ -1793,6 +1793,12 @@ run_newstate(struct ieee80211vap *vap, e
 			sc->runbmap |= bid;
 		}
 
+		if (rvp->beacon_mbuf) {
+			m_freem(rvp->beacon_mbuf);
+			rvp->beacon_mbuf = NULL;
+		}
+
 		switch (vap->iv_opmode) {
 		case IEEE80211_M_HOSTAP:
 		case IEEE80211_M_MBSS:
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: Fwd: igb driver RX (was TX) hangs when out of mbuf clusters

2011-02-08 Thread Lev Serebryakov
Hello, Karim.
You wrote on 8 February 2011, 6:29:53:

> Precisely, the exact same behavior happens (RX hang) if options
> DEVICE_POLLING is _not_ used in the kernel configuration file. I tried with
> POLLING since someone mentioned that it helped in a case mentioned earlier
> today. Unfortunately, igb yields the same RX ring filling problem with or
> without polling.
  In my case (em(4), not igb(4), but the symptoms are VERY similar) the POLLING
options (both the kernel option AND "ifconfig em0 polling") lead to
resets (which drop all connections!) AFTER kernel messages like these:

em0: Watchdog timeout -- resetting
em0: Queue(0) tdh = 1302, hw tdt = 1265
em0: TX(0) desc avail = 31,Next TX to Clean = 1296

-- 
// Black Lion AKA Lev Serebryakov 



ipfw, ipv6 and gif(4)

2011-02-08 Thread Eugene M. Zheganin

 Hi.

I'm running FreeBSD 8.1-STABLE (I had major issues with em(4) on 
8.1-RELEASE, so I had to upgrade this host to more recent STABLE).


I'm using ipv6-over-ipv4 tunnel.

gif0: flags=8051 metric 0 mtu 1280
        tunnel inet 89.250.210.67 --> 216.66.80.26
        inet6 2001:470:1f08:14c0::2 --> 2001:470:1f08:14c0::1 prefixlen 128
        nd6 options=3
        options=1

For it to work, I have to allow IPv4 packets between these two hosts:

(these are the first two rules in the filter)
5  14   1072 allow log ip4 from 89.250.210.67 to 216.66.80.26 out via vlan104
6  14   1072 allow log ip4 from 216.66.80.26 to 89.250.210.67 in via vlan104


The thing is, normally (at least in the IPv4 world) I would have to allow
ipencap packets between these hosts (and that's what I did first),
but this configuration never worked. I even added 'allow' rules for
every type of encapsulation from /etc/protocols, only to see that their
counters never moved from zero. The two rules above were added after I
decided 'ok, let's allow everything just to see in the log what it wants'.


So my question is: why ip4?

And the log looks even more weird:

%ping6 2001:470:1f08:14c0::1
PING6(56=40+8+8 bytes) 2001:470:1f08:14c0::2 --> 2001:470:1f08:14c0::1
16 bytes from 2001:470:1f08:14c0::1, icmp_seq=0 hlim=64 time=93.917 ms
16 bytes from 2001:470:1f08:14c0::1, icmp_seq=1 hlim=64 time=93.307 ms

Feb  8 13:56:48 ns kernel: ipfw: 5 Accept P:41 89.250.210.67 216.66.80.26 out via vlan104
Feb  8 13:56:48 ns kernel: ipfw: 6 Accept P:41 216.66.80.26 89.250.210.67 in via vlan104
Feb  8 13:56:49 ns kernel: ipfw: 5 Accept P:41 89.250.210.67 216.66.80.26 out via vlan104
Feb  8 13:56:49 ns kernel: ipfw: 6 Accept P:41 216.66.80.26 89.250.210.67 in via vlan104


As you can see, P:41 is IPv6:

%grep 41 /etc/protocols
ipv6    41      IPV6            # ipv6

And, of course, ipfw doesn't allow me to create the rules it is actually 
logging:


%ipfw add 7 allow 41 from 216.66.80.26 to 89.250.210.67 in via vlan104
ipfw: bad address "216.66.80.26"

Do I misunderstand the concept, or is this how it is really supposed to look?

Thanks.
Eugene.


Re: if_run in hostap mode: issue with stations in the power save mode

2011-02-08 Thread Bernhard Schmidt
On Tuesday, February 08, 2011 09:24:29 Alexander Zagrebin wrote:
> Hi!
> 
> On 07.02.2011 09:11:02 +0100, Bernhard Schmidt wrote:
> > For example, if you call 'ifconfig wlan0 ssid ' the new
> > SSID is passed over using an IOCTL. It would be interesting to know
> > which functions in net80211 are called regarding beacon updates and
> > which of those call into the run driver. Ultimately it's about
> > figuring out whether special handling for such cases is required and,
> > if so, how to do it.
> 
> I've added debug output on allocation, modification and deallocation of
> a beacon to if_run.c and tried to change the SSID while
> net.wlan.0.debug is -1. Here are the log contents:
> 
> kernel: wlan0: ieee80211_init
> kernel: wlan0: start running, 1 vaps running
> kernel: wlan0: ieee80211_new_state_locked: RUN -> SCAN (nrunning 0 nscanning 0)
> kernel: wlan0: ieee80211_newstate_cb: RUN -> INIT arg 0
> kernel: wlan0: hostap_newstate: RUN -> INIT (0)
> kernel: wlan0: node_reclaim: remove 0xff8003bd7000<00:14:d1:a8:66:1d> from station table, refcnt 1
> kernel: wlan0: ieee80211_alloc_node 0xff8004eae000<00:14:d1:a8:66:1d> in station table
> kernel: wlan0: [00:14:d1:a8:66:1d] ieee80211_alloc_node: inact_reload 2
> kernel: wlan0: ieee80211_newstate_cb: INIT -> SCAN arg 0
> kernel: wlan0: hostap_newstate: INIT -> SCAN (0)
> kernel: wlan0: ieee80211_create_ibss: creating HOSTAP on channel 6
> kernel: wlan0: ieee80211_alloc_node 0xff8003bd7000<00:14:d1:a8:66:1d> in station table
> kernel:
> kernel: wlan0: [00:14:d1:a8:66:1d] ieee80211_alloc_node: inact_reload 2
> kernel: wlan0: set WME_AC_BE (chan) [acm 0 aifsn 3 logcwmin 4 logcwmax 6 txop 0]
> kernel: wlan0: set WME_AC_BE (bss ) [acm 0 aifsn 3 logcwmin 4 logcwmax 10 txop 0]
> kernel: wlan0: set WME_AC_BK (chan) [acm 0 aifsn 7 logcwmin 4 logcwmax 10 txop 0]
> kernel: wlan0: set WME_AC_BK (bss ) [acm 0 aifsn 7 logcwmin 4 logcwmax 10 txop 0]
> kernel: wlan0: set WME_AC_VI (chan) [acm 0 aifsn 1 logcwmin 3 logcwmax 4 txop 94]
> kernel: wlan0: set WME_AC_VI (bss ) [acm 0 aifsn 2 logcwmin 3 logcwmax 4 txop 94]
> kernel: wlan0: set WME_AC_VO (chan) [acm 0 aifsn 1 logcwmin 2 logcwmax 3 txop 47]
> kernel: wlan0: set WME_AC_VO (bss ) [acm 0 aifsn 2 logcwmin 2 logcwmax 3 txop 47]
> kernel: wlan0: ieee80211_wme_updateparams_locked: WME params updated, cap_info 0x6
> kernel: wlan0: ieee80211_new_state_locked: SCAN -> RUN (nrunning 0 nscanning 0)
> kernel: wlan0: ieee80211_newstate_cb: SCAN -> RUN arg -1
> kernel: run0: run_update_beacon_cb: updating beacon
> kernel: wlan0: ieee80211_beacon_update: traffic 0, enable aggressive mode
> kernel: wlan0: update WME_AC_BE (chan+bss) [acm 0 aifsn 2 logcwmin 4 logcwmax 10 txop 0]
> kernel: wlan0: update WME_AC_BE (chan+bss) logcwmin 3
> kernel: wlan0: ieee80211_wme_updateparams_locked: WME params updated, cap_info 0x7
> kernel: wlan0: hostap_newstate: SCAN -> RUN (-1)
> kernel: wlan0: synchronized with 00:14:d1:a8:66:1d ssid "test" channel 6 start 0Mb
> kernel: wlan0: [00:14:d1:a8:66:1d] ieee80211_node_authorize: inact_reload 20
> 
> As you can see, run_update_beacon_cb() is invoked, but by that time
> the beacon is already allocated. Because the beacon is allocated,
> run_update_beacon_cb() calls ieee80211_beacon_update(). As we
> know, ieee80211_beacon_update() doesn't update the SSID, so the
> SSID in the beacon remains unchanged.
> Nevertheless, changing or hiding/unhiding the SSID seems to
> work. This can be explained: the station uses an active scan.
> ieee80211_send_proberesp()/ieee80211_alloc_proberesp() returns
> a frame containing the updated SSID, but the AP continues to broadcast
> a beacon with the outdated data.
> A possible solution is to deallocate the beacon on a state change.
> I've decided to deallocate the beacon on the transition to the RUN state.
> The additional patch is attached.
> I'll do additional tests later today...

Thank you. That's what I expected, actually: when we go through
state changes (RUN -> ... -> RUN), net80211 expects us to throw away most
of the state we hold. This seems to be the safest solution. When the
beacon mbuf is thrown away completely and recreated from scratch, we can be
absolutely sure we have handled all cases.

-- 
Bernhard


Re: if_run in hostap mode: issue with stations in the power save mode

2011-02-08 Thread Bernhard Schmidt
On Tuesday, February 08, 2011 02:18:30 PseudoCylon wrote:
> - Original Message 
> 
> > From: Bernhard Schmidt 
> > To: PseudoCylon 
> > Cc: Alexander Zagrebin ; freebsd-net@freebsd.org
> > Sent: Sun, February 6, 2011 3:42:43 AM
> > Subject: Re: if_run in hostap mode: issue with stations in the
> > power save mode
> 
> Afaik iwn(4) doesn't use PS, never got  around implementing that.
> 
> > > I'd like to move ieee80211_beacon_alloc()  into iv_vap_alloc().
> > > Then we don't need to test beacon_mbuf == NULL in 
> > > run_update_beacon_cb(), and there is already switch we can use
> > > for  conditionally alloc mem.
> > 
> > Sounds fine with me.
> 
> Oops, there is a switch before the vap malloc. The test is still
> in run_update_beacon_cb().
> 
> > Can  I talk you into integrating that into Alexander's patch?
> 
> The patch is attached (diff against HEAD). It is a bit long only because
> there are a couple of new callback functions to avoid a LOR.

Thank you!

I've combined both patches (see attachment). If I get an ACK from both
of you, I'll try to get this into the tree ASAP.

-- 
Bernhard
Index: sys/dev/usb/wlan/if_runvar.h
===
--- sys/dev/usb/wlan/if_runvar.h	(revision 218367)
+++ sys/dev/usb/wlan/if_runvar.h	(working copy)
@@ -121,6 +121,7 @@ struct run_cmdq {
 struct run_vap {
 	struct ieee80211vap vap;
 	struct ieee80211_beacon_offsets bo;
+	struct mbuf			*beacon_mbuf;
 
 	int (*newstate)(struct ieee80211vap *,
 enum ieee80211_state, int);
Index: sys/dev/usb/wlan/if_run.c
===
--- sys/dev/usb/wlan/if_run.c	(revision 218367)
+++ sys/dev/usb/wlan/if_run.c	(working copy)
@@ -388,6 +388,7 @@ static void	run_scan_end(struct ieee80211com *);
 static void	run_update_beacon(struct ieee80211vap *, int);
 static void	run_update_beacon_cb(void *);
 static void	run_updateprot(struct ieee80211com *);
+static void	run_updateprot_cb(void *);
 static void	run_usb_timeout_cb(void *);
 static void	run_reset_livelock(struct run_softc *);
 static void	run_enable_tsf_sync(struct run_softc *);
@@ -398,6 +399,7 @@ static void	run_set_leds(struct run_softc *, uint1
 static void	run_set_bssid(struct run_softc *, const uint8_t *);
 static void	run_set_macaddr(struct run_softc *, const uint8_t *);
 static void	run_updateslot(struct ifnet *);
+static void	run_updateslot_cb(void *);
 static void	run_update_mcast(struct ifnet *);
 static int8_t	run_rssi2dbm(struct run_softc *, uint8_t, uint8_t);
 static void	run_update_promisc_locked(struct ifnet *);
@@ -674,7 +676,7 @@ run_attach(device_t self)
 	ic->ic_set_channel = run_set_channel;
 	ic->ic_node_alloc = run_node_alloc;
 	ic->ic_newassoc = run_newassoc;
-	//ic->ic_updateslot = run_updateslot;
+	ic->ic_updateslot = run_updateslot;
 	ic->ic_update_mcast = run_update_mcast;
 	ic->ic_wme.wme_update = run_wme_update;
 	ic->ic_raw_xmit = run_raw_xmit;
@@ -856,6 +858,9 @@ run_vap_delete(struct ieee80211vap *vap)
 
 	RUN_LOCK(sc);
 
+	m_freem(rvp->beacon_mbuf);
+	rvp->beacon_mbuf = NULL;
+
 	rvp_id = rvp->rvp_id;
 	sc->ratectl_run &= ~(1 << rvp_id);
 	sc->rvp_bmap &= ~(1 << rvp_id);
@@ -1790,6 +1795,9 @@ run_newstate(struct ieee80211vap *vap, enum ieee80
 			sc->runbmap |= bid;
 		}
 
+		m_freem(rvp->beacon_mbuf);
+		rvp->beacon_mbuf = NULL;
+
 		switch (vap->iv_opmode) {
 		case IEEE80211_M_HOSTAP:
 		case IEEE80211_M_MBSS:
@@ -3901,8 +3909,29 @@ run_update_beacon(struct ieee80211vap *vap, int it
 {
 	struct ieee80211com *ic = vap->iv_ic;
 	struct run_softc *sc = ic->ic_ifp->if_softc;
+	struct run_vap *rvp = RUN_VAP(vap);
+	int mcast = 0;
 	uint32_t i;
 
+	KASSERT(vap != NULL, ("no beacon"));
+
+	switch (item) {
+	case IEEE80211_BEACON_ERP:
+		run_updateslot(ic->ic_ifp);
+		break;
+	case IEEE80211_BEACON_HTINFO:
+		run_updateprot(ic);
+		break;
+	case IEEE80211_BEACON_TIM:
+		mcast = 1;	/*TODO*/
+		break;
+	default:
+		break;
+	}
+
+	setbit(rvp->bo.bo_flags, item);
+	ieee80211_beacon_update(vap->iv_bss, &rvp->bo, rvp->beacon_mbuf, mcast);
+
 	i = RUN_CMDQ_GET(&sc->cmdq_store);
 	DPRINTF("cmdq_store=%d\n", i);
 	sc->cmdq[i].func = run_update_beacon_cb;
@@ -3916,6 +3945,7 @@ static void
 run_update_beacon_cb(void *arg)
 {
 	struct ieee80211vap *vap = arg;
+	struct run_vap *rvp = RUN_VAP(vap);
 	struct ieee80211com *ic = vap->iv_ic;
 	struct run_softc *sc = ic->ic_ifp->if_softc;
 	struct rt2860_txwi txwi;
@@ -3925,8 +3955,17 @@ run_update_beacon_cb(void *arg)
 	if (vap->iv_bss->ni_chan == IEEE80211_CHAN_ANYC)
 		return;
 
-	if ((m = ieee80211_beacon_alloc(vap->iv_bss, &RUN_VAP(vap)->bo)) == NULL)
-	return;
+	/*
+	 * No need to call ieee80211_beacon_update(), run_update_beacon()
+	 * is taking care of appropriate calls.
+	 */
+	if (rvp->beacon_mbuf == NULL) {
+		rvp->beacon_mbuf = ieee80211_beacon_alloc(vap->iv_bss,
+		&rvp->bo);
+		if (rvp->beacon_mbuf == NULL)
+			return;
+	}
+	m = rv

Re: igb driver RX (was TX) hangs when out of mbuf clusters

2011-02-08 Thread Michael Tüxen
On Feb 8, 2011, at 10:10 AM, Lev Serebryakov wrote:

> Hello, Karim.
> You wrote on 8 February 2011, 6:29:53:
> 
>> Precisely, the exact same behavior happens (RX hang) if options
>> DEVICE_POLLING is _not_ used in the kernel configuration file. I tried with
>> POLLING since someone mentioned that it helped in a case mentioned earlier
>> today. Unfortunately, igb yields the same RX ring filling problem with or
>> without polling.
>  In my case (em(4), not igb(4), but the symptoms are VERY similar) the POLLING
> options (both the kernel option AND "ifconfig em0 polling") lead to
> resets (which drop all connections!) AFTER kernel messages like these:
> 
> em0: Watchdog timeout -- resetting
> em0: Queue(0) tdh = 1302, hw tdt = 1265
> em0: TX(0) desc avail = 31,Next TX to Clean = 1296
Can you apply the attached patch and report what
the output for rx_nxt_refresh and rx_nxt_check is?

Best regards
Michael



patch
Description: Binary data

> 
> -- 
> // Black Lion AKA Lev Serebryakov 
> 
> 


Re: igb driver RX (was TX) hangs when out of mbuf clusters

2011-02-08 Thread Michael Tüxen
On Feb 8, 2011, at 4:29 AM, Karim Fodil-Lemelin wrote:

> 2011/2/7 Pyun YongHyeon 
> 
>> On Mon, Feb 07, 2011 at 09:21:45PM -0500, Karim Fodil-Lemelin wrote:
>>> 2011/2/7 Pyun YongHyeon 
>>> 
 On Mon, Feb 07, 2011 at 05:33:47PM -0500, Karim Fodil-Lemelin wrote:
> Subject: Re: igb driver tx hangs when out of mbuf clusters
> 
>> To: Lev Serebryakov 
>> Cc: freebsd-net@freebsd.org
>> 
>> 
>> 2011/2/7 Lev Serebryakov 
>> 
>> Hello, Karim.
>>> You wrote on 7 February 2011, 19:58:04:
>>> 
>>> 
 The issue is with the igb driver from 7.4 RC3 r218406. If the
>> driver
>>> runs
 out of mbuf clusters it simply stops receiving even after the
 clusters
>>> have
 been freed.
>>>  It looks like my problems with em0 (see thread "em0 hangs
>> without
>>> any messages like "Watchdog timeout", only down/up reset it.")...
>>> Codebase for em and igb is somewhat common...
>>> 
>>> --
>>> // Black Lion AKA Lev Serebryakov 
>>> 
>>> I agree.
>> 
>> Do you get missed packets in mac_stats (sysctl dev.em | grep
>> missed)?
>> 
>> I might not have mentioned but I can also 'fix' the problem by
>> doing
>> ifconfig igb0 down/up.
>> 
>> I will try using POLLING to 'automatize' the reset as you mentioned
>> in
 your
>> thread.
>> 
>> Karim.
>> 
>> 
> Follow up on tests with POLLING: The problem is still occurring
>> although
 it
> takes more time ... Outputs of sysctl dev.igb0 and netstat -m will
 follow:
> 
> 9219/99426/108645 mbufs in use (current/cache/total)
> 9217/90783/10/10 mbuf clusters in use
>> (current/cache/total/max)
 
 Do you see network processes are stuck in keglim state? If you see
 that I think that's not trivial to solve. You wouldn't even kill
 that process if it is under keglim state unless some more mbuf
 clusters are freed from other places.
 
>>> 
>>> No keglim state, here is a snapshot of top -SH while the problem is
>>> happening:
>>> 
>>>   12 root  171 ki31 0K 8K CPU5   5  19:27 100.00% idle: cpu5
>>>   10 root  171 ki31 0K 8K CPU7   7  19:26 100.00% idle: cpu7
>>>   14 root  171 ki31 0K 8K CPU3   3  19:25 100.00% idle: cpu3
>>>   11 root  171 ki31 0K 8K CPU6   6  19:25 100.00% idle: cpu6
>>>   13 root  171 ki31 0K 8K CPU4   4  19:24 100.00% idle: cpu4
>>>   15 root  171 ki31 0K 8K CPU2   2  19:22 100.00% idle: cpu2
>>>   16 root  171 ki31 0K 8K CPU1   1  19:18 100.00% idle: cpu1
>>>   17 root  171 ki31 0K 8K RUN0  19:12 100.00% idle: cpu0
>>>   18 root  -32- 0K 8K WAIT   6   0:04  0.10% swi4: clock s
>>>   20 root  -44- 0K 8K WAIT   4   0:08  0.00% swi1: net
>>>   29 root  -68- 0K 8K -  0   0:02  0.00% igb0 que
>>>   35 root  -68- 0K 8K -  2   0:02  0.00% em1 taskq
>>>   28 root  -68- 0K 8K WAIT   5   0:01  0.00% irq256: igb0
>>> keep in mind that num_queues has been forced to 1.
>>> 
>>> 
 
 I think both igb(4) and em(4) pass received frame to upper stack
 before allocating new RX buffer. If driver fails to allocate new RX
 buffer driver will try to refill RX buffers in next run. Under
 extreme resource shortage case, this situation can produce no more
 RX buffers in RX descriptor ring and this will take the box out of
 network. Other drivers avoid that situation by allocating new RX
 buffer before passing received frame to upper stack. If RX buffer
 allocation fails driver will just reuse old RX buffer without
 passing received frame to upper stack. That does not completely
 solve the keglim issue though. I think you should have enough mbuf
 clusters to avoid keglim.
 
 However the output above indicates you have enough free mbuf
 clusters. So I guess igb(4) encountered zero available RX buffer
 situation in past but failed to refill the RX buffer again. I guess
 driver may be able to periodically check available RX buffers.
 Jack may have better idea if this was the case.(CCed)
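
The "allocate the replacement first, otherwise recycle the old buffer" strategy described above can be sketched in userland C roughly as follows. This is a minimal model, assuming a single hypothetical slot: struct rx_slot, rx_process(), rx_alloc_fn, alloc_fail() and RX_BUF_SIZE are names made up for the illustration, and malloc() stands in for mbuf cluster allocation. It is not code from igb(4), em(4) or any other driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RX_BUF_SIZE 2048	/* arbitrary size chosen for the sketch */

/* Hypothetical RX ring slot; a real driver keeps an mbuf and a DMA map here. */
struct rx_slot {
	void *buf;
};

/* Allocator hook so that an allocation failure can be simulated. */
typedef void *(*rx_alloc_fn)(size_t);

/*
 * Handle one received frame sitting in `slot`.
 * Returns the filled buffer to hand to the upper stack, or NULL if the
 * frame was dropped because no replacement buffer could be allocated.
 * Either way the slot still owns a valid buffer afterwards, so the ring
 * can never run empty.
 */
static void *
rx_process(struct rx_slot *slot, rx_alloc_fn alloc)
{
	void *fresh, *full;

	assert(slot != NULL && slot->buf != NULL);
	fresh = alloc(RX_BUF_SIZE);
	if (fresh == NULL)
		return (NULL);	/* drop the frame, reuse the old buffer in place */
	full = slot->buf;
	slot->buf = fresh;
	return (full);
}

/* Allocator that always fails, to model mbuf cluster exhaustion. */
static void *
alloc_fail(size_t size)
{
	(void)size;
	return (NULL);
}
```

The trade-off in this sketch is that frames are lost while memory is short, but reception resumes by itself as soon as allocations succeed again, without an explicit refill pass.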
 
>>> 
>>> That is exactly the pattern. The driver runs out of clusters but they
>>> eventually get consumed and freed although the driver refuses to process
>> any
>>> new frames. It is, on the other hand, perfectly capable of sending out
>>> packets.
>>> 
>> 
>> Ok, this clearly indicates igb(4) failed to refill RX buffers since
>> you can still send frames. I'm not sure whether igb(4) controllers
>> could be configured to generate no RX buffer interrupts but that
>> interrupt would be better suited to trigger RX refilling than timer
>> based refilling. Since igb(4) keeps track of available RX buffers,
>> igb(4) can selectively enable that i

Re: ipfw, ipv6 and gif(4)

2011-02-08 Thread Hajimu UMEMOTO
Hi,

> On Tue, 08 Feb 2011 14:05:38 +0500
> "Eugene M. Zheganin"  said:

emz> As you can see, P:41 is IPv6:

emz> %grep 41 /etc/protocols
emz> ipv6    41      IPV6            # ipv6

emz> And, of course, ipfw doesn't allow me to create the rules it is
emz> actually logging:

emz> %ipfw add 7 allow 41 from 216.66.80.26 to 89.250.210.67 in via vlan104
emz> ipfw: bad address "216.66.80.26"

emz> Do I misunderstand the concept, or is it how it really should look ?

Something like `pass ip4 from any to any proto ipv6' should work for
you.
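
As background for why the rule has to be an ip4 rule: the packets that cross vlan104 are ordinary IPv4 packets whose protocol field is 41, with the IPv6 packet as payload, whereas ipencap in /etc/protocols is protocol 4 (IPv4 in IPv4). A minimal userland sketch of that check follows. The function name is_ipv6_in_ipv4() and the macro are made up for illustration, and this is not how ipfw itself is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROTO_IPV6_IN_IPV4 41	/* "ipv6" in /etc/protocols */

/*
 * Return 1 if `pkt` looks like an IPv4 packet carrying an encapsulated
 * IPv6 packet (protocol 41), and 0 otherwise.
 */
static int
is_ipv6_in_ipv4(const uint8_t *pkt, size_t len)
{
	assert(pkt != NULL);
	if (len < 20)			/* shorter than a minimal IPv4 header */
		return (0);
	if ((pkt[0] >> 4) != 4)		/* version nibble must be 4 */
		return (0);
	/* The protocol field sits at byte offset 9 of the IPv4 header. */
	return (pkt[9] == PROTO_IPV6_IN_IPV4);
}
```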

Sincerely,

--
Hajimu UMEMOTO @ Internet Mutual Aid Society Yokohama, Japan
u...@mahoroba.org  ume@{,jp.}FreeBSD.org
http://www.imasy.org/~ume/


IPv6 Extension Headers

2011-02-08 Thread Colin O'Keeffe
Hi,

I'm looking for some guidance on implementing extension headers in the kernel 
for outgoing packets and on processing them in incoming packets. Is anybody 
available to discuss it with me (on or off the mailing list) and help me get 
the ball rolling?

Thanks


RE: divert rewrite

2011-02-08 Thread rozhuk . im
> -Original Message-
> From: owner-freebsd-...@freebsd.org [mailto:owner-freebsd-
> n...@freebsd.org] On Behalf Of Sergey Matveychuk
> Sent: Monday, February 07, 2011 11:37 PM
> To: Julian Elischer
> Cc: Ivo Vachkov; FreeBSD Net
> Subject: Re: divert rewrite
> 
> 06.02.2011 4:42, Julian Elischer wrote:
> > On 2/5/11 4:09 PM, Ivo Vachkov wrote:
> >> Hello,
> >>
> >> How can I help?
> >
> > if you have ipv6 connectivity and experience, I have no experience or
> > connectivity, with it so
> > I'll be coding blind and will need a tester.
> > If you have an application for IPV6 testing that would be even
> better.
> > Divert is often used for NAT but that doesn't seem very useful for
> IPv6 and
> > natd doesn't support it anyhow.
> 
> Object :)
> Divert is really useful way to get packets from firewall to userspace,
> analyse or process them some way and put them back. Really I see no
> other way for this for IPv6. I've tried ng_socket+ng_nat but there is
> no
> easy way to put a packet back in firewall.
> 
> I'm very interested in the process. And I'm ready to help in testing.

Did you try ng_ether + ng_ksocket?
It can deliver Ethernet frames, encapsulated in UDP, to a user-space receiver.







Re: igb driver RX (was TX) hangs when out of mbuf clusters

2011-02-08 Thread Karim Fodil-Lemelin
> 2011/2/8 Michael Tüxen 
>
>> On Feb 8, 2011, at 4:29 AM, Karim Fodil-Lemelin wrote:
>>
>> > 2011/2/7 Pyun YongHyeon 
>> >
>> >> On Mon, Feb 07, 2011 at 09:21:45PM -0500, Karim Fodil-Lemelin wrote:
>> >>> 2011/2/7 Pyun YongHyeon 
>> >>>
>>  On Mon, Feb 07, 2011 at 05:33:47PM -0500, Karim Fodil-Lemelin wrote:
>> > Subject: Re: igb driver tx hangs when out of mbuf clusters
>> >
>> >> To: Lev Serebryakov 
>> >> Cc: freebsd-net@freebsd.org
>> >>
>> >>
>> >> 2011/2/7 Lev Serebryakov 
>> >>
>> >> Hello, Karim.
>> >>> You wrote on 7 February 2011, 19:58:04:
>> >>>
>> >>>
>>  The issue is with the igb driver from 7.4 RC3 r218406. If the
>> >> driver
>> >>> runs
>>  out of mbuf clusters it simply stops receiving even after the
>>  clusters
>> >>> have
>>  been freed.
>> >>>  It looks like my problems with em0 (see thread "em0 hangs
>> >> without
>> >>> any messages like "Watchdog timeout", only down/up reset it.")...
>> >>> Codebase for em and igb is somewhat common...
>> >>>
>> >>> --
>> >>> // Black Lion AKA Lev Serebryakov 
>> >>>
>> >>> I agree.
>> >>
>> >> Do you get missed packets in mac_stats (sysctl dev.em | grep
>> >> missed)?
>> >>
>> >> I might not have mentioned but I can also 'fix' the problem by
>> >> doing
>> >> ifconfig igb0 down/up.
>> >>
>> >> I will try using POLLING to 'automatize' the reset as you mentioned
>> >> in
>>  your
>> >> thread.
>> >>
>> >> Karim.
>> >>
>> >>
>> > Follow up on tests with POLLING: The problem is still occurring
>> >> although
>>  it
>> > takes more time ... Outputs of sysctl dev.igb0 and netstat -m will
>>  follow:
>> >
>> > 9219/99426/108645 mbufs in use (current/cache/total)
>> > 9217/90783/10/10 mbuf clusters in use
>> >> (current/cache/total/max)
>> 
>>  Do you see network processes are stuck in keglim state? If you see
>>  that I think that's not trivial to solve. You wouldn't even kill
>>  that process if it is under keglim state unless some more mbuf
>>  clusters are freed from other places.
>> 
>> >>>
>> >>> No keglim state, here is a snapshot of top -SH while the problem is
>> >>> happening:
>> >>>
>> >>>   12 root  171 ki31 0K 8K CPU5   5  19:27 100.00% idle: cpu5
>> >>>   10 root  171 ki31 0K 8K CPU7   7  19:26 100.00% idle: cpu7
>> >>>   14 root  171 ki31 0K 8K CPU3   3  19:25 100.00% idle: cpu3
>> >>>   11 root  171 ki31 0K 8K CPU6   6  19:25 100.00% idle: cpu6
>> >>>   13 root  171 ki31 0K 8K CPU4   4  19:24 100.00% idle: cpu4
>> >>>   15 root  171 ki31 0K 8K CPU2   2  19:22 100.00% idle: cpu2
>> >>>   16 root  171 ki31 0K 8K CPU1   1  19:18 100.00% idle: cpu1
>> >>>   17 root  171 ki31 0K 8K RUN0  19:12 100.00% idle: cpu0
>> >>>   18 root  -32- 0K 8K WAIT   6   0:04  0.10% swi4: clock s
>> >>>   20 root  -44- 0K 8K WAIT   4   0:08  0.00% swi1: net
>> >>>   29 root  -68- 0K 8K -  0   0:02  0.00% igb0 que
>> >>>   35 root  -68- 0K 8K -  2   0:02  0.00% em1 taskq
>> >>>   28 root  -68- 0K 8K WAIT   5   0:01  0.00% irq256: igb0
>> >>>
>> >>> keep in mind that num_queues has been forced to 1.
>> >>>
>> >>>
>> 
>>  I think both igb(4) and em(4) pass received frame to upper stack
>>  before allocating new RX buffer. If driver fails to allocate new RX
>>  buffer driver will try to refill RX buffers in next run. Under
>>  extreme resource shortage case, this situation can produce no more
>>  RX buffers in RX descriptor ring and this will take the box out of
>>  network. Other drivers avoid that situation by allocating new RX
>>  buffer before passing received frame to upper stack. If RX buffer
>>  allocation fails driver will just reuse old RX buffer without
>>  passing received frame to upper stack. That does not completely
>>  solve the keglim issue though. I think you should have enough mbuf
>>  clusters to avoid keglim.
>> 
>>  However the output above indicates you have enough free mbuf
>>  clusters. So I guess igb(4) encountered zero available RX buffer
>>  situation in past but failed to refill the RX buffer again. I guess
>>  driver may be able to periodically check available RX buffers.
>>  Jack may have better idea if this was the case.(CCed)
>> 
>> >>>
>> >>> That is exactly the pattern. The driver runs out of clusters but they
>> >>> eventually get consumed and freed although the driver refuses to
>> process
>> >> any
>> >>> new frames. It is, on the other hand, perfectly capable of sending out
>> >>> 

Re: divert rewrite

2011-02-08 Thread Sergey Matveychuk

07.02.2011 18:36, Sergey Matveychuk wrote:

06.02.2011 4:42, Julian Elischer wrote:

On 2/5/11 4:09 PM, Ivo Vachkov wrote:

Hello,

How can I help?


if you have ipv6 connectivity and experience, I have no experience or
connectivity, with it so
I'll be coding blind and will need a tester.
If you have an application for IPV6 testing that would be even better.
Divert is often used for NAT but that doesn't seem very useful for
IPv6 and
natd doesn't support it anyhow.


Object :)
Divert is really useful way to get packets from firewall to userspace,
analyse or process them some way and put them back. Really I see no
other way for this for IPv6. I've tried ng_socket+ng_nat but there is no
easy way to put a packet back in firewall.


Oops, I meant ng_socket+ng_ipfw here.


Re: divert rewrite

2011-02-08 Thread Sergey Matveychuk

08.02.2011 19:08, rozhuk...@gmail.com wrote:

Did you try ng_ether + ng_ksocket?
It can deliver Ethernet frames, encapsulated in UDP, to a user-space receiver.


The idea is to catch packets from the firewall (ng_ipfw; ng_nat was mentioned 
by mistake) and pass them to a user-space module that does some processing 
and puts the packets back into the firewall (for rules with the `diverted' 
keyword).


This works now for IPv4 with `divert' but not for IPv6.


TCP can advertise a really huge window

2011-02-08 Thread John Baldwin
This is a very bizarre edge case, so bear with me.  I was debugging an edge 
case at work recently that occurred when the socket buffer was filled up 
exactly (i.e. sbspace(&so->so_rcv) == 0).  In TCP terms, this would be when 
rcv_nxt == rcv_adv.  To simulate the real workload I had a very fast writer 
blasting over lo0 to a slow reader but had used a small buffer and turned off 
window scaling (as I had to fill the entire socket buffer, I was chasing an 
off-by-1 bug).  However, I ended up with some bizarre behavior.

I think it is less confusing to describe the sequence of events that I now 
know happened than how I figured this out, so here goes.

- Assume we have advertised a window size of N which corresponds exactly to
  sbspace(&so->so_rcv).
- The remote peer sends a packet of length N filling our window.  We respond
  with a zero-window ACK.  This advances rcv_nxt to == rcv_adv, but it does
  not grow rcv_adv because sbspace() is currently 0.
- The userland app very slowly drains data from the socket buffer.  However,
  the calls to tcp_usr_recvd() do not trigger a window update because in this
  case the link is over lo0 which has a relatively large t_maxseg (about 14k)
  and this condition in tcp_output() is not met:

	if (adv >= (long) (2 * tp->t_maxseg))
		goto send;
	if (2 * adv >= (long) so->so_rcv.sb_hiwat)
		goto send;

- A timer at the remote peer expires and it sends a window probe with one
  byte of data.  Since userland has read some data (just not 2 * MSS), we
  accept this packet.  However, receiving this packet moves rcv_nxt += 1,
  so rcv_nxt is now > rcv_adv.
- We call tcp_output() to ACK the window probe and as part of this calculate
  the receive window to advertise here:

	if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
	    recwin < (long)tp->t_maxseg)
		recwin = 0;
	if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
		recwin = (long)(tp->rcv_adv - tp->rcv_nxt);
	if (recwin > (long)TCP_MAXWIN << tp->rcv_scale)
		recwin = (long)TCP_MAXWIN << tp->rcv_scale;

The "surprise" kicks in on the second conditional.  The problem is that 
rcv_adv - rcv_nxt is now equal to (uint32_t)-1.  On a 32-bit machine the cast 
to (long) effectively just makes this value signed and thus -1.  On a 64-bit 
machine you actually end up with a ginormous value of 2^32 - 1, or a 4GB 
window (minus a byte).  The third conditional truncates that to the maximum 
window we can advertise, but this value may be larger than the actual space in 
the socket buffer.  The remote peer now has a huge window to throw data into.
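
To make the cast behaviour concrete, here is a minimal userland sketch of the second and third conditionals. It is not kernel code: the function names are made up for illustration, the two widths of long are modelled with int32_t and int64_t so that the result does not depend on the build machine, and the first conditional (the sb_hiwat / 4 test) is left out.

```c
#include <assert.h>
#include <stdint.h>

#define TCP_MAXWIN 65535	/* largest unscaled window */

/* Sequence-space "a > b", same idea as the kernel's SEQ_GT(). */
int seq_gt(uint32_t a, uint32_t b)
{
	return (int32_t)(uint32_t)(a - b) > 0;
}

/* (long)(rcv_adv - rcv_nxt) where long is 32 bits: reinterpreted as signed. */
int64_t delta_long32(uint32_t rcv_adv, uint32_t rcv_nxt)
{
	return (int32_t)(uint32_t)(rcv_adv - rcv_nxt);
}

/* The same expression where long is 64 bits: zero-extended, never negative. */
int64_t delta_long64(uint32_t rcv_adv, uint32_t rcv_nxt)
{
	return (int64_t)(uint32_t)(rcv_adv - rcv_nxt);
}

/* Second and third conditionals as they stand, with a 64-bit long. */
int64_t recwin_unpatched(int64_t recwin, uint32_t rcv_adv, uint32_t rcv_nxt,
    unsigned rcv_scale)
{
	int64_t maxwin;

	assert(rcv_scale <= 14);
	maxwin = (int64_t)TCP_MAXWIN << rcv_scale;
	if (recwin < delta_long64(rcv_adv, rcv_nxt))
		recwin = delta_long64(rcv_adv, rcv_nxt);
	if (recwin > maxwin)
		recwin = maxwin;
	return recwin;
}

/* Same, but only trust the difference when rcv_adv is really ahead. */
int64_t recwin_patched(int64_t recwin, uint32_t rcv_adv, uint32_t rcv_nxt,
    unsigned rcv_scale)
{
	int64_t maxwin;

	assert(rcv_scale <= 14);
	maxwin = (int64_t)TCP_MAXWIN << rcv_scale;
	if (seq_gt(rcv_adv, rcv_nxt) &&
	    recwin < delta_long64(rcv_adv, rcv_nxt))
		recwin = delta_long64(rcv_adv, rcv_nxt);
	if (recwin > maxwin)
		recwin = maxwin;
	return recwin;
}
```

With rcv_nxt one byte past rcv_adv and 100 bytes of real space, recwin_unpatched() returns the full 65535 while recwin_patched() returns 100; when rcv_adv is genuinely ahead of rcv_nxt both return the same value.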

At work this proved disastrous.  I'm not sure if there are any practical 
concerns.  This is the patch I'm using as a fix:

Index: tcp_output.c
===
--- tcp_output.c	(revision 215582)
+++ tcp_output.c	(working copy)
@@ -928,7 +928,8 @@
 	if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
 	    recwin < (long)tp->t_maxseg)
 		recwin = 0;
-	if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
+	if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt) &&
+	    recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
 		recwin = (long)(tp->rcv_adv - tp->rcv_nxt);
 	if (recwin > (long)TCP_MAXWIN << tp->rcv_scale)
 		recwin = (long)TCP_MAXWIN << tp->rcv_scale;

-- 
John Baldwin


A small TCP bug: excessive duplicate ACKs

2011-02-08 Thread John Baldwin
One thing I've noticed at work is that if a receiver's socket buffer fills and 
the receiver then drains the buffer all at once, we send a lot of duplicate 
ACKs.  I narrowed this down to being due to the abnormally high window scaling 
factor we have.  We set kern.ipc.maxsockbuf to 314572800 which results in a 
window scaling factor of 8k.  This interacts poorly with the logic that 
decides whether or not to force a window update in tcp_output():

	/*
	 * Compare available window to amount of window
	 * known to peer (as advertised window less
	 * next expected input).  If the difference is at least two
	 * max size segments, or at least 50% of the maximum possible
	 * window, then want to send a window update to peer.
	 * Skip this if the connection is in T/TCP half-open state.
	 * Don't send pure window updates when the peer has closed
	 * the connection and won't ever send more data.
	 */
	if (recwin > 0 && !(tp->t_flags & TF_NEEDSYN) &&
	    !TCPS_HAVERCVDFIN(tp->t_state)) {
		/*
		 * "adv" is the amount we can increase the window,
		 * taking into account that we are limited by
		 * TCP_MAXWIN << tp->rcv_scale.
		 */
		long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
			(tp->rcv_adv - tp->rcv_nxt);

		if (adv >= (long) (2 * tp->t_maxseg))
			goto send;
		if (2 * adv >= (long) so->so_rcv.sb_hiwat)
			goto send;
	}

Specifically, we can send a duplicate ACK when (2 * tp->t_maxseg) or
(so->so_rcv.sb_hiwat / 2) are less than the window scaling factor.  I have a 
test app that you can run against a TCP chargen service from inetd to 
reproduce it.  I also have two TCP dumps from before and after.  The patch I'm 
using to fix this is below (I could rework it to not use the extra goto 
perhaps, but went with a simple hack to minimize reindenting for now):

Index: tcp_output.c
===
--- tcp_output.c	(revision 217650)
+++ tcp_output.c	(working copy)
@@ -560,11 +560,19 @@
 		long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
 			(tp->rcv_adv - tp->rcv_nxt);

+		/*
+		 * If the new window size ends up being the same as the old
+		 * size when it is scaled, then don't force a window update.
+		 */
+		if ((tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale ==
+		    (adv + tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale)
+			goto dontupdate;
 		if (adv >= (long) (2 * tp->t_maxseg))
 			goto send;
 		if (2 * adv >= (long) so->so_rcv.sb_hiwat)
 			goto send;
 	}
+dontupdate:

 	/*
 	 * Send if we owe the peer an ACK, RST, SYN, or urgent data.  ACKNOW

Note that if the ACK sequence number has moved then I think other checks in 
tcp_output() will still force an ACK packet out, so I don't think this will 
cause us to miss on sending ACKs to the peers.
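
As a rough illustration of why a large scale factor hides small window changes, the following userland sketch models the comparison the patch adds. The helper names are hypothetical, and window_shift_for() is only an assumed model of how the shift is picked from the maximum socket buffer size (the smallest shift whose maximum window covers it); it is not copied from the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define TCP_MAXWIN		65535	/* largest unscaled window */
#define TCP_MAX_WINSHIFT	14	/* largest window scale shift */

/* Assumed rule: smallest shift whose maximum window covers bufsize. */
unsigned window_shift_for(uint64_t bufsize)
{
	unsigned shift = 0;

	while (shift < TCP_MAX_WINSHIFT &&
	    ((uint64_t)TCP_MAXWIN << shift) < bufsize)
		shift++;
	return shift;
}

/* Window value that goes on the wire: the low 'shift' bits are lost. */
uint32_t wire_window(uint32_t win, unsigned shift)
{
	assert(shift <= TCP_MAX_WINSHIFT);
	return win >> shift;
}

/* Nonzero if growing the window by adv changes what the peer sees. */
int update_is_visible(uint32_t old_win, uint32_t adv, unsigned shift)
{
	return wire_window(old_win, shift) !=
	    wire_window(old_win + adv, shift);
}
```

Under that assumed rule a 314572800-byte limit gives a shift of 13, i.e. a granularity of 8192 bytes. Growing an 8192-byte window by 3000 bytes (more than two 1448-byte segments) is then invisible on the wire, so the resulting segment is a duplicate ACK, while growing it by a further 8192 bytes is a real update.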

You can find the test app source (tcpslow.c) and the dumps at 
http://people.freebsd.org/~jhb/tcpslow/

If you look at tcp_bad.out, the receiver stops reading data and the receiver's 
socket buffer fills up around packet 72 or so.  The receiver wakes up at 
packet 88 and drains the buffer causing a small storm of window updates.  
However, due to the scaling factor, it actually sends duplicate ACKs in 
batches of threes (3 ACKs for 8k window, 3 ACKs for 16k window, etc.).  This 
happens each time the receiver wakes up and drains a full socket buffer.  The 
tcp_good.out dump shows the stream with the patch applied.  A similar event of 
the receiver draining a full buffer starts at packet 83 and it sends a single 
ACK for each "real" window update.

-- 
John Baldwin


Re: divert rewrite

2011-02-08 Thread Julian Elischer

08.02.2011 19:08, rozhuk...@gmail.com wrote:

Did you try ng_ether + ng_ksocket?
It can translate Ethernet frames incapsulated to udp to user space 
receiver.


The idea is catch packets from firewall (ng_ipfw, ng_nat was 
mentioned by mistake) and pass them to user space module that do 
some processing and puts back the packets into firewall (for rules 
with `diverted' keyword).


Yes, however, did you try the ipfw netgraph keyword and the ng_ipfw node?
I have also been wondering if it might not make sense to simply 
replace the divert code with
a netgraph equivalent. Using the ng_ipfw node one can almost do it 
with no changes as it is.




It works now for IPv4 with `divert' and doesn't with IPv6.


yes, I'm pondering the right fix for that..




TCP connections stuck in persist state

2011-02-08 Thread John Baldwin
I ran into a problem recently where a TCP socket seemed to never exit persist 
mode.  What would happen is that the sender was blasting data faster than the 
receiver could receive it.  When the receiver read some data, the sender would 
start sending again and everything would resume.  However, after 3-4 instances 
of this, the sender would decide to not resume sending data when the receiver 
opened the window.  Instead, it would slowly send a byte every few seconds via 
the persist timer even though the receiver was advertising a 64k window when 
it ACKd each of the window probes that was sent by the sender.

I dug around in kgdb and found that both snd_cwnd and snd_ssthresh were set to 
0 on the sender side.  I think this means that the send window is effectively 
permanently stuck at zero as a result.  (The tcpcb is also 
IN_FASTRECOVERY() on the sender side, probably from the storm of duplicate 
ACKs from the receiver when it sends a bunch of window updates; see my earlier
e-mail to net@ for the source of the duplicate ACKs.)  Anyway, I think that this
code in tcp_input() is what keeps the window at zero:

	/*
	 * If the congestion window was inflated to account
	 * for the other side's cached packets, retract it.
	 */
	if (tcp_do_newreno || (tp->t_flags & TF_SACK_PERMIT)) {
		if (IN_FASTRECOVERY(tp)) {
			if (SEQ_LT(th->th_ack, tp->snd_recover)) {
				if (tp->t_flags & TF_SACK_PERMIT)
					tcp_sack_partialack(tp, th);
				else
					tcp_newreno_partial_ack(tp, th);
			} else {
				/*
				 * Out of fast recovery.
				 * Window inflation should have left us
				 * with approximately snd_ssthresh
				 * outstanding data.
				 * But in case we would be inclined to
				 * send a burst, better to do it via
				 * the slow start mechanism.
				 */
				KASSERT(tp->snd_ssthresh != 0,
				    ("using bogus snd_ssthresh"));
				if (SEQ_GT(th->th_ack +
				    tp->snd_ssthresh,
				    tp->snd_max))
					tp->snd_cwnd = tp->snd_max -
					    th->th_ack +
					    tp->t_maxseg;
				else
					tp->snd_cwnd = tp->snd_ssthresh;
			}
		}

Specifically, since snd_recover and snd_una seem to keep advancing in
lock-step with each window update, I think it ends up falling down to the
last statement each time where snd_cwnd = snd_ssthresh thus keeping
snd_cwnd at 0.  This then causes a zero send window in tcp_output():

sendwin = min(tp->snd_wnd, tp->snd_cwnd);
sendwin = min(sendwin, tp->snd_bwnd);
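
The chain from a zero snd_ssthresh to a zero send window can be traced with a small userland sketch. The function names are hypothetical and the arithmetic is reduced to plain 32-bit values; it only mirrors the two branches and the two min() calls quoted above.

```c
#include <assert.h>
#include <stdint.h>

uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Sequence-space "a > b", same idea as the kernel's SEQ_GT(). */
int seq_gt(uint32_t a, uint32_t b)
{
	return (int32_t)(uint32_t)(a - b) > 0;
}

/* snd_cwnd chosen on leaving fast recovery, per the branch quoted above. */
uint32_t cwnd_after_recovery(uint32_t th_ack, uint32_t snd_max,
    uint32_t snd_ssthresh, uint32_t t_maxseg)
{
	/* An acceptable ACK never acknowledges data beyond snd_max. */
	assert(!seq_gt(th_ack, snd_max));
	if (seq_gt(th_ack + snd_ssthresh, snd_max))
		return snd_max - th_ack + t_maxseg;
	return snd_ssthresh;
}

/* Usable send window as computed at the top of tcp_output(). */
uint32_t send_window(uint32_t snd_wnd, uint32_t snd_cwnd, uint32_t snd_bwnd)
{
	return min_u32(min_u32(snd_wnd, snd_cwnd), snd_bwnd);
}
```

With snd_ssthresh == 0 the SEQ_GT() test can never succeed, because th_ack is at most snd_max, so cwnd_after_recovery() always returns 0, and send_window() then returns 0 even when the peer advertises a 64k window.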

Now, looking at the code, I can see no way that snd_ssthresh should ever be 
zero.  It seems to always be calculated from some number of segments times
t_maxseg.  The one exception to this rule is when it is restored from 
snd_ssthresh_prev due to a bad retransmit (this example is from tcp_input()):

	/*
	 * If we just performed our first retransmit, and the ACK
	 * arrives within our recovery window, then it was a mistake
	 * to do the retransmit in the first place.  Recover our
	 * original cwnd and ssthresh, and proceed to transmit where
	 * we left off.
	 */
	if (tp->t_rxtshift == 1 && (int)(ticks - tp->t_badrxtwin) < 0) {
		++tcpstat.tcps_sndrexmitbad;
		tp->snd_cwnd = tp->snd_cwnd_prev;
		tp->snd_ssthresh = tp->snd_ssthresh_prev;
		tp->snd_recover = tp->snd_recover_prev;
		if (tp->t_flags & TF_WASFRECOVERY)
			ENTER_FASTRECOVERY(tp);
		tp->snd_nxt = tp->snd_max;
		tp->t_badrxtwin = 0;	/* XXX probably not required */
	}
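
For reference, the time test in that snippet is a signed difference of tick counters. The sketch below models it in userland under two assumptions that are not taken from the kernel sources: the tick counter is a signed 32-bit value that turns negative once it wraps, and t_badrxtwin is 0 when it was never set. The function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "(int)(ticks - tp->t_badrxtwin) < 0" with 32-bit counters. */
int bad_rexmit_window_open(int32_t ticks, uint32_t t_badrxtwin)
{
	return (int32_t)((uint32_t)ticks - t_badrxtwin) < 0;
}

/* Small self-check covering the cases of interest. */
void check_examples(void)
{
	/* Deadline still in the future: the window is open. */
	assert(bad_rexmit_window_open(1000, 1500));
	/* Deadline already passed: the window is closed. */
	assert(!bad_rexmit_window_open(2000, 1500));
	/* Deadline never set and ticks small and positive: closed. */
	assert(!bad_rexmit_window_open(5, 0));
	/* Deadline never set but ticks negative after a wrap: open. */
	assert(bad_rexmit_window_open(-5, 0));
}
```

Under those assumptions a t_badrxtwin of 0 looks like a deadline in the future for as long as the tick counter is negative, so the restore path would be taken on the first retransmit regardless of whether the *_prev fields were ever saved.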

So then my working theory is that somehow, snd_ssthresh_prev is being used
when it hasn't been initialized.  I then checked 'ticks' on my host and found 
that it had wrapped 

Re: divert rewrite

2011-02-08 Thread Sergey Matveychuk

08.02.2011 20:03, Julian Elischer wrote:

08.02.2011 19:08, rozhuk...@gmail.com wrote:

Did you try ng_ether + ng_ksocket?
It can translate Ethernet frames incapsulated to udp to user space
receiver.


The idea is catch packets from firewall (ng_ipfw, ng_nat was mentioned
by mistake) and pass them to user space module that do some processing
and puts back the packets into firewall (for rules with `diverted'
keyword).


yes, however did you try the ipfw netgraph keyword and the ng_ipfw node?
I have also been wondering it it might not make sense to simpply
replavce the diver code with
a netgraph equivalent.. Using the ng_ipfw node one can almost do it with
no changes as it is.


I've tried ng_socket+ng_ipfw. It gets incoming packets, but outgoing 
packets are dropped because a tag is lost after leaving kernel space.
It looks like some magic can be done with the ng_tag node, but I really could 
not tame it.






It works now for IPv4 with `divert' and doesn't with IPv6.


yes, I'm pondering the right fix for that..


I'd like to be the first to test it, please :)


Re: Proposed patch for Port Randomization modifications according to RFC6056

2011-02-08 Thread Doug Barton
I've been up and running on this patch vs. r218391 for over 24 hours 
now, using algorithm 4 (as someone said is now the default in Linux) 
without any problems.


I think Bjoern is better qualified than I to comment on the style of the 
patch, but it applies cleanly, and seems to run fine on both v4 and v6.



hth,

Doug


On 01/31/2011 04:52, Ivo Vachkov wrote:

Hello,

I attach the latest version of the port randomization code as a patch
against RELENG_8.

Changelog:
1) sysctl variable names are changed to:
- 'net.inet.ip.portrange.randomalg.version' - representing the
algorithm of choice.
- 'net.inet.ip.portrange.randomalg.alg5_tradeoff' - representing the
Algorithm 5 computational tradeoff value (the 'N' value in the
Algorithm 5 description in the RFC 6056).
2) Code comments are synchronized with the current variable names.

Ivo Vachkov

On Sat, Jan 29, 2011 at 4:27 AM, Doug Barton  wrote:

On 01/28/2011 11:57, Ivo Vachkov wrote:


On Fri, Jan 28, 2011 at 9:00 PM, Doug Barton wrote:



How does net.inet.ip.portrange.randomalg sound? I would also suggest that
the second sysctl be named net.inet.ip.portrange.randomalg.alg5_tradeoff
so
that one could do 'sysctl net.inet.ip.portrange.randomalg' and see both
values. But I won't quibble on that. :)



I have no objections with this. Since this is my first attempt to
contribute something back to the community I decided to see how it's
done before. So I found:
net.inet.tcp.rfc1323
net.inet.tcp.rfc3465
net.inet.tcp.rfc3390
net.inet.tcp.rfc3042
which probably led me in a wrong direction :)


Yeah, I had actually intended to say something to the effect of "there are
plenty of unfortunate examples in the tree already so your doing it that way
is totally understandable" but I trimmed it.


I understand your point and agree with it. However, my somewhat
limited understanding of the sysctl internal organization is telling
me that tree node does not support values. Am I wrong?


You are likely correct. :)  It's an inconvenient fact that I often forget
because that's not the sandbox that I usually play in.


If my reasoning
is correct, maybe I can create the sysctl variables with the following
names:
- net.inet.ip.portrange.randomalg (Tree Node)
- net.inet.ip.portrange.randomalg.alg[orithm] (Leaf Node, to store the
selected algorithm)


I would go with "version" to increase the visual distinctiveness. I searched
the current tree and there doesn't seem to be a clear winner for how to
portray "this is the current N/M that is in use" but "version" seems to have
the most representatives.


- net.inet.ip.portrange.randomalg.alg5_tradeoff (Leaf Node, to store
the Algorithm 5 trade-off value)


I'm assuming this is the "N" value mentioned in the RFC. If so, I commend
you on your choice of "tradeoff" to represent it. :)



RE: divert rewrite

2011-02-08 Thread rozhuk . im
> -Original Message-
> From: Sergey Matveychuk [mailto:s...@freebsd.org]
> Sent: Wednesday, February 09, 2011 12:53 AM
> To: rozhuk...@gmail.com
> Cc: freebsd-net@freebsd.org
> Subject: Re: divert rewrite
> 
> 08.02.2011 19:08, rozhuk...@gmail.com wrote:
> > Did you try ng_ether + ng_ksocket?
> > It can translate Ethernet frames incapsulated to udp to user space
> receiver.
> 
> The idea is catch packets from firewall (ng_ipfw, ng_nat was mentioned
> by mistake) and pass them to user space module that do some processing
> and puts back the packets into firewall (for rules with `diverted'
> keyword).
> 
> It works now for IPv4 with `divert' and doesn't with IPv6.

I know how divert works, google: uTPControl ;)
It's simple to develop with and stable, but uses a lot of CPU.

With ng_ether + ng_ksocket you can send custom Ethernet frames.
There is a node that can filter traffic; for IPv6 you need to allow 1 or 2 
Ethernet types to pass.






Re: divert rewrite

2011-02-08 Thread Sergey Matveychuk

08.02.2011 21:47, rozhuk...@gmail.com wrote:

-Original Message-
From: Sergey Matveychuk [mailto:s...@freebsd.org]
Sent: Wednesday, February 09, 2011 12:53 AM
To: rozhuk...@gmail.com
Cc: freebsd-net@freebsd.org
Subject: Re: divert rewrite

08.02.2011 19:08, rozhuk...@gmail.com wrote:

Did you try ng_ether + ng_ksocket?
It can translate Ethernet frames incapsulated to udp to user space

receiver.

The idea is catch packets from firewall (ng_ipfw, ng_nat was mentioned
by mistake) and pass them to user space module that do some processing
and puts back the packets into firewall (for rules with `diverted'
keyword).

It works now for IPv4 with `divert' and doesn't with IPv6.


I know how divert works, google: uTPControl ;)
Its simple for developmet, stable, but uses many CPU.

With ng_ether + ng_ksocket you can send custom Ethernet frames.
There is some node that can filter traffic, for IPv6 you need allow 1 or 2 
ethernet types to pass.


I know. But I've written a module that works in conjunction with ipfw. It 
decides by some criteria whether to pass traffic or to block it. 
Administrators on our networks decide in their firewalls what kind of traffic 
to pass to my module (mostly TCP SYN and a few UDP packets).

So working in conjunction with ipfw is the goal.


Re: bogus 0 len IP packet, was: Hang in VOP_LOCK1_APV on 8-STABLE with NFS.

2011-02-08 Thread Ronald Klop
On Mon, 07 Feb 2011 01:22:36 +0100, Pyun YongHyeon wrote:



On Sun, Feb 06, 2011 at 11:54:49PM +0100, Ronald Klop wrote:

On Sat, 22 Jan 2011 00:01:47 +0100, Ronald Klop
 wrote:

>On Tue, 18 Jan 2011 09:38:04 +0100,  wrote:
>
 So, does anyone have an idea why the IP length field would be set  
to

>>>0
 for these TCP/IP packets?

 Here's some info from Ronald w.r.t. his hardware. (All I can think
>>>of is
 that he could try disabling TSO, etc?)

 Thanks in advance for any help with this, rick

>>>
>>>It seems that issue came from TSO. Driver will set ip_len and
>>>ip_sum field to 0 before passing the TCP segment to controller.
>>>The failed length were 4446, 5858, 3034 and 4310 and the total
>>>number of such frames are more than 35k within 90 seconds. Since
>>>failed length 4310 is continuously repeated I guess there is edge
>>>case where em(4) didn't free failed TCP segment for TSO.
>>>I remember there was commit to HEAD(r217295) which could be related
>>>with this issue.
>>
>>I'm seeing the same problem with Broadcom NetXtreme (bce) cards:
>>
>>bce0@pci0:3:0:0:class=0x02 card=0x03421014 chip=0x164c14e4
>>rev=0x12 hdr=0x00
>>vendor = 'Broadcom Corporation'
>>device = 'Broadcom NetXtreme II Gigabit Ethernet Adapter
>>(BCM5708)'
>>class  = network
>>subclass   = ethernet
>>
>>This is with 8.2-PRERELEASE. Turning off TSO (ifconfig bce0 -tso)
>>removes the problem.
>>
>>Steinar Haug, Nethelp consulting, sth...@nethelp.no
>
>I tried -tso and -txcsum in various combinations, but it didn't solve
>the problem. I wil look for another brand of network card to try. But
>this has to wait till monday when I'm at the office again.

I also used another network card (rl0) and it has the same problem with
NFS. I'm going to change some network cables to see if that helps. I have
some hints that there might be something wrong with that.



Hmm, given that rl(4) also shows the issue it seems the issue could
be in TCP/IP stack, not in driver side. rl(4) is dumb device so
network stack should do segmentation and checksum computation.
I highly doubt the issue came from faulty cable since other users
also reported the same issue.
Unfortunately I have no clue yet and I was not able to reproduce it
on my box. I vaguely guess some code in kernel changed the ip_len
to 0 in the middle of transmission. Rick's captured traffic looks
normal except 0 ip_len given that controller is computing checksum
on the fly. If mbuf chain was corrupted(e.g. m_len == 0) driver
would have failed to send those frames.


Changing the cable didn't help indeed. I'm glad the issue is seen by  
others too. I will try to downgrade to an older version of FreeBSD to try  
to find the commit which broke it. But that can take a while, because it  
is time consuming and I have to do some real work also at work. :-)


Thanks for taking the time for it and I hope we will find the cause  
someday,


Ronald.


jumbo frames + geom_mirror = no net

2011-02-08 Thread rozhuk . im
Hi!


I have 8.2 + latest updates, em + gigabit net, and a few HDDs in a mirror.
Samba shares the HDDs to Windows hosts.
(E5300, G33 + ICH9R, 2GB, PCI-E Intel desktop GB adapter)

ifconfig_em0="inet 172.16.0.254 netmask 255.255.255.0 mtu 9000"

When I start copying files to the mirror (through the net or using cp from other
HDDs on the host), after some time the host stops responding on the network, with
no error messages in the logs.
(top shows free mem < 10 MB just before it happens.)
(It started after the first mirror was created.)

It doesn't respond until a reboot or: ifconfig em0 mtu 1500 (from the console)


vmstat shows many failures for mbuf_jumbo_9k:
mbuf_jumbo_9k:     9216,   6400,      0,      0,   36202768,  8472897

but no failures for the ordinary 1.5k mbufs :/




vmstat -z
ITEM               SIZE   LIMIT    USED    FREE    REQUESTS  FAILURES

UMA Kegs:           128,      0,     97,     23,         97,        0
UMA Zones:          888,      0,     97,      3,         97,        0
UMA Slabs:          284,      0,   1068,    514,      90680,        0
UMA RCntSlabs:      544,      0,   2649,    529,   13469257,        0
UMA Hash:           128,      0,      1,     29,          4,        0
16 Bucket:           76,      0,     12,     88,        122,        0
32 Bucket:          140,      0,     16,     96,        308,        0
64 Bucket:          268,      0,     43,     69,        561,       13
128 Bucket:         524,      0,    323,      6,     599692,     5140
VM OBJECT:          136,      0,  19972,  36317,    1260502,        0
MAP:                136,      0,      7,     22,          7,        0
KMAP ENTRY:          72,  57505,   2188,   2423,    4613615,        0
MAP ENTRY:           72,      0,   2418,    762,    4043456,        0
DP fakepg:           72,      0,      0,      0,          0,        0
SG fakepg:           72,      0,      0,      0,          0,        0
mt_zone:           2056,      0,    175,    237,        175,        0
16:                  16,      0,   4239,    836,   16310509,        0
32:                  32,      0,   2514,   1780,  124403442,        0
64:                  64,      0,   5044,   3334,  402804800,        0
128:                128,      0,    792,   1428,    7002330,        0
256:                256,      0,    592,    593,    1093975,        0
512:                512,      0,    165,   1035,     671161,        0
1024:              1024,      0,     55,    181,     882578,        0
2048:              2048,      0,    150,    250,      73957,        0
4096:              4096,      0,    132,    243,     734964,        0
Files:               56,      0,   5737,    628,    8850040,        0
TURNSTILE:           72,      0,    416,     64,        451,        0
umtx pi:             52,      0,      0,      0,          0,        0
PROC:               680,      0,     81,    171,      38814,        0
THREAD:             720,      0,    277,    138,      11326,        0
SLEEPQUEUE:          44,      0,    416,    115,        451,        0
VMSPACE:            228,      0,     57,    164,      31952,        0
cpuset:              40,      0,      2,    182,          2,        0
mbuf_packet:        256,      0,   4100,    609,   76165127,        0
mbuf:               256,      0,    175,   1206,  398211372,        0
mbuf_cluster:      2048,  65536,   4845,    453,   30848491,        0
mbuf_jumbo_page:   4096,  12800,      0,      0,          0,        0
mbuf_jumbo_9k:     9216,   6400,      0,      0,   36202768,  8472897
mbuf_jumbo_16k:   16384,   3200,      0,      0,          0,        0
mbuf_ext_refcnt:      4,      0,      4,    402,   25681942,        0
ttyoutq:            256,      0,     64,     41,        136,        0
g_bio:              140,      0,      0,   1568,  205564803,        0
ttyinq:             152,      0,    120,     62,        255,        0
ata_request:        208,      0,      0,    304,   67886293,        0
ata_composite:      180,      0,      0,      0,          0,        0
VNODE:              268,      0,  19632,  27674,    5052284,        0
VNODEPOLL:           60,      0,     16,    173,         21,        0
S VFS Cache:         72,      0,  12259,  51977,    1280216,        0
L VFS Cache:        292,      0,   2976,    729,      70042,        0
NAMEI:             1024,      0,      0,    260,  153323563,        0
DIRHASH:           1024,      0,     57,    343,      57340,        0
AIO:                120,      0,      3,     93,         19,        0
AIOP:                16,      0,      4,    402,         44,        0
AIOCB:              292,      0,      0,    390,    6897050,        0
AIOL:                64,      0,      0,      0,          0,        0
AIOLIO:             168,      0,      0,      0,          0,        0
pipe:               392,      0,     50,    170,      25245,        0
ksiginfo:

Re: igb driver RX (was TX) hangs when out of mbuf clusters

2011-02-08 Thread Jack Vogel
I have been following this, and thinking about it. I am still working from a
theoretical standpoint, but based on a patch I got quite a long time back and
never quite grokked, I believe now that I might have a solution.

The original PR and patch was kern/150516 from Beezar Liu. I was never quite
comfortable with the code changes, nor convinced that it was a real issue and
not a misunderstanding. However, I think now that this very report might be
behind what we are seeing today. I have a slightly different approach to
solving it; of course it remains to be seen if it handles it properly.

Please try the patch I've attached; I'm open to further correction or polishing
of the changes. And thanks to Beezar for his original report and changes. This
is not for em, but if this eliminates the problem it's clearly needed in all
drivers.

Jack
Index: if_igb.c
===
--- if_igb.c	(revision 218463)
+++ if_igb.c	(working copy)
@@ -4312,6 +4312,7 @@
 		struct mbuf		*sendmp, *mh, *mp;
 		struct igb_rx_buf	*rxbuf;
 		u16			hlen, plen, hdr, vtag;
+		int			commit;
 		bool			eop = FALSE;
  
 		cur = &rxr->rx_base[i];
@@ -4440,10 +4441,22 @@
 		bus_dmamap_sync(rxr->rxdma.dma_tag, rxr->rxdma.dma_map,
 		BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
 
+		commit = i;	/* capture the old index */
+
 		/* Advance our pointers to the next descriptor. */
 		if (++i == adapter->num_rx_desc)
 			i = 0;
 		/*
+		** Sanity test for ring full, if this
+		** happens we need to refresh immediately
+		** or refresh may deadlock.
+		*/
+		if (i == rxr->next_to_refresh) {
+			igb_refresh_mbufs(rxr, commit);
+			processed = 0;
+		}
+
+		/*
 		** Send to the stack or LRO
 		*/
 		if (sendmp != NULL) {
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: igb driver RX (was TX) hangs when out of mbuf clusters

2011-02-08 Thread Karim Fodil-Lemelin
2011/2/8 Jack Vogel 

>
> I have been following this, and thinking about it. I still am working from
> a theoretical
> standpoint, but based on a patch I got quite a long time back and never
> quite groked,
> I believe now that I might have a solution.
>
> The original PR and patch was kern/150516 from Beezar Liu,  I was never
> quite comfortable
> with the code changes, nor convinced that it was a real issue and not a
> misunderstanding.
> However I think now that this very report might be behind what we are
> seeing today. I have
> a slightly different approach to solving it, of course it remains to be
> seen if it handles it
> properly.
>
> Please try the patch I've attached, I'm open to further correction or
> polishing of the
> changes. And thanks to Beezar for his original report and changes, this is
> not for em,
> but if this eliminates the problem its clearly needed in all drivers.
>
> Jack

Hi Jack,

Thanks for your help. I tried your patch and it didn't work so I added a
couple of printf to see if the added code was getting hit:

--- a/freebsd/sys/dev/e1000/if_igb.c
+++ b/freebsd/sys/dev/e1000/if_igb.c
@@ -612,7 +612,7 @@ igb_attach(device_t dev)
 	    device_get_nameunit(dev));

 	INIT_DEBUGOUT("igb_attach: end");
-
+	printf("this driver has a patch from Jack Vogel\n");
 	return (0);

 err_late:
@@ -4131,6 +4131,7 @@ igb_rxeof(struct igb_queue *que, int count, int *done)
 		struct mbuf		*sendmp, *mh, *mp;
 		struct igb_rx_buf	*rxbuf;
 		u16			hlen, plen, hdr, vtag;
+		int			commit;
 		bool			eop = FALSE;

 		cur = &rxr->rx_base[i];
@@ -4255,10 +4256,23 @@ next_desc:
 		bus_dmamap_sync(rxr->rxdma.dma_tag, rxr->rxdma.dma_map,
 		    BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);

+		commit = i;	/* capture the old index */
+
 		/* Advance our pointers to the next descriptor. */
 		if (++i == adapter->num_rx_desc)
 			i = 0;
 		/*
+		** Sanity test for ring full, if this
+		** happens we need to refresh immediately
+		** or refresh may deadlock.
+		*/
+		if (i == rxr->next_to_refresh) {
+			igb_refresh_mbufs(rxr, commit);
+			printf("igb_refresh_mbufs called with commit %d\n", commit);
+			processed = 0;
+		}
+
+		/*
 		** Send to the stack or LRO
 		*/
 		if (sendmp != NULL) {

Here is the results:

# dmesg | grep Vogel
this driver has a patch from Jack Vogel
this driver has a patch from Jack Vogel

# netstat -m
60453/52707/113160 mbufs in use (current/cache/total)
48416/51584/10/10 mbuf clusters in use (current/cache/total/max)
2894/690 mbuf+clusters out of packet secondary zone in use (current/cache)
11946/854/12800/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
164834K/119760K/284595K bytes allocated to network (current/cache/total)
0/339/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/4/6656 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines
# dmesg | grep commit

At this point RX has hung.

Somehow the check (i == rxr->next_to_refresh) is never true in this case.
Also, I did read kern/150516 and couldn't wrap my head around the patch for
the em driver that Beezar Liu suggested.

Regards,

Karim.


Re: if_run in hostap mode: issue with stations in the power save mode

2011-02-08 Thread Alexander Zagrebin
Hi!

On 08.02.2011 10:52:53 +0100, Bernhard Schmidt wrote:

> I've combined both patches (see attachment), if I get an ACK from both 
> of you I'll try get this into the tree ASAP.

The resulting patch works fine for me.
Big thanks for your help!

Waiting for the 802.11n support... :)

-- 
Alexander Zagrebin


Re: if_run in hostap mode: issue with stations in the power save mode

2011-02-08 Thread PseudoCylon




- Original Message 
> From: Bernhard Schmidt 
> To: PseudoCylon ; Alexander Zagrebin 
>
> Cc: freebsd-net@freebsd.org
> Sent: Tue, February 8, 2011 2:52:53 AM
> Subject: Re: if_run in hostap mode: issue with stations in the power save mode
> 
> > 
> > The patch is attached. (diff to HEAD) Bit long, just because there  is
> > a couple of  new call back functions to avoid LOR.
> 
> Thank  you!
> 
> I've combined both patches (see attachment), if I get an ACK from  both 
> of you I'll try get this into the tree ASAP.
> 
> -- 
> Bernhard
> 

No objection from me.

Thanks
AK




Slow Intel 10GbE CX4 adapter behaviour

2011-02-08 Thread rihad
Hi, we're a medium-sized ISP that needs to pass all incoming user traffic 
through an Intel Server Systems FreeBSD PC and its dummynet pipes. Up 
until yesterday it had two 1 Gb em cards, one for input, one for output. 
As we were approaching the bandwidth limit, we swapped the cards for a 
two-port Intel 10GbE CX4 PCI-E adapter. With the then-used FreeBSD 7.2 
and its built-in ixgbe driver 1.7.3 (IIRC), internet access became very 
slow at a load of only about 300-400 mbps (~30-50 IP kpps). Also, there 
were many "IP fragmentation failed" errors (1-30 kpps in "systat -ip"). 
So I decided to source-upgrade the world to 8.3-RC3 (ixgbe 2.3.8). Late 
last night I didn't have much opportunity to test the newer FreeBSD 
under load, but from what we did see, the same slowness started at 
about 300-400 mbps load. There are no more "fragmentation failed" 
errors, and no evident drops as per "netstat -s | fgrep drop". Only the 
speed is slooow. Even the ssh console lags a bit. Both ix0 and ix1 are 
configured at their default settings.
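
As a quick sanity check on those figures, the average packet size implied by a bit rate and a packet rate can be worked out as below. This is only a back-of-the-envelope sketch: it assumes "mbps" means megabits per second and that the kpps figure covers the same traffic, and the helper name avg_packet_bytes() is made up.

```c
/*
 * Average packet size in bytes implied by a bit rate (Mbit/s)
 * and a packet rate (thousands of packets per second).
 */
static double
avg_packet_bytes(double mbps, double kpps)
{
	double bytes_per_sec = mbps * 1e6 / 8.0;	/* bits -> bytes */
	double pkts_per_sec = kpps * 1e3;

	return (bytes_per_sec / pkts_per_sec);
}
```

Under those assumptions, 300 mbps at 30 kpps works out to 1250 bytes per packet and 400 mbps at 50 kpps to 1000 bytes, so the load described is mostly large packets rather than a flood of small ones.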


Then I read something about the number of ixgbe device descriptors 
(hw.ixgbe.txd & hw.ixgbe.rxd) being set low at 256 by default, with up 
to 4096 permitted. But after some grepping through the source tree I saw 
that, contrary to what the old docs say, both already default to the 
PERFORM_* values:


/sys/dev/ixgbe/ixgbe.c:
/*
** Number of TX descriptors per ring,
** setting higher than RX as this seems
** the better performing choice.
*/
static int ixgbe_txd = PERFORM_TXD;
TUNABLE_INT("hw.ixgbe.txd", &ixgbe_txd);

/* Number of RX descriptors per ring */
static int ixgbe_rxd = PERFORM_RXD;
TUNABLE_INT("hw.ixgbe.rxd", &ixgbe_rxd);


/sys/dev/ixgbe/ixgbe.h:
/*
 * TxDescriptors Valid Range: 64-4096 Default Value: 256 This value is the
 * number of transmit descriptors allocated by the driver. Increasing this
 * value allows the driver to queue more transmits. Each descriptor is 16
 * bytes. Performance tests have show the 2K value to be optimal for top
 * performance.
 */
#define DEFAULT_TXD 1024
#define PERFORM_TXD 2048
#define MAX_TXD 4096
#define MIN_TXD 64
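
For what it's worth, here is a rough userland sketch of how a value taken from such a tunable could be range-checked against these limits, and how much memory one ring of descriptors takes. It is not copied from the driver: the helper names sanitize_txd() and ring_bytes(), the RING_ALIGN value, and the fallback to DEFAULT_TXD on a bad value are assumptions for illustration; the four limits and the 16-byte descriptor size come from the header excerpt above.

```c
#include <stddef.h>

#define DEFAULT_TXD	1024
#define PERFORM_TXD	2048
#define MAX_TXD		4096
#define MIN_TXD		64

#define DESC_SIZE	16	/* bytes per descriptor, per the header comment */
#define RING_ALIGN	128	/* assumed ring alignment in bytes (hypothetical) */

/* Memory needed for one ring holding 'ndesc' descriptors. */
static size_t
ring_bytes(int ndesc)
{
	return ((size_t)ndesc * DESC_SIZE);
}

/*
 * Accept 'requested' only if it lies within [MIN_TXD, MAX_TXD] and
 * yields a ring size that is a multiple of RING_ALIGN; otherwise
 * fall back to DEFAULT_TXD (the fallback policy is an assumption).
 */
static int
sanitize_txd(int requested)
{
	if (requested < MIN_TXD || requested > MAX_TXD ||
	    ring_bytes(requested) % RING_ALIGN != 0)
		return (DEFAULT_TXD);
	return (requested);
}
```

With these numbers a ring of PERFORM_TXD (2048) descriptors takes 32768 bytes, so even the maximum of 4096 costs only 64 KB per ring.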



So, here's my kernel config for your viewing pleasure:
include GENERIC

ident   SHAPER

nomakeoptions   DEBUG

nooptions   COMPAT_FREEBSD4 # Compatible with FreeBSD4
nooptions   COMPAT_FREEBSD5 # Compatible with FreeBSD5
nooptions   COMPAT_FREEBSD6 # Compatible with FreeBSD6
options COMPAT_FREEBSD7 # Compatible with FreeBSD7
nooptions   COMPAT_FREEBSD32 # Compatible with i386 binaries

nooptions   INET6   # IPv6 communications protocols
options ZERO_COPY_SOCKETS
# XXX 20091227: em(4) wants DEVICE_POLLING off for its fast-interrupts to work
#options DEVICE_POLLING
options IPFIREWALL  #firewall
options IPFIREWALL_DEFAULT_TO_ACCEPT #allow everything by default


Here's /etc/sysctl.conf:

net.inet.ip.fw.verbose=0

kern.ipc.shmall=65536
kern.ipc.shmmax=268435456
kern.ipc.semmap=1024
kern.ipc.nmbclusters=11

net.inet.ip.fastforwarding=1
net.inet.ip.dummynet.io_fast=1 #XXX no longer used in 8.3??
net.isr.direct=0
net.inet.ip.intr_queue_maxlen=5000

hw.intr_storm_threshold=9000
#dev.em.0.rx_processing_limit=-1 # device not used any more




Any tips? I'll be happy to try and add some more info upon request.


Thanks.


Re: Slow Intel 10GbE CX4 adapter behaviour

2011-02-08 Thread Nikolay Denev
On 9 Feb, 2011, at 07:29 , rihad wrote:

> Hi, we're a medium sized ISP that need to pass all incoming user traffic 
> through a Intel Server Systems FreeBSD PC and its dummynet pipes. Up until 
> yesterday it had two 1 gb em cards, one for input, one for output. As we were 
> approaching the bandwidth limitation we switched the cards for a two-port 
> Intel 10GbE CX4 PCI-E adapter. With the then used FreeBSD 7.2 and the 
> built-in FreeBSD ixgbe driver 1.7.3 (IIRC) it was very slow, and at only 
> about 300-400 mbps load (~30-50 IP kpps) the internet access was very slow. 
> Also, there were many "IP fragmentation failed" errors (1-30 kpps in "systat 
> -ip"). So I decided to source-upgrade the world to 8.3-RC3 (ixgbe 2.3.8). 
> Late in the night yesterday I didn't have enough opportunity to test the 
> newer FreeBSD under load, but from the time we did and I know, the same 
> slowness started happening at about 300-400 mbps load. There are no more 
> fragmentation failed errors. No evident drops as per "netstat -s | fgrep 
> drop". Only the speed is slooow. Even the ssh console lags a bit. Both ix0 
> and ix1 are configured at their default settings.
> 
> Then I read something about the number of ixgbe device descriptors 
> (hw.ixgbe.txd & hw.ixgbe.rxd) being set low at 256 by default, with up to 
> 4096 permittable. But after some grepping on the source tree I saw that 
> contrary to what the old docs say they are both set to an optimal value:
> 
> /sys/dev/ixgbe/ixgbe.c:
> /*
> ** Number of TX descriptors per ring,
> ** setting higher than RX as this seems
> ** the better performing choice.
> */
> static int ixgbe_txd = PERFORM_TXD;
> TUNABLE_INT("hw.ixgbe.txd", &ixgbe_txd);
> 
> /* Number of RX descriptors per ring */
> static int ixgbe_rxd = PERFORM_RXD;
> TUNABLE_INT("hw.ixgbe.rxd", &ixgbe_rxd);
> 
> 
> /sys/dev/ixgbe/ixgbe.h:
> /*
> * TxDescriptors Valid Range: 64-4096 Default Value: 256 This value is the
> * number of transmit descriptors allocated by the driver. Increasing this
> * value allows the driver to queue more transmits. Each descriptor is 16
> * bytes. Performance tests have show the 2K value to be optimal for top
> * performance.
> */
> #define DEFAULT_TXD 1024
> #define PERFORM_TXD 2048
> #define MAX_TXD 4096
> #define MIN_TXD 64
> 
> 
> 
> So, here's my kernel config for your viewing pleasure:
> include GENERIC
> 
> ident   SHAPER
> 
> nomakeoptions   DEBUG
> 
> nooptions   COMPAT_FREEBSD4 # Compatible with FreeBSD4
> nooptions   COMPAT_FREEBSD5 # Compatible with FreeBSD5
> nooptions   COMPAT_FREEBSD6 # Compatible with FreeBSD6
> options COMPAT_FREEBSD7 # Compatible with FreeBSD7
> nooptions   COMPAT_FREEBSD32 # Compatible with i386 binaries
> 
> nooptions   INET6   # IPv6 communications protocols
> options ZERO_COPY_SOCKETS
> # XXX 20091227: em(4) wants DEVICE_POLLING off for its fast-interrupts to work
> #options DEVICE_POLLING
> options IPFIREWALL  #firewall
> options IPFIREWALL_DEFAULT_TO_ACCEPT #allow everything by default
> 
> 
> Here's /etc/sysctl.conf:
> 
> net.inet.ip.fw.verbose=0
> 
> kern.ipc.shmall=65536
> kern.ipc.shmmax=268435456
> kern.ipc.semmap=1024
> kern.ipc.nmbclusters=11
> 
> net.inet.ip.fastforwarding=1
> net.inet.ip.dummynet.io_fast=1 #XXX no longer used in 8.3??
> net.isr.direct=0
> net.inet.ip.intr_queue_maxlen=5000
> 
> hw.intr_storm_threshold=9000
> #dev.em.0.rx_processing_limit=-1 # device not used any more
> 
> 
> 
> 
> Any tips? I'll be happy to try and add some more info upon request.
> 
> 
> Thanks.

I don't know if it's the same issue, but I had severe performance issues 
with ixgbe cards until I disabled LRO (ifconfig ix0 -lro). That was on 7.2 too.

Regards,
Nikolay


Re: igb driver RX (was TX) hangs when out of mbuf clusters

2011-02-08 Thread Jack Vogel
Hmmm, well so much for that theory :)

Jack


On Tue, Feb 8, 2011 at 4:06 PM, Karim Fodil-Lemelin <
fodillemlinka...@gmail.com> wrote:

>
>
> 2011/2/8 Jack Vogel 
>
>
>> I have been following this, and thinking about it. I still am working
>> from a theoretical standpoint, but based on a patch I got quite a long
>> time back and never quite grokked, I believe now that I might have a
>> solution.
>>
>> The original PR and patch was kern/150516 from Beezar Liu. I was never
>> quite comfortable with the code changes, nor convinced that it was a
>> real issue and not a misunderstanding. However I think now that this
>> very report might be behind what we are seeing today. I have a slightly
>> different approach to solving it; of course it remains to be seen if it
>> handles it properly.
>>
>> Please try the patch I've attached, I'm open to further correction or
>> polishing of the changes. And thanks to Beezar for his original report
>> and changes, this is not for em, but if this eliminates the problem
>> it's clearly needed in all drivers.
>>
>> Jack
>>
>
> Hi Jack,
>
> Thanks for your help. I tried your patch and it didn't work so I added a
> couple of printf to see if the added code was getting hit:
>
> --- a/freebsd/sys/dev/e1000/if_igb.c
> +++ b/freebsd/sys/dev/e1000/if_igb.c
> @@ -612,7 +612,7 @@ igb_attach(device_t dev)
> device_get_nameunit(dev));
>
> INIT_DEBUGOUT("igb_attach: end");
> -
> +   printf("this driver has a patch from Jack Vogel\n");
> return (0);
>
>  err_late:
> @@ -4131,6 +4131,7 @@ igb_rxeof(struct igb_queue *que, int count, int *done)
> struct mbuf *sendmp, *mh, *mp;
> struct igb_rx_buf   *rxbuf;
> u16 hlen, plen, hdr, vtag;
> +   int commit;
> bool eop = FALSE;
>
> cur = &rxr->rx_base[i];
> @@ -4255,10 +4256,23 @@ next_desc:
> bus_dmamap_sync(rxr->rxdma.dma_tag, rxr->rxdma.dma_map,
> BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
>
> +   commit = i; /* capture the old index */
> +
> /* Advance our pointers to the next descriptor. */
> if (++i == adapter->num_rx_desc)
> i = 0;
> /*
> +   ** Sanity test for ring full, if this
> +   ** happens we need to refresh immediately
> +   ** or refresh may deadlock.
> +   */
> +   if (i == rxr->next_to_refresh) {
> +   igb_refresh_mbufs(rxr, commit);
> +   printf("igb_refresh_mbufs called with commit %d\n", commit);
> +   processed = 0;
> +   }
> +
> +   /*
> ** Send to the stack or LRO
> */
> if (sendmp != NULL) {
>
> Here is the results:
>
> # dmesg | grep Vogel
> this driver has a patch from Jack Vogel
> this driver has a patch from Jack Vogel
>
> # netstat -m
> 60453/52707/113160 mbufs in use (current/cache/total)
> 48416/51584/10/10 mbuf clusters in use (current/cache/total/max)
> 2894/690 mbuf+clusters out of packet secondary zone in use (current/cache)
> 11946/854/12800/12800 4k (page size) jumbo clusters in use
> (current/cache/total/max)
> 0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
> 0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
> 164834K/119760K/284595K bytes allocated to network (current/cache/total)
> 0/339/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
> 0/4/6656 sfbufs in use (current/peak/max)
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> 0 calls to protocol drain routines
> # dmesg | grep commit
>
> At this point RX has hung.
>
> Somehow the check (i == rxr->next_to_refresh) is never true in this case.
> Also, I did read kern/150516 and couldn't wrap my head around the patch for
> the em driver that Beezar Liu suggested.
>
> Regards,
>
> Karim.
>
>