[PATCH] BPF locking redesign

2012-03-26 Thread Alexander V. Chernikov

Hello list!

There are some patches that can significantly improve forwarding 
performance when BPF consumers are present. However, some of the changes 
are a bit hackish and an ABI change is required. They are split into 
separate patches that I want to discuss.


You probably need to merge r233505 for patches to work.

Description: bpf_rwlock

This one is simple and straightforward: we convert the interface and 
descriptor locks to rwlock(9).


Additionally, the filter (descriptor) reader lock in bpf_mtap[2] is 
removed. This was suggested by glebius@. We protect the filter by taking 
the interface writer lock on filter change.


This greatly improves performance: in the most common case we need to 
acquire 1 reader lock instead of 2 mutexes.


The bpf_if structure is now covered by the BPF_INTERNAL define. This 
permits including bpf.h without including the rwlock stuff. However, this 
is a temporary solution; struct bpf_if should be made opaque to any 
external caller.



Description: bpf_writers

Linux and Solaris (at least OpenSolaris) have PF_PACKET socket families 
for sending raw Ethernet frames. The only FreeBSD interface that can be 
used to send raw frames is BPF. As a result, many programs such as cdpd, 
lldpd and various DHCP software use BPF only to send data. This leads to 
a situation where software like cdpd, running on a high-traffic 
interface, significantly reduces overall performance, since we have to 
acquire additional locks for every packet.

Here we add a sysctl that changes BPF behavior in the following way: 
if a program opens a BPF socket without explicitly specifying a read 
filter, we assume it to be write-only and add it to a special 
writer-only per-interface list. This makes bpf_peers_present() return 0, 
so no additional overhead is introduced. After a filter is supplied, the 
descriptor is added to the original per-interface list, permitting 
packets to be captured.
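
A rough model of the two per-interface lists, as a sketch only: plain counters stand in for the lists, and every name here (sketch_if, sketch_desc, desc_attach and so on) is hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counters stand in for the reader list and the writer-only list. */
struct sketch_if {
	int readers;	/* descriptors that capture packets */
	int writers;	/* descriptors assumed to be write-only */
};

struct sketch_desc {
	struct sketch_if *ifp;
	bool writer_only;
};

/* Stands in for the sysctl; assumed enabled for this sketch. */
bool optimize_writers = true;

/* A new descriptor has no read filter yet, so it starts as a writer. */
void desc_attach(struct sketch_desc *d, struct sketch_if *ifp)
{
	assert(d != NULL && ifp != NULL);
	d->ifp = ifp;
	d->writer_only = optimize_writers;
	if (d->writer_only)
		ifp->writers++;
	else
		ifp->readers++;
}

/* Supplying a read filter moves the descriptor to the reader list. */
void desc_set_read_filter(struct sketch_desc *d)
{
	if (!d->writer_only)
		return;
	d->ifp->writers--;
	d->ifp->readers++;
	d->writer_only = false;
}

/* The hot-path check only looks at the reader list. */
bool peers_present(const struct sketch_if *ifp)
{
	return ifp->readers > 0;
}
```

With only writers attached, peers_present() stays false, so the forwarding path never enters the tap code.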

Unfortunately, pcap_open_live() sets a catch-all filter itself for the 
purpose of setting the snap length. Fortunately, most programs explicitly 
set a filter (even a catch-all one) after that; tcpdump(1) is a good 
example.

So a slightly hackish approach is taken: we upgrade the descriptor only 
after the second BIOCSETF is received.

The sysctl is named net.bpf.optimize_writers and is turned off by default.
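
The "second BIOCSETF" rule can be sketched as a small counter. This models only the case where the sysctl is enabled; the structure and function names are made up for illustration and the real patch may track this differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-descriptor state for the upgrade heuristic. */
struct setf_state {
	unsigned setf_calls;	/* BIOCSETF requests seen so far */
	bool capturing;		/* already on the reader list? */
};

/*
 * Called for each BIOCSETF.  Returns true only for the call that moves
 * the descriptor to the reader list.  The first call is ignored because
 * pcap_open_live() issues one on its own.
 */
bool on_biocsetf(struct setf_state *s)
{
	assert(s != NULL);
	if (s->capturing)
		return false;
	s->setf_calls++;
	if (s->setf_calls >= 2) {
		s->capturing = true;
		return true;
	}
	return false;
}
```

A program that only writes and never sets a filter explicitly stays at one call (from pcap_open_live()) or zero, and therefore never becomes a reader.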


Description: bpf_if_opaque.

No patch at the moment. We can probably do the following:

Create the bpf_if structure in bpfattach2, but do not assign it to the 
supplied pointer. Instead we put the new structure, the supplied pointer 
and the interface pointer into some kind of hash (with the interface name 
as key). When a _reader_ comes AND sets a valid filter, we check whether 
we need to attach the bpfif to the interface.

When the reader disconnects, we set the bpfif pointer back to zero.
No additional locking is required here: the same struct bpfif lives as 
long as the interface exists, so the pointer will be either NULL or a 
pointer to the structure at any point in time. Even if some thread on 
another CPU sees a non-coherent value, there is no problem: NULL means no 
filter is set, so we skip BPF; non-NULL starts BPF_MTAP, which acquires 
the rlock and determines that no peers are present.


As a result, bpf_peers_present(x) simply returns (x).
There is no need to expose struct bpfif to external users.
Additionally, we no longer need to acquire the interface write lock on 
every interface attach/departure, which can potentially lead to a small 
number of packets being dropped on mpd servers.
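
A userspace sketch of the pointer-only check follows. It uses C11 atomics so that the example has defined behaviour in portable C; that is a choice made for the sketch, not something the proposal requires, and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Opaque to drivers in the proposal; contents do not matter here. */
struct sketch_bpfif {
	int unused;
};

/* Hypothetical interface structure holding only the pointer. */
struct sketch_ifnet {
	_Atomic(struct sketch_bpfif *) if_bpf;
};

void sketch_ifnet_init(struct sketch_ifnet *ifp)
{
	atomic_init(&ifp->if_bpf, (struct sketch_bpfif *)NULL);
}

/* The check reduces to "is the pointer non-NULL". */
int sketch_peers_present(struct sketch_ifnet *ifp)
{
	return atomic_load(&ifp->if_bpf) != NULL;
}

/* A reader with a valid filter publishes the long-lived structure. */
void sketch_reader_attach(struct sketch_ifnet *ifp, struct sketch_bpfif *bif)
{
	assert(bif != NULL);
	atomic_store(&ifp->if_bpf, bif);
}

/* The last reader leaving resets the pointer to NULL. */
void sketch_reader_detach(struct sketch_ifnet *ifp)
{
	atomic_store(&ifp->if_bpf, (struct sketch_bpfif *)NULL);
}
```

Because the structure outlives every reader, a stale non-NULL read only costs one pass through the tap path, which then finds no peers.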


Btw, we can consider changing rlock / wlock to _try_ one to avoid this 
behavior.


The major drawback of this approach is a totally broken ABI.
Any pre-compiled network driver carries an inlined bpf_peers_present() 
which assumes ifp->if_bpf to be non-NULL.


However, we can introduce bpfattach3() or some special interface flag 
(set by if_alloc(), for example) indicating that the driver uses the new 
API, and retain the original behavior for old drivers.




--
WBR, Alexander
From 2ca09cf74ef63fbde0ccede4a7e883c6f0add51d Mon Sep 17 00:00:00 2001
From: "Alexander V. Chernikov" 
Date: Mon, 26 Mar 2012 14:58:13 +0400
Subject: [PATCH 2/2] Optimize BPF writers

---
 share/man/man4/bpf.4 |   31 +---
 sys/net/bpf.c        |   97 ++
 sys/net/bpf.h        |    1 +
 sys/net/bpfdesc.h    |    1 +
 4 files changed, 120 insertions(+), 10 deletions(-)

diff --git a/share/man/man4/bpf.4 b/share/man/man4/bpf.4
index b9d6742..44bed39 100644
--- a/share/man/man4/bpf.4
+++ b/share/man/man4/bpf.4
@@ -952,10 +952,33 @@ array initializers:
 .Fn BPF_STMT opcode operand
 and
 .Fn BPF_JUMP opcode operand true_offset false_offset .
-.Sh FILES
-.Bl -tag -compact -width /dev/bpf
-.It Pa /dev/bpf
-the packet filter device
+.Sh SYSCTL VARIABLES
+A set of
+.Xr sysctl 8
+variables controls the behaviour of the
+.Nm
+subsystem
+.Bl -tag -width indent
+.It Va net.bpf.optimize_writers: No 0
+Various programs use BPF to send (but not receive) raw packets
+(cdpd, lldpd, dhcpd, dhcp relays, etc. are good examples of such programs).
+They do not need incoming packets to be send 

Current problem reports assigned to freebsd-net@FreeBSD.org

2012-03-26 Thread FreeBSD bugmaster
Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users.
These represent problem reports covering all versions including
experimental development code and obsolete releases.


S Tracker      Resp.      Description

o kern/166372  net        [patch] ipfilter drops UDP packets with zero checksum
o kern/166285  net        [arp] FreeBSD v8.1 REL p8 arp: unknown hardware addres
o kern/166255  net        [net] [patch] It should be possible to disable "promis
o kern/165963  net        [panic] [ipf] ipfilter/nat NULL pointer deference
o kern/165903  net        mbuf leak
o kern/165863  net        [panic] [netinet] [patch] in_lltable_prefix_free() rac
o kern/165643  net        [net] [patch] Missing vnet restores in net/if_ethersub
o kern/165622  net        [ndis][panic][patch] Unregistered use of FPU in kernel
s kern/165562  net        [request] add support for Intel i350 in FreeBSD 7.4
o kern/165526  net        [bxe] UDP packets checksum calculation whithin if_bxe
o kern/165488  net        [ppp] [panic] Fatal trap 12 jails and ppp , kernel wit
o kern/165305  net        [ip6] [request] Feature parity between IP_TOS and IPV6
o kern/165296  net        [vlan] [patch] Fix EVL_APPLY_VLID, update EVL_APPLY_PR
o kern/165181  net        [igb] igb freezes after about 2 weeks of uptime
o kern/165174  net        [patch] [tap] allow tap(4) to keep its address on clos
o kern/165152  net        [ip6] Does not work through the issue of ipv6 addresse
o kern/164569  net        [msk] [hang] msk network driver cause freeze in FreeBS
o kern/164495  net        [igb] connect double head igb to switch cause system t
o kern/164490  net        [pfil] Incorrect IP checksum on pfil pass from ip_outp
o kern/164475  net        [gre] gre misses RUNNING flag after a reboot
o kern/164400  net        [ipsec] immediate crash after the start of ipsec proce
o kern/164265  net        [netinet] [patch] tcp_lro_rx computes wrong checksum i
o kern/163903  net        [igb] "igb0:tx(0)","bpf interface lock" v2.2.5 9-STABL
o kern/163481  net        freebsd do not add itself to ping route packet
o kern/162927  net        [tun] Modem-PPP error ppp[1538]: tun0: Phase: Clearing
o kern/162926  net        [ipfilter] Infinite loop in ipfilter with fragmented I
o kern/162558  net        [dummynet] [panic] seldom dummynet panics
o kern/162509  net        [re] [panic] Kernel panic may be related to if_re.c (r
o kern/162153  net        [em] intel em driver 7.2.4 don't compile
o kern/162110  net        [igb] [panic] RELENG_9 panics on boot in IGB driver -
o kern/162028  net        [ixgbe] [patch] misplaced #endif in ixgbe.c
o kern/161381  net        [re] RTL8169SC - re0: PHY write failed
o kern/161277  net        [em] [patch] BMC cannot receive IPMI traffic after loa
o kern/160873  net        [igb] igb(4) from HEAD fails to build on 7-STABLE
o kern/160750  net        Intel PRO/1000 connection breaks under load until rebo
o kern/160693  net        [gif] [em] Multicast packet are not passed from GIF0 t
o kern/160420  net        [msk] phy write timeout on HP 5310m
o kern/160293  net        [ieee80211] ppanic] kernel panic during network setup
o kern/160206  net        [gif] gifX stops working after a while (IPv6 tunnel)
o kern/159817  net        [udp] write UDPv4: No buffer space available (code=55)
o kern/159629  net        [ipsec] [panic] kernel panic with IPsec in transport m
o kern/159621  net        [tcp] [panic] panic: soabort: so_count
o kern/159603  net        [netinet] [patch] in_ifscrubprefix() - network route c
o kern/159601  net        [netinet] [patch] in_scrubprefix() - loopback route re
o kern/159294  net        [em] em watchdog timeouts
o kern/159203  net        [wpi] Intel 3945ABG Wireless LAN not support IBSS
o kern/158930  net        [bpf] BPF element leak in ifp->bpf_if->bif_dlist
o kern/158726  net        [ip6] [patch] ICMPv6 Router Announcement flooding limi
o kern/158694  net        [ix] [lagg] ix0 is not working within lagg(4)
o kern/158665  net        [ip6] [panic] kernel pagefault in in6_setscope()
o kern/158635  net        [em] TSO breaks BPF packet captures with em driver
f kern/157802  net        [dummynet] [panic] kernel panic in dummynet
o kern/157785  net        amd64 + jail + ipfw + natd = very slow outbound traffi
o kern/157429  net        [re] Realtek RTL8169 doesn't work with re(4)
o kern/157418  net        [em] em driver lockup during boot on Supermicro X9SCM-
o kern/157410  net        [ip6] IPv6 Router Advertisements Cause Excessive CPU U
o kern/157287  net        [re] [panic] INVARIANTS panic (Memory modified after f
o kern/157209  net        [ip6] [patch] locking error in rip6_input() (sys/netin
o kern/157200  net        [network.subr] [patch] stf(4) can not communicate betw
o kern/157182  net        [lagg] lagg interface not working together

The problem with MTU <1500

2012-03-26 Thread Владислав Продан

Dear all!
I have two uplinks with different MTU values, 1440 and 1492,
and the local network uses 1500.
On Cisco routers one can use 'set ip df 0', but I have no such router :(
What options are there for FreeBSD?
Is net/tcpmssd the only one?
Will it be able to handle both 1Gb channels?


-- 
Vladislav V. Prodan
System & Network Administrator 
http://support.od.ua   
+380 67 4584408, +380 99 4060508
VVP88-RIPE
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: The problem with MTU <1500

2012-03-26 Thread Michael Sierchio
2012/3/26 Владислав Продан 

>
> Dear all!
> I have two different values of uplink MTU - 1440 and 1492.
> And in a local network - 1500.
> In Cisco routers can be set 'set ip df 0', but I have no such router :(
> What options are there for FreeBSD?
>

man ifconfig - note the 'mtu' option.  You should set the MTU size on the
interface to match the uplink MTU.  And you should make sure you handle
ICMP need-frag error messages, otherwise Path MTU discovery will be broken.
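
For reference, the MSS values that correspond to the MTUs in this thread can be worked out as below. This is a sketch assuming plain IPv4 and TCP headers of 20 bytes each with no options; the function name is made up for illustration.

```c
#include <assert.h>

/* Minimum header sizes, assuming no IP or TCP options. */
enum {
	IPV4_HDR_LEN = 20,
	TCP_HDR_LEN = 20
};

/* Largest TCP payload that fits in one unfragmented IPv4 packet. */
int tcp_mss_for_mtu(int mtu)
{
	assert(mtu > IPV4_HDR_LEN + TCP_HDR_LEN);
	return mtu - IPV4_HDR_LEN - TCP_HDR_LEN;
}
```

Under those assumptions, tcp_mss_for_mtu(1440) gives 1400 and tcp_mss_for_mtu(1492) gives 1452, compared with 1460 for the 1500-byte LAN.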

> Only use the net/tcpmssd?

That's a workaround for not properly handling ICMP error messages, and just
for TCP ;-)

> Will it be possible to process both two channels of 1Gb?

Never. ;-)  Not in practice.  What is the bandwidth-delay product?
That's a fraction of the 1Gbps you'll never get.
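
To put a number on the bandwidth-delay product, here is a small calculation sketch. The 20 ms round-trip time and the 65535-byte window used in the note below are assumptions chosen for illustration; neither figure comes from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes that must be in flight to keep a link of this speed full. */
uint64_t bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_usec)
{
	return bits_per_sec / 8 * rtt_usec / 1000000;
}

/* Throughput ceiling for a single connection with a fixed window. */
uint64_t window_limited_bps(uint64_t window_bytes, uint64_t rtt_usec)
{
	assert(rtt_usec > 0);
	return window_bytes * 8 * 1000000 / rtt_usec;
}
```

With those assumed figures, bdp_bytes(1000000000, 20000) is 2500000 bytes, while a single connection limited to a 65535-byte window tops out at window_limited_bps(65535, 20000), about 26 Mbit/s.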

- M


Re: Intel 82550 Pro/100 Ethernet and Microcode

2012-03-26 Thread Andreas Longwitz
YongHyeon PYUN wrote:
> 
> I've attached a patch which will show both compatibility and EEPROM
> ID word and allowed loading microcode for 82550 family.  Let me
> know whether this patch works for you.

Yes, the patch works fine and I have for my LOM's (rev 0x0d):

fxp0:  port 
fxp0: Compatibility word : 0X0d13
fxp0: EEPROM ID word : 0X5060

and for two external cards (rev 0x0c)  I see

fxp1:  port ...
fxp1: Compatibility word : 0X040b
fxp1: EEPROM ID word : 0X58a0
fxp2:  port ...
fxp2: Compatibility word : 0X020b
fxp2: EEPROM ID word : 0X50a0

> I vaguely guess there are
> differences between LOM and non-LOM implementation(Mine is
> stand-alone PCI NIC).  I would be able to selectively allow
> microcode loading after getting compatibility/EEPROM ID word.

Thank you for investigating this problem.

Regards

Andreas Longwitz



Re: 9-STABLE + Infiniband - incorrect interface counters

2012-03-26 Thread John Baldwin
On Saturday, March 24, 2012 7:52:08 am Alex Tutubalin wrote:
> Hi,
> 
> I'm playing with two FreeBSD 9-STABLE boxes connected via 10Gbps 
> Infiniband (more details below) in Infiniband connected mode.
> 
> I see incorrect interface statistics (e.g. in netstat output), output 
> counters are 2x more than expected.
> 
> EXAMPLE, ftp transfer of 1 GiB file:
> 
> ftp> put file /dev/null
> local: file remote: /dev/null
> 229 Entering Extended Passive Mode (|||57978|)
> 150 Opening BINARY mode data connection for '/dev/null'.
> 100% |***|   953 MiB  390.43 MiB/s
> 00:00 ETA
> 226 Transfer complete.
> 10 bytes sent in 00:02 (390.13 MiB/s)
> 
> Netstat on receiving side, counters are correct (for input):
> 
> lexa@home-gw:/home/lexa# netstat -I ib1 5
>             input          (ib1)           output
>    packets  errs idrops      bytes    packets  errs      bytes colls
>          0     0      0          0          0     0          0     0
>      13955     0      0  222688126       9027     0    1192796     0
>      48921     0      0  780832960      32129     0    4240596     0
>          0     0      0          0          0     0         80     0
> 
> Sum of bytes (input) is 1003521086, as expected.
> 
> Netstat on sending side, output is 2x more:
> 
> lexa@new-gw:/home/lexa# netstat -I ib0 5
>             input          (ib0)           output
>    packets  errs idrops      bytes    packets  errs      bytes colls
>          1     0      0        100          0     0          0     0
>      41162     0      0    2305210      62878     0 2008325984     0
>          1     0      0        100          0     0          0     0
> 
> It looks like packet count is correct (13955+48921=62876, two packets 
> missed somewhere), while byte count is exact 2x more.

Yes, this is a bug.  if_obytes already gets incremented in IFQ_HANDOFF(), so 
the IB code doesn't need to do it again.  Try the patch at 
www.freebsd.org/~jhb/patches/ipoib_obytes.patch

I can't speak to the MTU issue though.

-- 
John Baldwin


Re: The problem with MTU <1500

2012-03-26 Thread Вадим Уразаев
You may use this patch to ipfw:
http://www.freebsd.org/cgi/query-pr.cgi?pr=103454
Or you may use the ng_patch netgraph node (ng_patch(4) should give you
some examples).
The simplest way to do what you want is a pf rule:
 scrub in on em0 no-df

You should check the performance of all those methods yourself.


Re: The problem with MTU <1500

2012-03-26 Thread Andrey Zonov

On 26.03.2012 17:43, Владислав Продан wrote:
> Dear all!
> I have two different values of uplink MTU - 1440 and 1492.
> And in a local network - 1500.
> In Cisco routers can be set 'set ip df 0', but I have no such router :(
> What options are there for FreeBSD?

sysctl net.inet.tcp.path_mtu_discovery=0

> Only use the net/tcpmssd?
> Will it be possible to process both two channels of 1Gb?


--
Andrey Zonov


Re: The problem with MTU <1500

2012-03-26 Thread Andrey Zonov

On 27.03.2012 0:06, Andrey Zonov wrote:
> On 26.03.2012 17:43, Владислав Продан wrote:
>> Dear all!
>> I have two different values of uplink MTU - 1440 and 1492.
>> And in a local network - 1500.
>> In Cisco routers can be set 'set ip df 0', but I have no such router :(
>> What options are there for FreeBSD?
>
> sysctl net.inet.tcp.path_mtu_discovery=0

Ooops, please ignore this.

>> Only use the net/tcpmssd?
>> Will it be possible to process both two channels of 1Gb?





--
Andrey Zonov


Re: Intel 82550 Pro/100 Ethernet and Microcode

2012-03-26 Thread YongHyeon PYUN
On Mon, Mar 26, 2012 at 04:37:45PM +0200, Andreas Longwitz wrote:
> YongHyeon PYUN wrote:
> > 
> > I've attached a patch which will show both compatibility and EEPROM
> > ID word and allowed loading microcode for 82550 family.  Let me
> > know whether this patch works for you.
> 
> Yes, the patch works fine and I have for my LOM's (rev 0x0d):
> 
> fxp0:  port 
> fxp0: Compatibility word : 0X0d13
> fxp0: EEPROM ID word : 0X5060
> 
> and for two external cards (rev 0x0c)  I see
> 
> fxp1:  port ...
> fxp1: Compatibility word : 0X040b
> fxp1: EEPROM ID word : 0X58a0
> fxp2:  port ...
> fxp2: Compatibility word : 0X020b
> fxp2: EEPROM ID word : 0X50a0
> 

Thanks a lot! Here is the final version. The patch checks whether
the controller is an i82550C with server extension.  If the driver sees
a server NIC like yours, it allows microcode loading.
I guess later 82550C controllers fixed the bug, since mine has no
problems handling fragmented UDP datagrams.
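
As a sketch of the check, applied to the values reported earlier in this thread: the code below assumes that the EEPROM word 3 read by the patch is the "Compatibility word" printed by the earlier debugging patch, and that revision 0x0d is the 82550C as in the device table. The SKETCH_* names and the function are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_REV_82550_C	0x0d	/* assumed from the device table */
#define SKETCH_SERVER_BIT	0x0400	/* mask tested by the patch */

/* True when the patch would set the "no microcode" flag. */
bool sketch_no_ucode(uint8_t revision, uint16_t compat_word)
{
	if (revision != SKETCH_REV_82550_C)
		return false;	/* the new check applies only to the 82550C */
	return (compat_word & SKETCH_SERVER_BIT) == 0;
}

/* Values reported in this thread, plus one made-up client-only word. */
void sketch_check_reported_values(void)
{
	assert(!sketch_no_ucode(0x0d, 0x0d13));	/* fxp0, LOM */
	assert(!sketch_no_ucode(0x0c, 0x040b));	/* fxp1, not an 82550C */
	assert(!sketch_no_ucode(0x0c, 0x020b));	/* fxp2, not an 82550C */
	assert(sketch_no_ucode(0x0d, 0x0913));	/* hypothetical client-only */
}
```

Under that assumption, fxp0's word 0x0d13 has bit 0x0400 set, so microcode loading stays enabled for the LOM.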

> > I vaguely guess there are
> > differences between LOM and non-LOM implementation(Mine is
> > stand-alone PCI NIC).  I would be able to selectively allow
> > microcode loading after getting compatibility/EEPROM ID word.
> 
> Thank you for investigating this problem.
> 
> Regards
> 
> Andreas Longwitz
> 
Index: sys/dev/fxp/if_fxpvar.h
===================================================================
--- sys/dev/fxp/if_fxpvar.h (revision 233418)
+++ sys/dev/fxp/if_fxpvar.h (working copy)
@@ -236,6 +236,7 @@
 #define FXP_FLAG_WOLCAP         0x2000  /* WOL capability */
 #define FXP_FLAG_WOL            0x4000  /* WOL active */
 #define FXP_FLAG_RXBUG          0x8000  /* Rx lock-up bug */
+#define FXP_FLAG_NO_UCODE       0x10000 /* ucode is not applicable */
 
 /* Macros to ease CSR access. */
 #define CSR_READ_1(sc, reg)     bus_read_1(sc->fxp_res[0], reg)
Index: sys/dev/fxp/if_fxp.c
===================================================================
--- sys/dev/fxp/if_fxp.c (revision 233418)
+++ sys/dev/fxp/if_fxp.c (working copy)
@@ -194,7 +194,7 @@
 { 0x1229,  0x08,   0, "Intel 82559 Pro/100 Ethernet" },
 { 0x1229,  0x09,   0, "Intel 82559ER Pro/100 Ethernet" },
 { 0x1229,  0x0c,   0, "Intel 82550 Pro/100 Ethernet" },
-{ 0x1229,  0x0d,   0, "Intel 82550 Pro/100 Ethernet" },
+{ 0x1229,  0x0d,   0, "Intel 82550C Pro/100 Ethernet" },
 { 0x1229,  0x0e,   0, "Intel 82550 Pro/100 Ethernet" },
 { 0x1229,  0x0f,   0, "Intel 82551 Pro/100 Ethernet" },
 { 0x1229,  0x10,   0, "Intel 82551 Pro/100 Ethernet" },
@@ -525,6 +525,18 @@
                         sc->flags |= FXP_FLAG_WOLCAP;
         }
 
+        if (sc->revision == FXP_REV_82550_C) {
+                /*
+                 * 82550C with server extension requires microcode to
+                 * receive fragmented UDP datagrams.  However if the
+                 * microcode is used for client-only featured 82550C
+                 * it locks up controller.
+                 */
+                fxp_read_eeprom(sc, &data, 3, 1);
+                if ((data & 0x0400) == 0)
+                        sc->flags |= FXP_FLAG_NO_UCODE;
+        }
+
         /* Receiver lock-up workaround detection. */
         if (sc->revision < FXP_REV_82558_A4) {
                 fxp_read_eeprom(sc, &data, 3, 1);
@@ -3014,10 +3026,8 @@
 static uint32_t fxp_ucode_d101b0[] = D101_B0_RCVBUNDLE_UCODE;
 static uint32_t fxp_ucode_d101ma[] = D101M_B_RCVBUNDLE_UCODE;
 static uint32_t fxp_ucode_d101s[] = D101S_RCVBUNDLE_UCODE;
-#ifdef notyet
 static uint32_t fxp_ucode_d102[] = D102_B_RCVBUNDLE_UCODE;
 static uint32_t fxp_ucode_d102c[] = D102_C_RCVBUNDLE_UCODE;
-#endif
 static uint32_t fxp_ucode_d102e[] = D102_E_RCVBUNDLE_UCODE;
 
 #define UCODE(x)   x, sizeof(x)/sizeof(uint32_t)
@@ -3035,12 +3045,10 @@
D101M_CPUSAVER_DWORD, D101M_CPUSAVER_BUNDLE_MAX_DWORD },
{ FXP_REV_82559S_A, UCODE(fxp_ucode_d101s),
D101S_CPUSAVER_DWORD, D101S_CPUSAVER_BUNDLE_MAX_DWORD },
-#ifdef notyet
{ FXP_REV_82550, UCODE(fxp_ucode_d102),
D102_B_CPUSAVER_DWORD, D102_B_CPUSAVER_BUNDLE_MAX_DWORD },
{ FXP_REV_82550_C, UCODE(fxp_ucode_d102c),
D102_C_CPUSAVER_DWORD, D102_C_CPUSAVER_BUNDLE_MAX_DWORD },
-#endif
{ FXP_REV_82551_F, UCODE(fxp_ucode_d102e),
D102_E_CPUSAVER_DWORD, D102_E_CPUSAVER_BUNDLE_MAX_DWORD },
{ FXP_REV_82551_10, UCODE(fxp_ucode_d102e),
@@ -3055,6 +3063,9 @@
         struct fxp_cb_ucode *cbp;
         int i;
 
+        if (sc->flags & FXP_FLAG_NO_UCODE)
+                return;
+
         for (uc = ucode_table; uc->ucode != NULL; uc++)
                 if (sc->revision == uc->revision)
                         break;
@@ -3087,6 +3098,7 @@
sc->tunable_int_delay,
uc->bundle_max_offset == 0 ? 0 : sc->tunable_bundle_max);
sc->flags |= FXP_FLAG_UCODE;
+   bzero(cbp, FXP_TXCB_SZ);
 }
 
 #define FXP_SYSCTL_STAT_ADD(c, h, n, p, d) \