[PATCH] BPF locking redesign
Hello list!

Here are some patches that can significantly improve forwarding performance when BPF consumers are present. However, some of the changes are a bit hackish and one requires an ABI change, so they are split into separate patches that I want to discuss. You will probably need to merge r233505 for the patches to work.

Description: bpf_rwlock

This one is simple and straightforward: we convert the interface and descriptor locks to rwlock(9). Additionally, the filter (descriptor) reader lock in bpf_mtap[2] is removed, as suggested by glebius@. We protect the filter by requesting the interface writer lock on filter change. This greatly improves performance: in the most common case we need to acquire 1 reader lock instead of 2 mutexes.

The bpf_if structure is now covered by the BPF_INTERNAL define. This permits including bpf.h without including the rwlock stuff. However, this is a temporary solution; struct bpf_if should be made opaque to any external caller.

Description: bpf_writers

Linux and Solaris (at least OpenSolaris) have PF_PACKET socket families for sending raw Ethernet frames. The only FreeBSD interface that can be used to send raw frames is BPF. As a result, many programs such as cdpd, lldpd and various DHCP software use BPF only to send data. This means that software like cdpd, when run on a high-traffic interface, significantly reduces overall performance, since we have to acquire additional locks for every packet.

Here we add a sysctl that changes BPF behavior in the following way: if a program opens a BPF socket without explicitly specifying a read filter, we assume it to be write-only and add it to a special writer-only per-interface list. This makes bpf_peers_present() return 0, so no additional overhead is introduced. After a filter is supplied, the descriptor is added to the original per-interface list, permitting packets to be captured.

Unfortunately, pcap_open_live() sets a catch-all filter itself for the purpose of setting the snap length. Fortunately, most programs explicitly set a filter (even a catch-all one) after that; tcpdump(1) is a good example. So a slightly hackish approach is taken: we upgrade the descriptor only after the second BIOCSETF is received.

The sysctl is named net.bpf.optimize_writers and is turned off by default.

Description: bpf_if_opaque

No patch at the moment. We can probably do the following: create the bpf_if structure in bpfattach2(), but do not assign it to the supplied pointer. Instead, we put the new structure, the pointer and the interface pointer into some kind of hash (with the interface name as key). When a _reader_ comes AND sets a valid filter, we check whether we need to attach the bpf_if to the interface. When the reader disconnects, we set the bpf_if pointer back to zero.

No additional locking is required here: the same struct bpf_if lives as long as the interface exists, so the pointer will be either NULL or a pointer to the structure at any point in time. Even if some thread on another CPU sees a non-coherent value, there is no problem: NULL means no filter is set, so we skip BPF; non-NULL starts BPF_MTAP, which acquires the rlock and determines that no peers are present.

As a result, bpf_peers_present(x) simply returns (x), and there is no need to expose struct bpf_if to external users. Additionally, we no longer request the interface write lock on every interface attach/departure, which can potentially lead to a small number of packets being dropped on mpd servers. Btw, we can consider changing rlock / wlock to the _try_ variants to avoid this behavior.

The major drawback of this approach is a totally broken ABI. Any pre-compiled network driver carries an inlined bpf_peers_present(), which assumes ifp->if_bpf to be non-NULL. However, we can introduce bpfattach3() or some special interface flag (set by if_alloc(), for example) indicating that the driver uses the new API, and retain the original behavior for old drivers.

--
WBR, Alexander

From 2ca09cf74ef63fbde0ccede4a7e883c6f0add51d Mon Sep 17 00:00:00 2001
From: "Alexander V. Chernikov"
Date: Mon, 26 Mar 2012 14:58:13 +0400
Subject: [PATCH 2/2] Optimize BPF writers

---
 share/man/man4/bpf.4 | 31 +---
 sys/net/bpf.c        | 97 ++
 sys/net/bpf.h        |  1 +
 sys/net/bpfdesc.h    |  1 +
 4 files changed, 120 insertions(+), 10 deletions(-)

diff --git a/share/man/man4/bpf.4 b/share/man/man4/bpf.4
index b9d6742..44bed39 100644
--- a/share/man/man4/bpf.4
+++ b/share/man/man4/bpf.4
@@ -952,10 +952,33 @@ array initializers:
 .Fn BPF_STMT opcode operand
 and
 .Fn BPF_JUMP opcode operand true_offset false_offset .
-.Sh FILES
-.Bl -tag -compact -width /dev/bpf
-.It Pa /dev/bpf
-the packet filter device
+.Sh SYSCTL VARIABLES
+A set of
+.Xr sysctl 8
+variables controls the behaviour of the
+.Nm
+subsystem
+.Bl -tag -width indent
+.It Va net.bpf.optimize_writers: No 0
+Various programs use BPF to send (but not receive) raw packets
+(cdpd, lldpd, dhcpd, dhcp relays, etc. are good examples of such programs).
+They do not need incoming packets to be send
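The writer-only list behaviour described under "bpf_writers" can be sketched as a small user-space model. This is only an illustration of the idea, assuming the rules stated in the message (a new descriptor starts on the writer-only list when the sysctl is on, and is moved to the regular list on the second BIOCSETF). All `toy_*` names are hypothetical and are not taken from sys/net/bpf.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; not the kernel's struct bpf_if / struct bpf_d. */
struct toy_bpf_if {
    int readers;  /* descriptors on the regular per-interface list */
    int writers;  /* descriptors on the writer-only list */
};

struct toy_bpf_d {
    struct toy_bpf_if *bif;
    int setf_count;  /* number of BIOCSETF requests seen so far */
    bool is_reader;
};

/* Only descriptors on the regular list count as peers. */
static bool toy_peers_present(const struct toy_bpf_if *bif)
{
    return bif->readers > 0;
}

/* With the optimization on, a freshly attached descriptor is write-only. */
static void toy_attach(struct toy_bpf_d *d, struct toy_bpf_if *bif,
                       bool optimize_writers)
{
    d->bif = bif;
    d->setf_count = 0;
    d->is_reader = !optimize_writers;
    if (d->is_reader)
        bif->readers++;
    else
        bif->writers++;
}

/* A read filter was set: upgrade to the regular list on the second call. */
static void toy_setf(struct toy_bpf_d *d)
{
    d->setf_count++;
    if (!d->is_reader && d->setf_count >= 2) {
        d->bif->writers--;
        d->bif->readers++;
        d->is_reader = true;
    }
}

/* Scenario helper: attach one descriptor, set the filter n times. */
static bool toy_peers_after(bool optimize_writers, int setf_calls)
{
    struct toy_bpf_if bif = { 0, 0 };
    struct toy_bpf_d d;

    toy_attach(&d, &bif, optimize_writers);
    for (int i = 0; i < setf_calls; i++)
        toy_setf(&d);
    return toy_peers_present(&bif);
}
```

In this model a program that only opens the device (or only goes through the single filter set by pcap_open_live()) never shows up as a peer, while a tcpdump-like consumer that sets a filter a second time does.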
Current problem reports assigned to freebsd-net@FreeBSD.org
Note: to view an individual PR, use:
  http://www.freebsd.org/cgi/query-pr.cgi?pr=(number).

The following is a listing of current problems submitted by FreeBSD users. These represent problem reports covering all versions including experimental development code and obsolete releases.

S Tracker      Resp.  Description
o kern/166372  net    [patch] ipfilter drops UDP packets with zero checksum
o kern/166285  net    [arp] FreeBSD v8.1 REL p8 arp: unknown hardware addres
o kern/166255  net    [net] [patch] It should be possible to disable "promis
o kern/165963  net    [panic] [ipf] ipfilter/nat NULL pointer deference
o kern/165903  net    mbuf leak
o kern/165863  net    [panic] [netinet] [patch] in_lltable_prefix_free() rac
o kern/165643  net    [net] [patch] Missing vnet restores in net/if_ethersub
o kern/165622  net    [ndis][panic][patch] Unregistered use of FPU in kernel
s kern/165562  net    [request] add support for Intel i350 in FreeBSD 7.4
o kern/165526  net    [bxe] UDP packets checksum calculation whithin if_bxe
o kern/165488  net    [ppp] [panic] Fatal trap 12 jails and ppp , kernel wit
o kern/165305  net    [ip6] [request] Feature parity between IP_TOS and IPV6
o kern/165296  net    [vlan] [patch] Fix EVL_APPLY_VLID, update EVL_APPLY_PR
o kern/165181  net    [igb] igb freezes after about 2 weeks of uptime
o kern/165174  net    [patch] [tap] allow tap(4) to keep its address on clos
o kern/165152  net    [ip6] Does not work through the issue of ipv6 addresse
o kern/164569  net    [msk] [hang] msk network driver cause freeze in FreeBS
o kern/164495  net    [igb] connect double head igb to switch cause system t
o kern/164490  net    [pfil] Incorrect IP checksum on pfil pass from ip_outp
o kern/164475  net    [gre] gre misses RUNNING flag after a reboot
o kern/164400  net    [ipsec] immediate crash after the start of ipsec proce
o kern/164265  net    [netinet] [patch] tcp_lro_rx computes wrong checksum i
o kern/163903  net    [igb] "igb0:tx(0)","bpf interface lock" v2.2.5 9-STABL
o kern/163481  net    freebsd do not add itself to ping route packet
o kern/162927  net    [tun] Modem-PPP error ppp[1538]: tun0: Phase: Clearing
o kern/162926  net    [ipfilter] Infinite loop in ipfilter with fragmented I
o kern/162558  net    [dummynet] [panic] seldom dummynet panics
o kern/162509  net    [re] [panic] Kernel panic may be related to if_re.c (r
o kern/162153  net    [em] intel em driver 7.2.4 don't compile
o kern/162110  net    [igb] [panic] RELENG_9 panics on boot in IGB driver -
o kern/162028  net    [ixgbe] [patch] misplaced #endif in ixgbe.c
o kern/161381  net    [re] RTL8169SC - re0: PHY write failed
o kern/161277  net    [em] [patch] BMC cannot receive IPMI traffic after loa
o kern/160873  net    [igb] igb(4) from HEAD fails to build on 7-STABLE
o kern/160750  net    Intel PRO/1000 connection breaks under load until rebo
o kern/160693  net    [gif] [em] Multicast packet are not passed from GIF0 t
o kern/160420  net    [msk] phy write timeout on HP 5310m
o kern/160293  net    [ieee80211] ppanic] kernel panic during network setup
o kern/160206  net    [gif] gifX stops working after a while (IPv6 tunnel)
o kern/159817  net    [udp] write UDPv4: No buffer space available (code=55)
o kern/159629  net    [ipsec] [panic] kernel panic with IPsec in transport m
o kern/159621  net    [tcp] [panic] panic: soabort: so_count
o kern/159603  net    [netinet] [patch] in_ifscrubprefix() - network route c
o kern/159601  net    [netinet] [patch] in_scrubprefix() - loopback route re
o kern/159294  net    [em] em watchdog timeouts
o kern/159203  net    [wpi] Intel 3945ABG Wireless LAN not support IBSS
o kern/158930  net    [bpf] BPF element leak in ifp->bpf_if->bif_dlist
o kern/158726  net    [ip6] [patch] ICMPv6 Router Announcement flooding limi
o kern/158694  net    [ix] [lagg] ix0 is not working within lagg(4)
o kern/158665  net    [ip6] [panic] kernel pagefault in in6_setscope()
o kern/158635  net    [em] TSO breaks BPF packet captures with em driver
f kern/157802  net    [dummynet] [panic] kernel panic in dummynet
o kern/157785  net    amd64 + jail + ipfw + natd = very slow outbound traffi
o kern/157429  net    [re] Realtek RTL8169 doesn't work with re(4)
o kern/157418  net    [em] em driver lockup during boot on Supermicro X9SCM-
o kern/157410  net    [ip6] IPv6 Router Advertisements Cause Excessive CPU U
o kern/157287  net    [re] [panic] INVARIANTS panic (Memory modified after f
o kern/157209  net    [ip6] [patch] locking error in rip6_input() (sys/netin
o kern/157200  net    [network.subr] [patch] stf(4) can not communicate betw
o kern/157182  net    [lagg] lagg interface not working together
The problem with MTU <1500
Dear all!

I have two different values of uplink MTU - 1440 and 1492.
And in a local network - 1500.
On Cisco routers one can set 'set ip df 0', but I have no such router :(
What options are there for FreeBSD?
Only use the net/tcpmssd?
Will it be possible to process both 1Gb channels?

--
Vladislav V. Prodan
System & Network Administrator
http://support.od.ua
+380 67 4584408, +380 99 4060508
VVP88-RIPE
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: The problem with MTU <1500
2012/3/26 Владислав Продан

> Dear all!
> I have two different values of uplink MTU - 1440 and 1492.
> And in a local network - 1500.
> On Cisco routers one can set 'set ip df 0', but I have no such router :(
> What options are there for FreeBSD?

man ifconfig - note the 'mtu' option. You should set the MTU size on the interface to match the uplink MTU. And you should make sure you handle ICMP need-frag error messages, otherwise Path MTU discovery will be broken.

> Only use the net/tcpmssd?

That's a workaround for not properly handling ICMP error messages, and just for TCP ;-)

> Will it be possible to process both 1Gb channels?

Never. ;-) Not in practice. What is the bandwidth-delay product? That's a fraction of the 1Gbps you'll never get.

- M
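For reference, the relationship between the MTU values in this thread and the TCP MSS that an MSS clamp would aim for can be worked out as below. This is a minimal sketch, assuming plain IPv4 and TCP headers of 20 bytes each with no options. The function names are made up for illustration, and this is not a description of how net/tcpmssd is implemented.

```c
#include <assert.h>

enum {
    IPV4_HDR_LEN = 20,  /* IPv4 header without options */
    TCP_HDR_LEN  = 20   /* TCP header without options */
};

/* Largest TCP payload that fits in one unfragmented packet of this MTU. */
static int tcp_mss_for_mtu(int mtu)
{
    return mtu - IPV4_HDR_LEN - TCP_HDR_LEN;
}

/* MSS to clamp to when traffic crosses from the LAN to a smaller uplink. */
static int mss_clamp(int lan_mtu, int uplink_mtu)
{
    int smaller = lan_mtu < uplink_mtu ? lan_mtu : uplink_mtu;

    return tcp_mss_for_mtu(smaller);
}
```

With the numbers from the question, hosts on the 1500-byte LAN would normally advertise an MSS of 1460, while the two uplinks can only carry 1400 and 1452 bytes of TCP payload per packet.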
Re: Intel 82550 Pro/100 Ethernet and Microcode
YongHyeon PYUN wrote:
>
> I've attached a patch which will show both compatibility and EEPROM
> ID word and allowed loading microcode for 82550 family. Let me
> know whether this patch works for you.

Yes, the patch works fine and I have for my LOMs (rev 0x0d):

fxp0: port
fxp0: Compatibility word : 0X0d13
fxp0: EEPROM ID word : 0X5060

and for two external cards (rev 0x0c) I see

fxp1: port ...
fxp1: Compatibility word : 0X040b
fxp1: EEPROM ID word : 0X58a0
fxp2: port ...
fxp2: Compatibility word : 0X020b
fxp2: EEPROM ID word : 0X50a0

> I vaguely guess there are
> differences between LOM and non-LOM implementation(Mine is
> stand-alone PCI NIC). I would be able to selectively allow
> microcode loading after getting compatibility/EEPROM ID word.

Thank you for investigating this problem.

Regards
Andreas Longwitz
Re: 9-STABLE + Infiniband - incorrect interface counters
On Saturday, March 24, 2012 7:52:08 am Alex Tutubalin wrote:
> Hi,
>
> I'm playing with two FreeBSD 9-STABLE boxes connected via 10Gbps
> Infiniband (more details below) in Infiniband connected mode.
>
> I see incorrect interface statistics (e.g. in netstat output), output
> counters are 2x more than expected.
>
> EXAMPLE, ftp transfer of 1 GiB file:
>
> ftp> put file /dev/null
> local: file remote: /dev/null
> 229 Entering Extended Passive Mode (|||57978|)
> 150 Opening BINARY mode data connection for '/dev/null'.
> 100% |***| 953 MiB 390.43 MiB/s 00:00 ETA
> 226 Transfer complete.
> 10 bytes sent in 00:02 (390.13 MiB/s)
>
> Netstat on receiving side, counters are correct (for input):
>
> lexa@home-gw:/home/lexa# netstat -I ib1 5
>             input          (ib1)           output
>    packets  errs idrops      bytes    packets  errs      bytes colls
>          0     0      0          0          0     0          0     0
>      13955     0      0  222688126       9027     0    1192796     0
>      48921     0      0  780832960      32129     0    4240596     0
>          0     0      0          0          0     0         80     0
>
> Sum of bytes (input) is 1003521086, as expected.
>
> Netstat on sending side, output is 2x more:
>
> lexa@new-gw:/home/lexa# netstat -I ib0 5
>             input          (ib0)           output
>    packets  errs idrops      bytes    packets  errs      bytes colls
>          1     0      0        100          0     0          0     0
>      41162     0      0    2305210      62878     0 2008325984     0
>          1     0      0        100          0     0          0     0
>
> It looks like packet count is correct (13955+48921=62876, two packets
> missed somewhere), while byte count is exactly 2x more.

Yes, this is a bug. if_obytes already gets incremented in IFQ_HANDOFF(), so the IB code doesn't need to do it again. Try the patch at www.freebsd.org/~jhb/patches/ipoib_obytes.patch

I can't speak to the MTU issue though.

--
John Baldwin
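The double counting described in the reply can be illustrated with a toy user-space model. This is only a sketch of the accounting pattern, assuming (as the reply states) that the handoff step already adds the packet length to the output byte counter. The `toy_*` names are hypothetical stand-ins and do not reproduce the real IFQ_HANDOFF() macro or the IPoIB code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the interface output byte counter. */
struct toy_ifnet {
    uint64_t obytes;
};

/* Stand-in for the handoff step, which already accounts for the bytes. */
static void toy_handoff(struct toy_ifnet *ifp, uint32_t len)
{
    ifp->obytes += len;
}

/* Buggy output path: counts the bytes itself and then hands off. */
static void toy_output_buggy(struct toy_ifnet *ifp, uint32_t len)
{
    ifp->obytes += len;  /* redundant increment */
    toy_handoff(ifp, len);
}

/* Fixed output path: leaves the accounting to the handoff step. */
static void toy_output_fixed(struct toy_ifnet *ifp, uint32_t len)
{
    toy_handoff(ifp, len);
}

/* Send `count` packets of `len` bytes and return the byte counter. */
static uint64_t toy_send(int buggy, uint32_t len, int count)
{
    struct toy_ifnet ifp = { 0 };

    for (int i = 0; i < count; i++) {
        if (buggy)
            toy_output_buggy(&ifp, len);
        else
            toy_output_fixed(&ifp, len);
    }
    return ifp.obytes;
}
```

In this model the buggy path reports exactly twice the bytes actually sent, which is the same symptom as the netstat output quoted above.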
Re: The problem with MTU <1500
You may use this patch to ipfw:
http://www.freebsd.org/cgi/query-pr.cgi?pr=103454

Or you may use the ng_patch netgraph node (man ng_patch(4) should give you some examples).

The simplest way to do what you want is a pf rule:

scrub in on em0 no-df

You should check the performance of all those methods yourself.
Re: The problem with MTU <1500
On 26.03.2012 17:43, Владислав Продан wrote:
> Dear all!
> I have two different values of uplink MTU - 1440 and 1492.
> And in a local network - 1500.
> On Cisco routers one can set 'set ip df 0', but I have no such router :(
> What options are there for FreeBSD?

sysctl net.inet.tcp.path_mtu_discovery=0

> Only use the net/tcpmssd?
> Will it be possible to process both 1Gb channels?

--
Andrey Zonov
Re: The problem with MTU <1500
On 27.03.2012 0:06, Andrey Zonov wrote:
> On 26.03.2012 17:43, Владислав Продан wrote:
>> Dear all!
>> I have two different values of uplink MTU - 1440 and 1492.
>> And in a local network - 1500.
>> On Cisco routers one can set 'set ip df 0', but I have no such router :(
>> What options are there for FreeBSD?
>
> sysctl net.inet.tcp.path_mtu_discovery=0

Ooops, please ignore this.

>> Only use the net/tcpmssd?
>> Will it be possible to process both 1Gb channels?

--
Andrey Zonov
Re: Intel 82550 Pro/100 Ethernet and Microcode
On Mon, Mar 26, 2012 at 04:37:45PM +0200, Andreas Longwitz wrote:
> YongHyeon PYUN wrote:
> >
> > I've attached a patch which will show both compatibility and EEPROM
> > ID word and allowed loading microcode for 82550 family. Let me
> > know whether this patch works for you.
>
> Yes, the patch works fine and I have for my LOMs (rev 0x0d):
>
> fxp0: port
> fxp0: Compatibility word : 0X0d13
> fxp0: EEPROM ID word : 0X5060
>
> and for two external cards (rev 0x0c) I see
>
> fxp1: port ...
> fxp1: Compatibility word : 0X040b
> fxp1: EEPROM ID word : 0X58a0
> fxp2: port ...
> fxp2: Compatibility word : 0X020b
> fxp2: EEPROM ID word : 0X50a0
>

Thanks a lot! Here is the final version. The patch checks whether the controller is an i82550C with server extension. If the driver sees a server NIC like yours, it will allow microcode loading. I guess later 82550C controllers fixed the bug, since mine has no problems handling fragmented UDP datagrams.

> > I vaguely guess there are
> > differences between LOM and non-LOM implementation(Mine is
> > stand-alone PCI NIC). I would be able to selectively allow
> > microcode loading after getting compatibility/EEPROM ID word.
>
> Thank you for investigating this problem.
>
> Regards
>
> Andreas Longwitz
>

Index: sys/dev/fxp/if_fxpvar.h
===
--- sys/dev/fxp/if_fxpvar.h (revision 233418)
+++ sys/dev/fxp/if_fxpvar.h (working copy)
@@ -236,6 +236,7 @@
 #define FXP_FLAG_WOLCAP 0x2000 /* WOL capability */
 #define FXP_FLAG_WOL 0x4000 /* WOL active */
 #define FXP_FLAG_RXBUG 0x8000 /* Rx lock-up bug */
+#define FXP_FLAG_NO_UCODE 0x1 /* ucode is not applicable */

 /* Macros to ease CSR access. */
 #define CSR_READ_1(sc, reg) bus_read_1(sc->fxp_res[0], reg)
Index: sys/dev/fxp/if_fxp.c
===
--- sys/dev/fxp/if_fxp.c (revision 233418)
+++ sys/dev/fxp/if_fxp.c (working copy)
@@ -194,7 +194,7 @@
     { 0x1229, 0x08, 0, "Intel 82559 Pro/100 Ethernet" },
     { 0x1229, 0x09, 0, "Intel 82559ER Pro/100 Ethernet" },
     { 0x1229, 0x0c, 0, "Intel 82550 Pro/100 Ethernet" },
-    { 0x1229, 0x0d, 0, "Intel 82550 Pro/100 Ethernet" },
+    { 0x1229, 0x0d, 0, "Intel 82550C Pro/100 Ethernet" },
     { 0x1229, 0x0e, 0, "Intel 82550 Pro/100 Ethernet" },
     { 0x1229, 0x0f, 0, "Intel 82551 Pro/100 Ethernet" },
     { 0x1229, 0x10, 0, "Intel 82551 Pro/100 Ethernet" },
@@ -525,6 +525,18 @@
         sc->flags |= FXP_FLAG_WOLCAP;
     }

+    if (sc->revision == FXP_REV_82550_C) {
+        /*
+         * 82550C with server extension requires microcode to
+         * receive fragmented UDP datagrams. However if the
+         * microcode is used for client-only featured 82550C
+         * it locks up controller.
+         */
+        fxp_read_eeprom(sc, &data, 3, 1);
+        if ((data & 0x0400) == 0)
+            sc->flags |= FXP_FLAG_NO_UCODE;
+    }
+
     /* Receiver lock-up workaround detection. */
     if (sc->revision < FXP_REV_82558_A4) {
         fxp_read_eeprom(sc, &data, 3, 1);
@@ -3014,10 +3026,8 @@
 static uint32_t fxp_ucode_d101b0[] = D101_B0_RCVBUNDLE_UCODE;
 static uint32_t fxp_ucode_d101ma[] = D101M_B_RCVBUNDLE_UCODE;
 static uint32_t fxp_ucode_d101s[] = D101S_RCVBUNDLE_UCODE;
-#ifdef notyet
 static uint32_t fxp_ucode_d102[] = D102_B_RCVBUNDLE_UCODE;
 static uint32_t fxp_ucode_d102c[] = D102_C_RCVBUNDLE_UCODE;
-#endif
 static uint32_t fxp_ucode_d102e[] = D102_E_RCVBUNDLE_UCODE;

 #define UCODE(x) x, sizeof(x)/sizeof(uint32_t)
@@ -3035,12 +3045,10 @@
       D101M_CPUSAVER_DWORD, D101M_CPUSAVER_BUNDLE_MAX_DWORD },
     { FXP_REV_82559S_A, UCODE(fxp_ucode_d101s),
       D101S_CPUSAVER_DWORD, D101S_CPUSAVER_BUNDLE_MAX_DWORD },
-#ifdef notyet
     { FXP_REV_82550, UCODE(fxp_ucode_d102),
       D102_B_CPUSAVER_DWORD, D102_B_CPUSAVER_BUNDLE_MAX_DWORD },
     { FXP_REV_82550_C, UCODE(fxp_ucode_d102c),
       D102_C_CPUSAVER_DWORD, D102_C_CPUSAVER_BUNDLE_MAX_DWORD },
-#endif
     { FXP_REV_82551_F, UCODE(fxp_ucode_d102e),
       D102_E_CPUSAVER_DWORD, D102_E_CPUSAVER_BUNDLE_MAX_DWORD },
     { FXP_REV_82551_10, UCODE(fxp_ucode_d102e),
@@ -3055,6 +3063,9 @@
     struct fxp_cb_ucode *cbp;
     int i;

+    if (sc->flags & FXP_FLAG_NO_UCODE)
+        return;
+
     for (uc = ucode_table; uc->ucode != NULL; uc++)
         if (sc->revision == uc->revision)
             break;
@@ -3087,6 +3098,7 @@
         sc->tunable_int_delay,
         uc->bundle_max_offset == 0 ? 0 : sc->tunable_bundle_max);
     sc->flags |= FXP_FLAG_UCODE;
+    bzero(cbp, FXP_TXCB_SZ);
 }

 #define FXP_SYSCTL_STAT_ADD(c, h, n, p, d) \
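The decision the patch makes at attach time can be restated as a small stand-alone predicate. This is a sketch only: the `TOY_*` names are hypothetical, the value 0x0d for the 82550C revision is an assumption read off the device table entry changed in the patch, and the word passed in stands for whatever fxp_read_eeprom() returns for EEPROM offset 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed from the 0x0d table entry that the patch renames to 82550C. */
#define TOY_REV_82550_C    0x0d
/* Bit tested by the patch in the EEPROM word at offset 3. */
#define TOY_EEPROM_UCODE   0x0400

/*
 * Returns false when the patch would set FXP_FLAG_NO_UCODE, i.e. for an
 * 82550C whose EEPROM word has the tested bit clear. For every other
 * revision this check does not apply and loading is not vetoed here;
 * whether microcode exists for the revision is a separate table lookup.
 */
static bool toy_ucode_not_vetoed(uint8_t revision, uint16_t eeprom_word3)
{
    if (revision == TOY_REV_82550_C &&
        (eeprom_word3 & TOY_EEPROM_UCODE) == 0)
        return false;
    return true;
}
```

Only the combination of revision 0x0d and a clear bit leads to the early return that the patch adds to the microcode loader.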