[IPFW] [DIVERT] IP header checksums - why calculate twice?

2013-08-28 Thread Dom F
[Copy of my post to the FreeBSD Firewalls forum, sent here at the suggestion
of a moderator]


I've been toying with using IPDIVERT to adjust values in an IPv4 header. 
When adjusting an incoming IP header, the man page for divert(4) says:


Quote:
Packets written as incoming and having incorrect checksums will be dropped.

My main issue was trying to leverage the optimised kernel functions 
for checksumming an IP header, for example in_cksum_hdr(). Processes 
that connect to DIVERT sockets live in user-land, so in_cksum_hdr() 
isn't readily available at compile time.
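
(For reference, a minimal user-land sketch of the standard ones-complement
IPv4 header checksum (RFC 1071); the function name and layout here are
illustrative, not taken from any kernel or libc header:)

#include <stdint.h>
#include <stddef.h>
#include <netinet/in.h>
#include <netinet/ip.h>

/* Illustrative sketch: RFC 1071 ones-complement sum over an IPv4 header.
 * The caller zeroes ip->ip_sum before calling and stores the result. */
static uint16_t
ip_hdr_cksum(const struct ip *ip)
{
	const uint16_t *w = (const uint16_t *)ip;
	size_t len = ip->ip_hl << 2;		/* header length in bytes */
	uint32_t sum = 0;

	while (len > 1) {
		sum += *w++;
		len -= 2;
	}
	sum = (sum >> 16) + (sum & 0xffff);	/* fold the carries */
	sum += (sum >> 16);
	return ((uint16_t)~sum);
}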


Eventually the thought hit me that if some part of the kernel has to 
validate checksums (to decide whether to drop a packet) AND if my 
user-land process has to calculate a checksum to avoid its packet being 
dropped THEN surely there are two wasted checksum calculations going on?


If a root-owned process (root is needed for a RAW socket) can be trusted 
to inject packets back into the IP stack, then surely we can skip the 
checksum test and save a few CPU cycles plus a bit of latency.
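
(For context, the user-land side of this is roughly the loop below; the
divert port and error handling are illustrative. Without the patch that
follows, ip_sum has to be recomputed before the sendto():)

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>

#define MY_DIVERT_PORT	8668	/* must match the ipfw divert rule; illustrative */

static void
divert_loop(void)
{
	unsigned char buf[65535];
	struct sockaddr_in sin;
	socklen_t slen;
	ssize_t n;
	int fd;

	fd = socket(PF_INET, SOCK_RAW, IPPROTO_DIVERT);	/* needs root */
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(MY_DIVERT_PORT);
	if (fd < 0 || bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		return;

	for (;;) {
		slen = sizeof(sin);
		n = recvfrom(fd, buf, sizeof(buf), 0,
		    (struct sockaddr *)&sin, &slen);
		if (n <= 0)
			break;
		/* ... adjust IPv4 header fields in buf here ... */
		/* re-inject; the kernel drops it if the checksum is wrong */
		sendto(fd, buf, (size_t)n, 0, (struct sockaddr *)&sin, slen);
	}
	close(fd);
}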


Very simple patch for /usr/src/sys/netinet/ip_divert.c (based on rev 
224575):


Code:
--- ip_divert.c.orig    2013-08-26 20:52:18.0 +0100
+++ ip_divert.c 2013-08-26 20:52:44.0 +0100
@@ -496,6 +496,12 @@
/* Send packet to input processing via netisr */
switch (ip->ip_v) {
case IPVERSION:
+   /* mark mbuf as having valid checksum
+  to save userland divert process from
+  calculating checksum, and kernel having
+  to check it */
+   m->m_pkthdr.csum_flags |= CSUM_IP_CHECKED |
+ CSUM_IP_VALID;
netisr_queue_src(NETISR_IP, (uintptr_t)so, m);
break;
 #ifdef INET6


Re: Intel 4-port ethernet adaptor link aggregation issue

2013-08-28 Thread Joe Moog
All:

Thanks again to everybody for the responses and suggestions to our 4-port lagg 
issue. The solution (for those who may find the information useful) was 
to set kern.ipc.nmbclusters to a higher value than we had used 
initially. Our previous tuning had this value set at 25600, but following a 
recommendation from the good folks at iXSystems we bumped this to a value 
closer to 200, and the 4-port lagg is functioning as expected now.
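
(For anyone tuning the same knob, it looks like this; the value below is
purely illustrative, size it to your own mbuf usage:)

# /boot/loader.conf: nmbclusters is a boot-time tunable; on recent
# releases it can also be raised at run time with sysctl(8)
kern.ipc.nmbclusters="262144"

# check current cluster usage against the limit
netstat -m
sysctl kern.ipc.nmbclusters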

Thank you all.

Joe



LOR @netipsec/key.c:2434

2013-08-28 Thread Maciej Milewski

I've observed the following LOR in my logs, just after a fresh start:

lock order reversal:
 1st 0x80865bb4 sptree (fast ipsec security policy database) @ 
/data/head-git/sys/netipsec/key.c:2434

 2nd 0x80861554 rawcb (rawcb) @ /data/head-git/sys/netipsec/keysock.c:303
KDB: stack backtrace:
db_trace_thread+30 (?,?,?,?) ra cd67f7880018 sp 0 sz 0
db_trace_self+1c (?,?,?,?) ra cd67f7a00018 sp 0 sz 0
8008f828+34 (?,?,?,?) ra cd67f7b801a0 sp 0 sz 0
kdb_backtrace+44 (?,?,?,?) ra cd67f9580018 sp 0 sz 0
802defd8+34 (?,?,?,?) ra cd67f9700020 sp 0 sz 0
witness_checkorder+b0c (?,?,8061bb24,12f) ra cd67f9900050 sp 0 sz 1
__mtx_lock_flags+e8 (?,?,?,?) ra cd67f9e00030 sp 0 sz 0
key_sendup_mbuf+274 (?,?,?,?) ra cd67fa100030 sp 0 sz 0
8044f238+150 (?,?,?,?) ra cd67fa400030 sp 0 sz 0
key_parse+f6c (?,?,?,?) ra cd67fa700180 sp 0 sz 0
key_output+334 (?,?,?,?) ra cd67fbf00028 sp 0 sz 0
8036c314+8c (?,?,?,?) ra cd67fc180020 sp 0 sz 0
80451e80+28 (?,?,?,?) ra cd67fc380020 sp 0 sz 0
sosend_generic+4c4 (?,0,?,?) ra cd67fc580068 sp 1 sz 0
sosend+34 (?,?,?,?) ra cd67fcc00028 sp 0 sz 0
kern_sendit+11c (?,?,?,?) ra cd67fce80068 sp 0 sz 0
80307a48+b4 (?,?,?,?) ra cd67fd500038 sp 0 sz 0
sys_sendto+50 (?,?,?,?) ra cd67fd880040 sp 0 sz 0
trap+7f0 (?,?,?,?) ra cd67fdc800b8 sp 0 sz 0
MipsUserGenException+10c (?,?,?,40896240) ra cd67fe80 sp 0 sz 0
pid 3035
root@RSPRO:~# ps auxwww | grep 3035
root    3035  0.0  1.3 13832 1736  -  Is    2:03AM   0:00.45 
/usr/local/sbin/racoon


This is on a MIPS Ubiquiti RSPRO board that is configured to use IPSEC 
transport mode with IPv6. For that I'm using racoon from the ipsec-tools 
package.

The system is currently running HEAD@r253582.

I'll be happy to test patches if anyone knows why this happens and can 
provide them.


--
Regards,
Maciej Milewski



Network stack changes

2013-08-28 Thread Alexander V. Chernikov

Hello list!

There are a lot of constantly arising discussions related to networking 
stack performance/changes.


I'll try to summarize the current problems and possible solutions from my 
point of view.
(Generally this is one problem: the stack is slooow, 
but we need to know why and what to do.)


Let's start with the current IPv4 packet flow on a typical router:
http://static.ipfw.ru/images/freebsd_ipv4_flow.png

(I'm sorry I can't provide this as text, since Visio doesn't have any 
'ascii-art' exporter.)


Note that we are using a process-to-completion model, i.e. any packet is 
processed in the ISR until it is either consumed by the L4+ stack, dropped, 
or put on an egress NIC queue.

(There is also a deferred ISR model implemented inside netisr, but it does 
not change much:
it can help to do more fine-grained hashing (for GRE or other similar 
traffic), but

1) it uses per-packet mutex locking, which kills all performance
2) it currently does not have _any_ hashing functions (see the absence of 
flags in `netstat -Q`)
People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or a 
modified PPPoE/GRE version)
report some profit, but without fixing (1) it can't help much
)

So, let's start:

1) Ixgbe uses a mutex to protect each RX ring, which is perfectly fine 
since there is nearly no contention
(the only thing that can happen is driver reconfiguration, which is rare 
and, more significantly, we do this once
for the batch of packets received in a given interrupt). However, due to 
some (im)possible deadlocks the current code
does a per-packet ring unlock/lock (see ixgbe_rx_input()).
There was a discussion that ended with nothing: 
http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html


1*) Possible BPF users. Here we have one rlock if there are any readers 
present
(and a mutex for any matching packets, but this is more or less OK. 
Additionally, there is WIP to implement multiqueue BPF,
and there is a chance that we can reduce lock contention there). There is 
also an "optimize_writers" hack permitting applications
like CDP to use BPF as writers without registering them as receivers 
(which implies the rlock).


2/3) Virtual interfaces (laggs/vlans over lagg and other similar 
constructions).
Currently we simply use an rlock to do s/ix0/lagg0/ and, what is much 
funnier, we use a complex vlan_hash with another rlock to
get the vlan interface from the underlying one.

This is definitely not how things should be done, and this can be 
changed more or less easily.


There are some useful terms/techniques in the world of software/hardware 
routing: they have a clear 'control plane' and 'data plane' separation.
The former deals with control traffic (IGP, MLD, IGMP snooping, lagg 
hellos, ARP/NDP, etc.) and some data traffic (packets with TTL=1, with 
options, destined to hosts without an ARP/NDP record, and similar). The 
latter is done in hardware (or an efficient software implementation).
The control plane is responsible for providing the data for efficient 
data-plane operation. This is the point we are missing nearly everywhere.


What I want to say is: lagg is pure control-plane stuff and vlan is 
nearly the same. We can't apply this approach to complex cases like 
lagg-over-vlans-over-vlans-over-(pppoe_ng0-and_wifi0),
but we definitely can do it for the most common setups like igb* or ix* 
in a lagg, with or without vlans on top of the lagg.


We already have some capabilities like VLANHWFILTER/VLANHWTAG, we can 
add some more. We even have per-driver hooks to program HW filtering.
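
(As a concrete example, these capabilities are already visible and can be
toggled per interface today; the interface name is illustrative:)

# show which offload capabilities the NIC currently has enabled
ifconfig ix0 | grep options
# toggle HW vlan filtering / tagging
ifconfig ix0 vlanhwfilter
ifconfig ix0 -vlanhwtag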


One small step is to throw the packet to the vlan interface directly (P1); 
proof of concept (working in production):

http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html

Another is to change lagg packet accounting: 
http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html
Again, this is more like what HW boxes do (aggregate all counters, including 
errors) (and I can't imagine what real error we could get from _lagg_).


4) If we are a router, we can either do the slooow ip_input() -> 
ip_forward() -> ip_output() cycle or use the optimized ip_fastfwd(), which 
falls back to the 'slow' path for multicast/options/local traffic (i.e. it 
works exactly like the 'data plane' part).
(Btw, we could consider turning net.inet.ip.fastforwarding on by default, 
at least for non-IPSEC kernels.)
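
(For illustration, on a forwarding box this is just a sysctl; whether to
flip the default is the open question above:)

# enable the ip_fastfwd() forwarding path
sysctl net.inet.ip.fastforwarding=1
# or persistently via /etc/sysctl.conf:
# net.inet.ip.fastforwarding=1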


Here we have to determine whether a packet is local or not, i.e. F(dst_ip) 
returning 1 or 0. Currently we are simply using the standard rlock + a hash 
of interface addresses.

(And some consumers like ipfw(4) do the same, but without the lock.)
We don't need to do this! We can build a sorted array of IPv4 addresses, or 
another efficient structure, on every address change and use it unlocked, 
with delayed garbage collection (proof of concept attached; a sketch follows 
below).
(There is another thing to discuss: maybe we can do this once somewhere 
in ip_input and mark the mbuf as 'local/non-local'?)
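
(To make the idea concrete, a minimal sketch of the lookup side; names and
layout are illustrative and are not the attached proof of concept. The
control path rebuilds the sorted array on every address change, the data
path searches it without taking any lock:)

#include <stdint.h>
#include <stddef.h>

struct laddr_tbl {
	const uint32_t	*addrs;		/* local IPv4 addresses, sorted ascending */
	size_t		 count;
};

/* F(dst): returns 1 if dst is one of our addresses, 0 otherwise. */
static int
ip_is_local(const struct laddr_tbl *tbl, uint32_t dst)
{
	size_t lo = 0, hi = tbl->count;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tbl->addrs[mid] == dst)
			return (1);
		if (tbl->addrs[mid] < dst)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (0);
}

/* On address change the control plane builds a new table, publishes the
 * pointer atomically, and frees the old one only after a grace period
 * (the "delayed garbage collection" mentioned above). */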


5, 9) Currently we have L3 ingress/egress PFIL hooks protected by 
rmlocks. This is OK.


However, 6) and 7) are not.
Firewall can u

Re: Network stack changes

2013-08-28 Thread Jack Vogel
Very interesting material, Alexander; only had time to glance at it now,
will look in more
depth later. Thanks!

Jack



On Wed, Aug 28, 2013 at 11:30 AM, Alexander V. Chernikov <
melif...@yandex-team.ru> wrote:

> Hello list!
>
> There is a lot constantly raising  discussions related to networking stack
> performance/changes.
>
> I'll try to summarize current problems and possible solutions from my
> point of view.
> (Generally this is one problem: stack is slooow,
> but we need to know why and what to do).
>
> Let's start with current IPv4 packet flow on a typical router:
> http://static.ipfw.ru/images/freebsd_ipv4_flow.png
>
> (I'm sorry I can't provide this as text since Visio don't have any
> 'ascii-art' exporter).
>
> Note that we are using process-to-completion model, e.g. process any
> packet in ISR until it is either
> consumed by L4+ stack or dropped or put to egress NIC queue.
>
> (There is also deferred ISR model implemented inside netisr but it does
> not change much:
> it can help to do more fine-grained hashing (for GRE or other similar
> traffic), but
> 1) it uses per-packet mutex locking which kills all performance
> 2) it currently does not have _any_ hashing functions (see absence of
> flags in `netstat -Q`)
> People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or
> modified PPPoe/GRE version)
> report some profit, but without fixing (1) it can't help much
> )
>
> So, let's start:
>
> 1) Ixgbe uses mutex to protect each RX ring which is perfectly fine since
> there is nearly no contention
> (the only thing that can happen is driver reconfiguration which is rare
> and, more signifficant, we do this once
> for the batch of packets received in given interrupt). However, due to
> some (im)possible deadlocks current code
> does per-packet ring unlock/lock (see ixgbe_rx_input()).
> There was a discussion ended with nothing:
> http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html
>
> 1*) Possible BPF users. Here we have one rlock if there are any readers
> present
> (and mutex for any matching packets, but this is more or less OK.
> Additionally, there is WIP to implement multiqueue BPF
> and there is chance that we can reduce lock contention there). There is
> also an "optimize_writers" hack permitting applications
> like CDP to use BPF as writers but not registering them as receivers
> (which implies rlock)
>
> 2/3) Virtual interfaces (laggs/vlans over lagg and other simular
> constructions).
> Currently we simply use rlock to make s/ix0/lagg0/ and, what is much more
> funny - we use complex vlan_hash with another rlock to
> get vlan interface from underlying one.
>
> This is definitely not like things should be done and this can be changed
> more or less easily.
>
> There are some useful terms/techniques in world of software/hardware
> routing: they have clear 'control plane' and 'data plane' separation.
> Former one is for dealing control traffic (IGP, MLD, IGMP snooping, lagg
> hellos, ARP/NDP, etc..) and some data traffic (packets with TTL=1, with
> options, destined to hosts without ARP/NDP record, and similar). Latter one
> is done in hardware (or effective software implementation).
> Control plane is responsible to provide data for efficient data plane
> operations. This is the point we are missing nearly everywhere.
>
> What I want to say is: lagg is pure control-plane stuff and vlan is nearly
> the same. We can't apply this approach to complex cases like
> lagg-over-vlans-over-vlans-over-(pppoe_ng0-and_wifi0)
> but we definitely can do this for most common setups like (igb* or ix* in
> lagg with or without vlans on top of lagg).
>
> We already have some capabilities like VLANHWFILTER/VLANHWTAG, we can add
> some more. We even have per-driver hooks to program HW filtering.
>
> One small step to do is to throw packet to vlan interface directly (P1),
> proof-of-concept(working in production):
> http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html
>
> Another is to change lagg packet accounting:
> http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html
> Again, this is more like HW boxes do (aggregate all counters including
> errors) (and I can't imagine what real error we can get from _lagg_).
>
> 4) If we are router, we can do either slooow ip_input() -> ip_forward() ->
> ip_output() cycle or use optimized ip_fastfwd() which falls back to 'slow'
> path for multicast/options/local traffic (e.g. works exactly like 'data
> plane' part).
> (Btw, we can consider net.inet.ip.fastforwarding to be turned on by
> default at least for non-IPSEC kernels)
>
> Here we have to

Re: Network stack changes

2013-08-28 Thread Andre Oppermann

On 28.08.2013 20:30, Alexander V. Chernikov wrote:

Hello list!


Hello Alexander,

you sent quite a few things in the same email.  I'll try to respond
as much as I can right now.  Later you should split it up to have
more in-depth discussions on the individual parts.

If you could make it to the EuroBSDcon 2013 DevSummit that would be
even more awesome.  Most of the active network stack people will be
there too.


There is a lot constantly raising  discussions related to networking stack 
performance/changes.

I'll try to summarize current problems and possible solutions from my point of 
view.
(Generally this is one problem: stack is slooow, but we 
need to know why and
what to do).


Compared to others it's not thaaat slow. ;)


Let's start with current IPv4 packet flow on a typical router:
http://static.ipfw.ru/images/freebsd_ipv4_flow.png

(I'm sorry I can't provide this as text since Visio don't have any 'ascii-art' 
exporter).

Note that we are using process-to-completion model, e.g. process any packet in 
ISR until it is either
consumed by L4+ stack or dropped or put to egress NIC queue.

(There is also deferred ISR model implemented inside netisr but it does not 
change much:
it can help to do more fine-grained hashing (for GRE or other similar traffic), 
but
1) it uses per-packet mutex locking which kills all performance
2) it currently does not have _any_ hashing functions (see absence of flags in 
`netstat -Q`)
People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or modified 
PPPoe/GRE version)
report some profit, but without fixing (1) it can't help much
)

So, let's start:

1) Ixgbe uses mutex to protect each RX ring which is perfectly fine since there 
is nearly no contention
(the only thing that can happen is driver reconfiguration which is rare and, 
more signifficant, we
do this once
for the batch of packets received in given interrupt). However, due to some 
(im)possible deadlocks
current code
does per-packet ring unlock/lock (see ixgbe_rx_input()).
There was a discussion ended with nothing:
http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html

1*) Possible BPF users. Here we have one rlock if there are any readers present
(and mutex for any matching packets, but this is more or less OK. Additionally, 
there is WIP to
implement multiqueue BPF
and there is chance that we can reduce lock contention there).


Rlock to rmlock?


There is also an "optimize_writers" hack permitting applications
like CDP to use BPF as writers but not registering them as receivers (which 
implies rlock)


I believe longer term we should solve this with a protocol type "ethernet"
so that one can send/receive ethernet frames through a normal socket.


2/3) Virtual interfaces (laggs/vlans over lagg and other simular constructions).
Currently we simply use rlock to make s/ix0/lagg0/ and, what is much more funny 
- we use complex
vlan_hash with another rlock to
get vlan interface from underlying one.

This is definitely not like things should be done and this can be changed more 
or less easily.


Indeed.


There are some useful terms/techniques in world of software/hardware routing: 
they have clear
'control plane' and 'data plane' separation.
Former one is for dealing control traffic (IGP, MLD, IGMP snooping, lagg 
hellos, ARP/NDP, etc..) and
some data traffic (packets with TTL=1, with options, destined to hosts without 
ARP/NDP record, and
similar). Latter one is done in hardware (or effective software implementation).
Control plane is responsible to provide data for efficient data plane 
operations. This is the point
we are missing nearly everywhere.


ACK.


What I want to say is: lagg is pure control-plane stuff and vlan is nearly the 
same. We can't apply
this approach to complex cases like 
lagg-over-vlans-over-vlans-over-(pppoe_ng0-and_wifi0)
but we definitely can do this for most common setups like (igb* or ix* in lagg 
with or without vlans
on top of lagg).


ACK.


We already have some capabilities like VLANHWFILTER/VLANHWTAG, we can add some 
more. We even have
per-driver hooks to program HW filtering.


We could.  Though for vlan it looks like it would be easier to remove the
hardware vlan tag stripping and insertion.  It only adds complexity in all
drivers for no gain.


One small step to do is to throw packet to vlan interface directly (P1), 
proof-of-concept(working in
production):
http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html

Another is to change lagg packet accounting:
http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html
Again, this is more like HW boxes do (aggregate all counters including errors) 
(and I can't imagine
what real error we can get from _lagg_).

>

4) If we are router, we can do either slooow ip_input() -> ip_forward() -> 
ip_output() cycle or use
optimized ip_fastfwd() which falls back to 'slow' path for 
multicast/options/local traffic (e.g.
works exactly like 'data plane' part).

Re: Flow ID, LACP, and igb

2013-08-28 Thread Alan Somers
On Mon, Aug 26, 2013 at 2:40 PM, Andre Oppermann  wrote:

> On 26.08.2013 19:18, Justin T. Gibbs wrote:
>
>> Hi Net,
>>
>> I'm an infrequent traveler through the networking code and would
>> appreciate some feedback on some proposed solutions to issues Spectra
>> has seen with outbound LACP traffic.
>>
>> lacp_select_tx_port() uses the flow ID if it is available in the outbound
>> mbuf to select the outbound port.  The igb driver uses the msix queue of
>> the inbound packet to set a packet's flow ID.  This doesn't provide enough
>> bits of information to yield a high quality flow ID.  If, for example, the
>> switch controlling inbound packet distribution does a poor job, the
>> outbound
>> packet distribution will also be poorly distributed.
>>
>
> Please note that inbound and outbound flow ID do not need to be the same
> or symmetric.  It only should stay the same for all packets in a single
> connection to prevent reordering.
>
> Generally it doesn't matter if in- and outbound packets do not use the
> same queue.  Only in sophisticated setups with full affinity, which we
> don't support yet, it could matter.
>
>
>  The majority of the adapters supported by this driver will compute
>> the Toeplitz RSS hash.  Using this data seems to work quite well
>> in our tests (3 member LAGG group).  Is there any reason we shouldn't
>> use the RSS hash for flow ID?
>>
>
> Using the RSS hash is the idea.  The infrastructure and driver adjustments
> haven't been implemented throughout yet.
>
>
>  We also tried disabling the use of flow ID and doing the hash directly in
>> the driver.  Unfortunately, the current hash is pretty weak.  It
>> multiplies
>> by 33, which yield very poor distributions if you need to mod the result
>> by 3 (e.g. LAGG group with 3 members).  Alan modified the driver to use
>> the FNV hash, which is already in the kernel, and this yielded much better
>> results.  He is still benchmarking the impact of this change.  Assuming we
>> can get decent flow ID data, this should only impact outbound UDP, since
>> the
>> stack doesn't provide a flow ID in this case.
>>
>> Are there other checksums we should be looking at in addition to FNV?
>>
>
> siphash24() is fast, keyed and strong.
>
I benchmarked hash32 (the existing hash function) vs fnv_hash using both
TCP and UDP, with 1500 and 9000 byte MTUs.  At 10Gbps, I couldn't measure
any difference in either throughput or cpu utilization.  Given that
siphash24 is definitely slower than hash32, there's no way that I'll find
it to be significantly faster than fnv_hash for this application.  In fact,
I'm guessing that it will be slower due to the function call overhead and
the fact that lagg_hashmbuf calls the hash function on very short buffers.
Therefore I'm going to commit the change using fnv_hash in the next few
days if no one objects.  Here's the diff:

 //SpectraBSD/stable/sys/net/ieee8023ad_lacp.c#4 (text) 

@@ -763,7 +763,6 @@
 sc->sc_psc = (caddr_t)lsc;
 lsc->lsc_softc = sc;

-lsc->lsc_hashkey = arc4random();
 lsc->lsc_active_aggregator = NULL;
 LACP_LOCK_INIT(lsc);
 TAILQ_INIT(&lsc->lsc_aggregators);
@@ -841,7 +840,7 @@
 if (sc->use_flowid && (m->m_flags & M_FLOWID))
 hash = m->m_pkthdr.flowid;
 else
-hash = lagg_hashmbuf(sc, m, lsc->lsc_hashkey);
+hash = lagg_hashmbuf(sc, m);
 hash %= pm->pm_count;
 lp = pm->pm_map[hash];


 //SpectraBSD/stable/sys/net/ieee8023ad_lacp.h#2 (text) 

@@ -244,7 +244,6 @@
 LIST_HEAD(, lacp_port)lsc_ports;
 struct lacp_portmaplsc_pmap[2];
 volatile u_intlsc_activemap;
-u_int32_tlsc_hashkey;
 };

 #defineLACP_TYPE_ACTORINFO1

 //SpectraBSD/stable/sys/net/if_lagg.c#9 (text) 

@@ -35,7 +35,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -1588,10 +1588,10 @@
 }

 uint32_t
-lagg_hashmbuf(struct lagg_softc *sc, struct mbuf *m, uint32_t key)
+lagg_hashmbuf(struct lagg_softc *sc, struct mbuf *m)
 {
 uint16_t etype;
-uint32_t p = key;
+uint32_t p = FNV1_32_INIT;
 int off;
 struct ether_header *eh;
 const struct ether_vlan_header *vlan;
@@ -1622,13 +1622,13 @@
 eh = mtod(m, struct ether_header *);
 etype = ntohs(eh->ether_type);
 if (sc->sc_flags & LAGG_F_HASHL2) {
-p = hash32_buf(&eh->ether_shost, ETHER_ADDR_LEN, p);
-p = hash32_buf(&eh->ether_dhost, ETHER_ADDR_LEN, p);
+p = fnv_32_buf(&eh->ether_shost, ETHER_ADDR_LEN, p);
+p = fnv_32_buf(&eh->ether_dhost, ETHER_ADDR_LEN, p);
 }

 /* Special handling for encapsulating VLAN frames */
 if ((m->m_flags & M_VLANTAG) && (sc->sc_flags & LAGG_F_HASHL2)) {
-p = hash32_buf(&m->m_pkthdr.ether_vtag,
+p = fnv_32_buf(&m->m_pkthdr.ether_vtag,
 sizeof(m->m_pkthdr.ether_vtag), p);
 } else if (etype == ETHERTYPE_VLAN) {
 vlan = lagg_gethdr(m, off,  sizeof(*vlan), &buf);
@@ -1636,7 +1636,7 @@
   

Re: Network stack changes

2013-08-28 Thread Slawa Olhovchenkov
On Thu, Aug 29, 2013 at 12:24:48AM +0200, Andre Oppermann wrote:

> > ..
> > while Intel DPDK claims 80MPPS (and 6windgate talks about 160 or so) on the 
> > same-class hardware and
> > _userland_ forwarding.
> 
> Those numbers sound a bit far out.  Maybe if the packet isn't touched
> or looked at at all in a pure netmap interface to interface bridging
> scenario.  I don't believe these numbers.

80 Mpps * 64 bytes * 8 bits = 40.96 Gb/s
Maybe DCA? And using a CPU with 40 PCIe lanes and 4 memory channels.


Re: Flow ID, LACP, and igb

2013-08-28 Thread Andre Oppermann

On 29.08.2013 01:42, Alan Somers wrote:

On Mon, Aug 26, 2013 at 2:40 PM, Andre Oppermann  wrote:


On 26.08.2013 19:18, Justin T. Gibbs wrote:


Hi Net,

I'm an infrequent traveler through the networking code and would
appreciate some feedback on some proposed solutions to issues Spectra
has seen with outbound LACP traffic.

lacp_select_tx_port() uses the flow ID if it is available in the outbound
mbuf to select the outbound port.  The igb driver uses the msix queue of
the inbound packet to set a packet's flow ID.  This doesn't provide enough
bits of information to yield a high quality flow ID.  If, for example, the
switch controlling inbound packet distribution does a poor job, the
outbound
packet distribution will also be poorly distributed.



Please note that inbound and outbound flow ID do not need to be the same
or symmetric.  It only should stay the same for all packets in a single
connection to prevent reordering.

Generally it doesn't matter if in- and outbound packets do not use the
same queue.  Only in sophisticated setups with full affinity, which we
don't support yet, it could matter.


  The majority of the adapters supported by this driver will compute

the Toeplitz RSS hash.  Using this data seems to work quite well
in our tests (3 member LAGG group).  Is there any reason we shouldn't
use the RSS hash for flow ID?



Using the RSS hash is the idea.  The infrastructure and driver adjustments
haven't been implemented throughout yet.


  We also tried disabling the use of flow ID and doing the hash directly in

the driver.  Unfortunately, the current hash is pretty weak.  It
multiplies
by 33, which yield very poor distributions if you need to mod the result
by 3 (e.g. LAGG group with 3 members).  Alan modified the driver to use
the FNV hash, which is already in the kernel, and this yielded much better
results.  He is still benchmarking the impact of this change.  Assuming we
can get decent flow ID data, this should only impact outbound UDP, since
the
stack doesn't provide a flow ID in this case.

Are there other checksums we should be looking at in addition to FNV?



siphash24() is fast, keyed and strong.


I benchmarked hash32 (the existing hash function) vs fnv_hash using both
TCP and UDP, with 1500 and 9000 byte MTUs.  At 10Gbps, I couldn't measure
any difference in either throughput or cpu utilization.  Given that
siphash24 is definitely slower than hash32, there's no way that I'll find
it to be significantly faster than fnv_hash for this application.  In fact,
I'm guessing that it will be slower due to the function call overhead and
the fact that lagg_hashmbuf calls the hash function on very short buffers.


No problem with fnv_hash().  While I agree that it is likely that siphash24()
is slower, if you could afford the time to do a test run it would be great
to go from guessing to knowing.


Therefore I'm going to commit the change using fnv_hash in the next few
days if no one objects.  Here's the diff:

 //SpectraBSD/stable/sys/net/ieee8023ad_lacp.c#4 (text) 

@@ -763,7 +763,6 @@
  sc->sc_psc = (caddr_t)lsc;
  lsc->lsc_softc = sc;

-lsc->lsc_hashkey = arc4random();
  lsc->lsc_active_aggregator = NULL;
  LACP_LOCK_INIT(lsc);
  TAILQ_INIT(&lsc->lsc_aggregators);
@@ -841,7 +840,7 @@
  if (sc->use_flowid && (m->m_flags & M_FLOWID))
  hash = m->m_pkthdr.flowid;
  else
-hash = lagg_hashmbuf(sc, m, lsc->lsc_hashkey);
+hash = lagg_hashmbuf(sc, m);
  hash %= pm->pm_count;
  lp = pm->pm_map[hash];


The reason for the hashkey was to prevent directed "attacks" on the load
balancing by choosing/predicting its outcome.  This is good and bad, as it
is non-deterministic between runs, which makes debugging particular
situations harder.  To work around the lack of a key for fnv_hash(),
XOR'ing the hash output with a pre-initialized random value is likely
sufficient.  The true importance of this randomization is debatable; I just
point out why it was there, not to object to your removing it.
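
(For illustration, that workaround is only a few lines in the if_lagg.c
context above; the wrapper name and 'seed' parameter are mine, with the
seed initialized once at attach time, e.g. from arc4random():)

/* Illustrative sketch: key an unkeyed hash by XOR'ing its output with a
 * per-softc random value chosen once at attach time. */
static uint32_t
lagg_keyed_hash(struct lagg_softc *sc, struct mbuf *m, uint32_t seed)
{
	/* seed = arc4random() done once at attach, kept in the softc */
	return (lagg_hashmbuf(sc, m) ^ seed);
}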

--
Andre



Re: Network stack changes

2013-08-28 Thread Bryan Venteicher


- Original Message -
> On 28.08.2013 20:30, Alexander V. Chernikov wrote:
> > Hello list!
> 
> Hello Alexander,
> 
> you sent quite a few things in the same email.  I'll try to respond
> as much as I can right now.  Later you should split it up to have
> more in-depth discussions on the individual parts.
> 

> 
> > We already have some capabilities like VLANHWFILTER/VLANHWTAG, we can add
> > some more. We even have
> > per-driver hooks to program HW filtering.
> 
> We could.  Though for vlan it looks like it would be easier to remove the
> hardware vlan tag stripping and insertion.  It only adds complexity in all
> drivers for no gain.
> 

In the shorter term, can we remove the requirement for the parent
interface to support IFCAP_VLAN_HWTAGGING in order to do checksum
offloading on the VLAN interface (see vlan_capabilities())?
