Re: nmbclusters: how do we want to fix this for 8.3 ?
On 22 Feb 2012, at 22:51, Jack Vogel wrote:

> On Wed, Feb 22, 2012 at 1:44 PM, Luigi Rizzo wrote:
>
>> On Wed, Feb 22, 2012 at 09:09:46PM +, Ben Hutchings wrote:
>>> On Wed, 2012-02-22 at 21:52 +0100, Luigi Rizzo wrote:
>> ... I have hit this problem recently, too. Maybe the issue mostly/only exists on 32-bit systems.
>>>
>>> No, we kept hitting mbuf pool limits on 64-bit systems when we started
>>> working on FreeBSD support.
>>
>> ok never mind then, the mechanism would be the same, though
>> the limits (especially VM_LIMIT) would be different.
>> Here is a possible approach:
>> 1. nmbclusters consume the kernel virtual address space so there must be
>>    some upper limit, say VM_LIMIT = 256000 (translates to 512MB of address space)
>> 2. also you don't want the clusters to take up too much of the
>>    available memory. This one would only trigger for minimal-memory systems,
>>    or virtual machines, but still... MEM_LIMIT = (physical_ram / 2) / 2048
>> 3. one may try to set a suitably large, desirable number of buffers
>>    TARGET_CLUSTERS = 128000
>> 4. and finally we could use the current default as the absolute minimum
>>    MIN_CLUSTERS = 1024 + maxusers*64
>> Then at boot the system could say
>>    nmbclusters = min(TARGET_CLUSTERS, VM_LIMIT, MEM_LIMIT)
>>    nmbclusters = max(nmbclusters, MIN_CLUSTERS)
>> In turn, i believe interfaces should do their part and by default
>> never try to allocate more than a fraction of the total number of buffers,
>>>
>>> Well what fraction should that be? It surely depends on how many
>>> interfaces are in the system and how many queues the other interfaces
>>> have.
>> if necessary reducing the number of active queues.
>>>
>>> So now I have too few queues on my interface even after I increase the
>>> limit.
>>>
>>> There ought to be a standard way to configure numbers of queues and
>>> default queue lengths.
>>
>> Jack raised the problem that there is a poorly chosen default for
>> nmbclusters, causing one interface to consume all the buffers.
>> If the user explicitly overrides the value then
>> the number of cluster should be what the user asks (memory permitting).
>> The next step is on devices: if there are no overrides, the default
>> for a driver is to be lean. I would say that topping the request between
>> 1/4 and 1/8 of the total buffers is surely better than the current
>> situation. Of course if there is an explicit override, then use
>> it whatever happens to the others.
>>
>> cheers
>> luigi
>
> Hmmm, well, I could make the default use only 1 queue or something like
> that, was thinking more of what actual users of the hardware would want.

I think it is more reasonable to set up the interface with one queue by default. Even if the cluster count does not hit the maximum, you will end up with an unbalanced setting that leaves a very low mbuf count for other uses.

> After the installed kernel is booted and the admin would do whatever post
> install modifications they wish it could be changed, along with nmbclusters.
>
> This was why i sought opinions, of the algorithm itself, but also anyone
> using ixgbe and igb in heavy use, what would you find most convenient?
>
> Jack
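For illustration only, a minimal C sketch of the boot-time computation Luigi outlines above; the constants are the example values from the mail and the helper name is invented, not a committed implementation.

#define NMB_VM_LIMIT		256000		/* ~512MB of KVA at 2KB per cluster */
#define NMB_TARGET_CLUSTERS	128000

static void
tune_nmbclusters(void)
{
	u_long mem_limit, min_clusters;

	/* never let clusters eat more than half of physical memory */
	mem_limit = (ptoa((uintmax_t)physmem) / 2) / MCLBYTES;

	/* the current default, kept as the absolute minimum */
	min_clusters = 1024 + maxusers * 64;

	nmbclusters = ulmin(NMB_TARGET_CLUSTERS, ulmin(NMB_VM_LIMIT, mem_limit));
	nmbclusters = ulmax(nmbclusters, min_clusters);
}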
Re: Some performance measurements on the FreeBSD network stack
> I have a patch that has been sitting around for a long time due to
> review cycle latency that caches a pointer to the rtentry (and
> llentry) in the inpcb. Before each use the rtentry is checked
> against a generation number in the routing tree that is incremented on
> every routing table update.

Hi Kip,

Is there a public location for the patch? What can be done to speed up the commit: testing?

Fabien
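As a hedged sketch of the scheme Kip describes (all structure and function names below are illustrative, not the actual patch): the inpcb caches the rtentry/llentry together with the routing table's generation number, and any mismatch forces a fresh lookup.

struct inp_rtcache {
	struct rtentry	*rc_rt;		/* cached route */
	struct llentry	*rc_lle;	/* cached L2 entry */
	uint32_t	 rc_genid;	/* generation number seen at caching time */
};

/*
 * Return the cached route if it is still valid, or NULL if the caller must
 * re-resolve and refill the cache.  rt_table_genid() stands in for whatever
 * counter the real patch bumps on every routing table update.
 */
static struct rtentry *
inp_rtcache_get(struct inp_rtcache *rc)
{
	if (rc->rc_rt != NULL && rc->rc_genid == rt_table_genid())
		return (rc->rc_rt);	/* fast path: no radix tree lookup */
	return (NULL);			/* stale or empty: do a full lookup */
}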
Re: ixgbe & if_igb RX ring locking
On 18 Oct 2012, at 20:09, Jack Vogel wrote:

> On Thu, Oct 18, 2012 at 6:20 AM, Andre Oppermann wrote:
>
>> On 13.10.2012 20:22, Luigi Rizzo wrote:
>>
>>> On Sat, Oct 13, 2012 at 09:49:21PM +0400, Alexander V. Chernikov wrote:
>>>> Hello list!
>>>> Packets receiving code for both ixgbe and if_igb looks like the following:
>>>>
>>>> ixgbe_msix_que
>>>> -- ixgbe_rxeof()
>>>> {
>>>>   IXGBE_RX_LOCK(rxr);
>>>>   while {
>>>>     get_packet;
>>>>     -- ixgbe_rx_input()
>>>>     {
>>>>       ++ IXGBE_RX_UNLOCK(rxr);
>>>>       if_input(packet);
>>>>       ++ IXGBE_RX_LOCK(rxr);
>>>>     }
>>>>   }
>>>>   IXGBE_RX_UNLOCK(rxr);
>>>> }
>>>>
>>>> Lines marked with ++ appeared in r209068(igb) and r217593(ixgbe).
>>>> These lines probably do LORs masking (if any) well.
>>>> However, such change introduce quite significant performance drop:
>>>> On my routing setup (nearly the same from previous -Intel 10G thread in -net)
>>>> adding lock/unlock causes 2.8MPPS decrease to 2.3MPPS which is nearly 20%.
>>>
>>> one option could be (same as it is done in the timer
>>> routine in dummynet) to build a list of all the packets
>>> that need to be sent to if_input(), and then call
>>> if_input with the entire list outside the lock.
>>>
>>> It would be even easier if we modify the various *_input()
>>> routines to handle a list of mbufs instead of just one.
>>
>> Not really. You'd just run into tons of layering complexity.
>> Somewhere the decomposition and serialization has to be done.
>>
>> Perhaps the right place is to dequeue a batch of packets from
>> the HW ring and then have a task/thread send it up the stack
>> one by one.
>
> I was thinking about how to code this, something like what I did with
> the refresh routine, in any case I will experiment with it.

This modified version for multiqueue polling creates a list of packets that are injected later (mc is the list):
http://www.gitorious.org/~fabient/freebsd/fabient-freebsd/blobs/work/pollng_mq_stable_8/sys/dev/ixgbe/ixgbe.c#line4615

> Jack
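For clarity, a minimal sketch of the batching idea discussed above (not the committed ixgbe code; the helper ixgbe_next_completed_pkt() is hypothetical): packets are chained under the RX lock and pushed into if_input() only after the lock has been dropped.

static void
ixgbe_rxeof_batched(struct rx_ring *rxr, struct ifnet *ifp)
{
	struct mbuf *head = NULL, **tail = &head, *m, *next;

	IXGBE_RX_LOCK(rxr);
	/* dequeue completed descriptors, but do not go up the stack yet */
	while ((m = ixgbe_next_completed_pkt(rxr)) != NULL) {	/* hypothetical helper */
		*tail = m;
		tail = &m->m_nextpkt;
	}
	IXGBE_RX_UNLOCK(rxr);

	/* hand the whole batch to the stack without holding the RX lock */
	for (m = head; m != NULL; m = next) {
		next = m->m_nextpkt;
		m->m_nextpkt = NULL;
		(*ifp->if_input)(ifp, m);
	}
}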
Re: [patch] reducing arp locking
On 8 Nov 2012, at 11:25, Alexander V. Chernikov wrote:

> On 08.11.2012 14:24, Andre Oppermann wrote:
>> On 08.11.2012 00:24, Alexander V. Chernikov wrote:
>>> Hello list!
>>>
>>> Currently we need to acquire 2 read locks to perform simple 6-byte
>>> copying from arp record to packet ethernet header.
>>>
>>> It seems that acquiring lle lock for fast path (main traffic flow) is
>>> not necessary even with current code.
>>>
>>> My tests shows ~10% improvement with this patch applied.
>>>
>>> If nobody objects I plan to commit this change at the end of next week.
>>
>> This is risky and prone to race conditions. The copy of the MAC address
>> should be done while the table read lock is held to protect against the
> It is done exactly as you say: table read lock is held.

How do you protect against an entry update if I hold a reference to the entry?
You can end up doing a bcopy of a partial MAC address.
The la_preempt modification is also a write access to an unlocked structure.

>> entry going away. You can either return with table lock held and drop
>> it after the copy, or you could a modified lookup function that takes a
>> pointer for the copy destination, do the copy with the read lock, and then
>> return. If no entry is found an error is returned and obviously no copy
>> is done.
>
> --
> WBR, Alexander
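A minimal sketch of the lookup variant Andre suggests, assuming a hypothetical lla_find() that locates the entry without taking the per-lle lock: the 6-byte copy happens entirely under the table read lock, so no reference to the entry ever escapes.

static int
arp_lookup_copy(struct lltable *llt, const struct sockaddr *l3addr,
    u_char dst[ETHER_ADDR_LEN])
{
	struct llentry *lle;
	int error = 0;

	IF_AFDATA_LOCK(llt->llt_ifp);		/* table read lock */
	lle = lla_find(llt, l3addr);		/* hypothetical: no lle lock taken */
	if (lle != NULL && (lle->la_flags & LLE_VALID))
		bcopy(&lle->ll_addr, dst, ETHER_ADDR_LEN);
	else
		error = ENOENT;
	IF_AFDATA_UNLOCK(llt->llt_ifp);		/* copy finished before the entry can change */

	return (error);
}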
Re: [patch] reducing arp locking
Le 9 nov. 2012 à 10:05, Alexander V. Chernikov a écrit : > On 09.11.2012 12:51, Fabien Thomas wrote: >> >> Le 8 nov. 2012 à 11:25, Alexander V. Chernikov a écrit : >> >>> On 08.11.2012 14:24, Andre Oppermann wrote: >>>> On 08.11.2012 00:24, Alexander V. Chernikov wrote: >>>>> Hello list! >>>>> >>>>> Currently we need to acquire 2 read locks to perform simple 6-byte >>>>> copying from arp record to packet >>>>> ethernet header. >>>>> >>>>> It seems that acquiring lle lock for fast path (main traffic flow) is >>>>> not necessary even with >>>>> current code. >>>>> >>>>> My tests shows ~10% improvement with this patch applied. >>>>> >>>>> If nobody objects I plan to commit this change at the end of next week. >>>> >>>> This is risky and prone to race conditions. The copy of the MAC address >>>> should be done while the table read lock is held to protect against the >>> It is done exactly as you say: table read lock is held. >> >> How do you protect from entry update if i've a ref to the entry ? >> You can end up doing bcopy of a partial mac address. > I see no problems in copying incorrect mac address in that case: > if host mac address id updated, this is, most likely, another host, and > several packets being lost changes nothing. Sending packet to a bogus mac address is not really nothing :) > > However, there can be some realistic scenario where this can be the case (L2 > load balancing/failover). I'll update in_arpinput() to do lle > removal/insertion in that case. > >> la_preempt modification is also write access to an unlocked structure. > This one changes nothing: > current code does this under _read_ lock. Under the table lock not the entry lock ? Table lock is here to protect the table if I've understood the code correctly. If i get an exclusive reference to the entry you will end up reading and writing to the entry without any lock. > >> >> >>> >>>> entry going away. You can either return with table lock held and drop >>>> it after the copy, or you could a modified lookup function that takes a >>>> pointer for the copy destination, do the copy with the read lock, and then >>>> return. If no entry is found an error is returned and obviously no copy >>>> is done. >>>> >>> >>> >>> -- >>> WBR, Alexander >>> >>> >>> ___ >>> freebsd-hack...@freebsd.org mailing list >>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers >>> To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org" >> >> > ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: [patch] reducing arp locking
Le 9 nov. 2012 à 12:18, Alexander V. Chernikov a écrit : > On 09.11.2012 13:59, Fabien Thomas wrote: >> >> Le 9 nov. 2012 à 10:05, Alexander V. Chernikov a écrit : >> >>> On 09.11.2012 12:51, Fabien Thomas wrote: >>>> >>>> Le 8 nov. 2012 à 11:25, Alexander V. Chernikov a écrit : >>>> >>>>> On 08.11.2012 14:24, Andre Oppermann wrote: >>>>>> On 08.11.2012 00:24, Alexander V. Chernikov wrote: >>>>>>> Hello list! >>>>>>> >>>>>>> Currently we need to acquire 2 read locks to perform simple 6-byte >>>>>>> copying from arp record to packet >>>>>>> ethernet header. >>>>>>> >>>>>>> It seems that acquiring lle lock for fast path (main traffic flow) is >>>>>>> not necessary even with >>>>>>> current code. >>>>>>> >>>>>>> My tests shows ~10% improvement with this patch applied. >>>>>>> >>>>>>> If nobody objects I plan to commit this change at the end of next week. >>>>>> >>>>>> This is risky and prone to race conditions. The copy of the MAC address >>>>>> should be done while the table read lock is held to protect against the >>>>> It is done exactly as you say: table read lock is held. >>>> >>>> How do you protect from entry update if i've a ref to the entry ? >>>> You can end up doing bcopy of a partial mac address. >>> I see no problems in copying incorrect mac address in that case: >>> if host mac address id updated, this is, most likely, another host, and >>> several packets being lost changes nothing. >> >> Sending packet to a bogus mac address is not really nothing :) >> >>> >>> However, there can be some realistic scenario where this can be the case >>> (L2 load balancing/failover). I'll update in_arpinput() to do lle >>> removal/insertion in that case. >>> >>>> la_preempt modification is also write access to an unlocked structure. >>> This one changes nothing: >>> current code does this under _read_ lock. >> >> Under the table lock not the entry lock ? > lle entry is read-locked while la_preempt is modified. > >> Table lock is here to protect the table if I've understood the code >> correctly. > Yes. >> If i get an exclusive reference to the entry you will end up reading and >> writing to the entry without any lock. > Yes. And the only single drawback in worst case can be sending a bit more > packets to right (but probably expired) MAC address. Or partial copy of the new mac. > > I'm talking about the following: > ARP stack is just IP -> 6 bytes mapping, there is no reason to make it > unnecessary complicated like rte, with references being held by upper layer > stack. It does not contain interface pointer, etc.. > > We may need to r/w lock entry, but for 'control plane' code only. > If one acquired exclusive lock and wants to change STATIC flag to non-static > or change lle address - this is simply wrong and has to be handled by > acquiring table wlock. > > Current ARP code has some flaws like handling arp expiration, but this patch > doesn't change much here.. In in_arpinput only exclusive access to the entry is taken during the update no IF_AFDATA_LOCK that's why i was surprised. ; > >> >>> >>>> >>>> >>>>> >>>>>> entry going away. You can either return with table lock held and drop >>>>>> it after the copy, or you could a modified lookup function that takes a >>>>>> pointer for the copy destination, do the copy with the read lock, and >>>>>> then >>>>>> return. If no entry is found an error is returned and obviously no copy >>>>>> is done. 
>>>>>> >>>>> >>>>> >>>>> -- >>>>> WBR, Alexander >>>>> >>>>> >>>>> ___ >>>>> freebsd-hack...@freebsd.org mailing list >>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers >>>>> To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org" >>>> >>>> >>> >> >> > > > -- > WBR, Alexander > > ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: [patch] reducing arp locking
Le 9 nov. 2012 à 17:43, Ingo Flaschberger a écrit : > Am 09.11.2012 15:03, schrieb Fabien Thomas: >> In in_arpinput only exclusive access to the entry is taken during the update >> no IF_AFDATA_LOCK that's why i was surprised. > > what about this: I'm not against optimizing but an API that seems clear (correct this if i'm wrong): - one lock for list modification - one RW lock for lle entry access - one refcount for ptr unref is now a lot more unclear and from my point of view dangerous. My next question is why do we need a per entry lock if we use the table lock to protect entry access? Fabien > -- > --- /usr/src/sys/netinet/if_ether.c_org 2012-11-09 16:15:43.0 + > +++ /usr/src/sys/netinet/if_ether.c 2012-11-09 16:16:37.0 + > @@ -685,7 +685,7 @@ >flags |= LLE_EXCLUSIVE; >IF_AFDATA_LOCK(ifp); >la = lla_lookup(LLTABLE(ifp), flags, (struct sockaddr *)&sin); > - IF_AFDATA_UNLOCK(ifp); > + >if (la != NULL) { >/* the following is not an error when doing bridging */ >if (!bridged && la->lle_tbl->llt_ifp != ifp && !carp_match) { > @@ -697,12 +697,14 @@ >ifp->if_addrlen, (u_char *)ar_sha(ah), ":", >ifp->if_xname); >LLE_WUNLOCK(la); > + IF_AFDATA_UNLOCK(ifp); >goto reply; >} >if ((la->la_flags & LLE_VALID) && >bcmp(ar_sha(ah), &la->ll_addr, ifp->if_addrlen)) { >if (la->la_flags & LLE_STATIC) { >LLE_WUNLOCK(la); > + IF_AFDATA_UNLOCK(ifp); >if (log_arp_permanent_modify) >log(LOG_ERR, >"arp: %*D attempts to modify " > @@ -725,6 +727,7 @@ > >if (ifp->if_addrlen != ah->ar_hln) { >LLE_WUNLOCK(la); > + IF_AFDATA_UNLOCK(ifp); >log(LOG_WARNING, "arp from %*D: addr len: new %d, " >"i/f %d (ignored)\n", ifp->if_addrlen, >(u_char *) ar_sha(ah), ":", ah->ar_hln, > @@ -763,14 +766,19 @@ >la->la_numheld = 0; >memcpy(&sa, L3_ADDR(la), sizeof(sa)); >LLE_WUNLOCK(la); > + IF_AFDATA_UNLOCK(ifp); >for (; m_hold != NULL; m_hold = m_hold_next) { >m_hold_next = m_hold->m_nextpkt; >m_hold->m_nextpkt = NULL; >(*ifp->if_output)(ifp, m_hold, &sa, NULL); >} > - } else > + } else { >LLE_WUNLOCK(la); > - } > + IF_AFDATA_UNLOCK(ifp); > +} > + } else { > + IF_AFDATA_UNLOCK(ifp); > +} > reply: >if (op != ARPOP_REQUEST) >goto drop; > -- > > Kind regards, >Ingo Flaschberger > > ___ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org" ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: [patch] reducing arp locking
Le 9 nov. 2012 à 19:55, Alexander V. Chernikov a écrit : > On 09.11.2012 20:51, Fabien Thomas wrote: >> >> Le 9 nov. 2012 à 17:43, Ingo Flaschberger a écrit : >> >>> Am 09.11.2012 15:03, schrieb Fabien Thomas: >>>> In in_arpinput only exclusive access to the entry is taken during the >>>> update no IF_AFDATA_LOCK that's why i was surprised. > I'll update patch to reflect changes discussed in previous e-mails. >>> >>> what about this: >> >> I'm not against optimizing but an API that seems clear (correct this if i'm >> wrong): >> - one lock for list modification >> - one RW lock for lle entry access >> - one refcount for ptr unref >> >> is now a lot more unclear and from my point of view dangerous. > > This can be changed/documented as the following: > - table rW lock for list modification > - table rW lock lle_addr, la_expire change > - per-lle rw lock for refcount and other fields not used by 'main path' code Yes that's fine if documented and if every access to lle_addr + la_expire is under the table lock. >> >> My next question is why do we need a per entry lock if we use the table lock >> to protect entry access? > Because there are other cases, like sending traffic to unresolved rte (arp > request send, but reply is not received, and we have to maintain packets > queue to that destination). > > .. and it seems flags handling (LLE_VALID) should be done with more care. >> >> Fabien >>> >>> ___ >>> freebsd-net@freebsd.org mailing list >>> http://lists.freebsd.org/mailman/listinfo/freebsd-net >>> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org" >> >> ___ >> freebsd-net@freebsd.org mailing list >> http://lists.freebsd.org/mailman/listinfo/freebsd-net >> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org" >> > > > > -- > WBR, Alexander > ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: request for MFC of em/igb drivers
That fix for ixgbe would also be great to commit before the release. It fixes a crash under high packet load with bpf (an mbuf freed while bpf is still analyzing it).

Fabien

patch-ixgbe-bpfcrash
Description: Binary data

> On 17.11.2010 23:39, Jack Vogel wrote:
>> Yes, everyone, I plan on updating all the drivers, there has been no
>> activity because I've tracking down a couple bugs that are tough, involving
>> days of testing to reproduce. I know we're getting close and I appreciate
>> any reports like this before.
>>
>> Stay tuned
>>
>> Jack
>
> Thanks for response. Do you play to MFC fixes before 8.2-RELEASE?
> We are in PRERELEASE state already :-)
Re: lagg/lacp poor traffic distribution
>>> Hi! >>> >>> I've loaded router using two lagg interfaces in LACP mode. >>> lagg0 has IP address and two ports (em0 and em1) and carry untagged frames. >>> lagg1 has no IP address and has two ports (igb0 and igb1) and carry >>> about 1000 dot-q vlans with lots of hosts in each vlan. >>> >>> For lagg1, lagg distributes outgoing traffic over two ports just fine. >>> For lagg0 (untagged ethernet segment with only 2 MAC addresses) >>> less than 0.07% (54Mbit/s max) of traffic goes to em0 >>> and over 99.92% goes to em1, that's bad. >>> >>> That's general traffic of several thousands of customers surfing the web, >>> using torrents etc. I've glanced over lagg/lacp sources if src/sys/net/ >>> and found nothing suspicious, it should extract and use srcIP/dstIP for >>> hash. >>> >>> How do I debug this problem? >>> >>> Eugene Grosbein >>> ___ >>> freebsd-net@freebsd.org mailing list >>> http://lists.freebsd.org/mailman/listinfo/freebsd-net >>> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org" >> >> I had this problem with igb driver, and I found, that lagg selects >> outgoing interface based on packet header flowid field if M_FLOWID field >> is set. And in the igb driver code flowid is set as >> >> #if __FreeBSD_version >= 80 >> <--><--><-->rxr->fmp->m_pkthdr.flowid = que->msix; >> <--><--><-->rxr->fmp->m_flags |= M_FLOWID; >> #endif >> >> The same thing in em driver with MULTIQUEUE >> >> That does not give enough number of flows to balance traffic well, so I >> commented check in if_lagg.c >> >> lagg_lb_start(struct lagg_softc *sc, struct mbuf *m) >> { >> <-->struct lagg_lb *lb = (struct lagg_lb *)sc->sc_psc; >> <-->struct lagg_port *lp = NULL; >> <-->uint32_t p = 0; >> >> //<>if (m->m_flags & M_FLOWID) >> //<><-->p = m->m_pkthdr.flowid; >> //<>else >> >> and with this change I have much better load distribution across interfaces. >> >> Hope it helps. > > You are perfectly right. By disabling flow usage I've obtained load sharing > close to even (final patch follows). Two questions: > > 1. Is it a bug or design problem? How many queues have you with igb? If it's one it will explain why the flowid is bad for load balancing with lagg. The problem is that flowid is good if the number of queue is = or a multiple of the number of lagg ports. > 2. Will I get problems like packet reordering by permanently disabling > usage of these flows in lagg(4)? 
> > --- if_lagg.c.orig2010-12-20 22:53:21.0 +0600 > +++ if_lagg.c 2010-12-21 13:37:20.0 +0600 > @@ -168,6 +168,11 @@ > &lagg_failover_rx_all, 0, > "Accept input from any interface in a failover lagg"); > > +int lagg_use_flows = 1; > +SYSCTL_INT(_net_link_lagg, OID_AUTO, use_flows, CTLFLAG_RW, > +&lagg_use_flows, 1, > +"Use flows for load sharing"); > + > static int > lagg_modevent(module_t mod, int type, void *data) > { > @@ -1666,7 +1671,7 @@ > struct lagg_port *lp = NULL; > uint32_t p = 0; > > - if (m->m_flags & M_FLOWID) > + if (lagg_use_flows && (m->m_flags & M_FLOWID)) > p = m->m_pkthdr.flowid; > else > p = lagg_hashmbuf(m, lb->lb_key); > --- if_lagg.h.orig2010-12-21 16:34:35.0 +0600 > +++ if_lagg.h 2010-12-21 16:35:27.0 +0600 > @@ -242,6 +242,8 @@ > int lagg_enqueue(struct ifnet *, struct mbuf *); > uint32_t lagg_hashmbuf(struct mbuf *, uint32_t); > > +extern int lagg_use_flows; > + > #endif /* _KERNEL */ > > #endif /* _NET_LAGG_H */ > --- ieee8023ad_lacp.c.orig2010-12-21 16:36:09.0 +0600 > +++ ieee8023ad_lacp.c 2010-12-21 16:35:58.0 +0600 > @@ -812,7 +812,7 @@ > return (NULL); > } > > - if (m->m_flags & M_FLOWID) > + if (lagg_use_flows && (m->m_flags & M_FLOWID)) > hash = m->m_pkthdr.flowid; > else > hash = lagg_hashmbuf(m, lsc->lsc_hashkey); > > Eugene Grosbein > ___ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org" ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: lagg/lacp poor traffic distribution
On Dec 21, 2010, at 3:00 PM, Eugene Grosbein wrote: > On 21.12.2010 19:11, Fabien Thomas wrote: > >>>> I had this problem with igb driver, and I found, that lagg selects >>>> outgoing interface based on packet header flowid field if M_FLOWID field >>>> is set. And in the igb driver code flowid is set as >>>> >>>> #if __FreeBSD_version >= 80 >>>> <--><--><-->rxr->fmp->m_pkthdr.flowid = que->msix; >>>> <--><--><-->rxr->fmp->m_flags |= M_FLOWID; >>>> #endif >>>> >>>> The same thing in em driver with MULTIQUEUE >>>> >>>> That does not give enough number of flows to balance traffic well, so I >>>> commented check in if_lagg.c >>>> >>>> lagg_lb_start(struct lagg_softc *sc, struct mbuf *m) >>>> { >>>> <-->struct lagg_lb *lb = (struct lagg_lb *)sc->sc_psc; >>>> <-->struct lagg_port *lp = NULL; >>>> <-->uint32_t p = 0; >>>> >>>> //<>if (m->m_flags & M_FLOWID) >>>> //<><-->p = m->m_pkthdr.flowid; >>>> //<>else >>>> >>>> and with this change I have much better load distribution across >>>> interfaces. >>>> >>>> Hope it helps. >>> >>> You are perfectly right. By disabling flow usage I've obtained load sharing >>> close to even (final patch follows). Two questions: >>> >>> 1. Is it a bug or design problem? >> >> How many queues have you with igb? If it's one it will explain why the >> flowid is bad for load balancing with lagg. > > How do I know? I've read igb(4) manual page and found no words vmstat -i will show the queue (intr for the queue) normally it's the number of CPU available. > about queues within igb, nor I have any knowledge about them. > >> The problem is that flowid is good if the number of queue is = or a multiple >> of the number of lagg ports. > > Now I see, thanks. > > Eugene Grosbein ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: lagg/lacp poor traffic distribution
On Dec 21, 2010, at 3:48 PM, Eugene Grosbein wrote:

> On 21.12.2010 20:41, Fabien Thomas wrote:
>
>>>>> 1. Is it a bug or design problem?
>>>>
>>>> How many queues have you with igb? If it's one it will explain why the
>>>> flowid is bad for load balancing with lagg.
>>>
>>> How do I know? I've read igb(4) manual page and found no words
>> vmstat -i will show the queue (intr for the queue) normally it's the number
>> of CPU available.
>
> # vmstat -i
> interrupt                          total       rate
> irq5: uart2                            8          0
> irq18: ehci0 uhci5+                    2          0
> irq19: uhci2 uhci4+                 2182          0
> irq23: uhci3 ehci1                   124          0
> cpu0: timer                     39576224       1993
> irq256: em0:rx 0               115571349       5822
> irq257: em0:tx 0               136632905       6883
> irq259: em1:rx 0               115829181       5835
> irq260: em1:tx 0               138838991       6994
> irq262: igb0:que 0             157354922       7927
> irq263: igb0:que 1                577369         29
> irq264: igb0:que 2                280207         14
> irq265: igb0:que 3                241826         12
> irq266: igb0:link                      2          0
> irq267: igb1:que 0             164620363       8293
> irq268: igb1:que 1                238678         12
> irq269: igb1:que 2                248478         12
> irq270: igb1:que 3                762453         38
> irq271: igb1:link                      3          0
> cpu2: timer                     39576052       1993
> cpu3: timer                     39576095       1993
> cpu1: timer                     39575913       1993
> Total                          989503327      49849
>
> It seems I have four queues per igb card but only one of them works?

Yes. Jack will certainly confirm, but it seems that the RSS hash does not take the VLAN tag into account and traffic defaults to queue 0?

> Eugene Grosbein
Re: lagg/lacp poor traffic distribution
On Dec 22, 2010, at 6:55 PM, Eugene Grosbein wrote:

> On 21.12.2010 21:57, Fabien Thomas wrote:
>
>>> irq262: igb0:que 0             157354922       7927
>>> irq263: igb0:que 1                577369         29
>>> irq264: igb0:que 2                280207         14
>>> irq265: igb0:que 3                241826         12
>>> irq266: igb0:link                      2          0
>>> irq267: igb1:que 0             164620363       8293
>>> irq268: igb1:que 1                238678         12
>>> irq269: igb1:que 2                248478         12
>>> irq270: igb1:que 3                762453         38
>>> irq271: igb1:link                      3          0
>>> cpu2: timer                     39576052       1993
>>> cpu3: timer                     39576095       1993
>>> cpu1: timer                     39575913       1993
>>> Total                          989503327      49849
>>>
>>> It seems I have four queues per igb card but only one of them works?
>>
>> Yes.
>>
>> Jack will certainly confirm but it seems that RSS hash does not seems to
>> take vlan in account and default to queue0 ?
>
> I've just read "Microsoft Receive-Side Scaling" documentation,
> http://download.microsoft.com/download/5/d/6/5d6eaf2b-7ddf-476b-93dc-7cf0072878e6/ndis_rss.doc
>
> RSS defines that hash function may take IP and optionally port numbers only,
> not vlan tags.
> In case of PPPoE-only traffic this card's ability to classify traffic voids.
> Then, unpatched lagg fails to share load over outgoing interface ports.
>
> It seems, we really need sysctl disabling lagg's use of flows, don't we?

Yes, I think it is necessary to be able to disable it, because it cannot always be optimal.
One improvement on the queue-count side would be to hash the queue id before the modulo; see the sketch after this message.

> Eugene Grosbein
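A minimal sketch, assuming the hash32_buf() helper from sys/hash.h that lagg_hashmbuf() already builds on, of the improvement mentioned above: mixing the queue-derived flowid before the modulo so that a handful of small queue ids still spreads across all lagg ports.

static uint32_t
lagg_port_index(struct lagg_softc *sc, struct mbuf *m, uint32_t key)
{
	uint32_t p;

	if (lagg_use_flows && (m->m_flags & M_FLOWID))
		/* small integers (queue ids) distribute poorly; mix them first */
		p = hash32_buf(&m->m_pkthdr.flowid,
		    sizeof(m->m_pkthdr.flowid), key);
	else
		p = lagg_hashmbuf(m, key);

	return (p % sc->sc_count);	/* index of the outgoing lagg port */
}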
Polling with multiqueue support
Ryan's post on the new polling modification reminded me to post a quick note about the polling-with-multiqueue support I did some months ago.

The code is more intrusive and adds a new handler per queue. The handling of the network is nearly the same as the deferred taskqueue in the drivers. There are now two passes: one for receive and one for transmit (to flush pending transmits once all receive passes are done).

The main gain is for packet forwarding with more than one interface. CPUs can easily be reserved for applications by binding a specific number of cores to the network. Performance is on par with interrupts on 10Gb or 1Gb interfaces, and latency can be reduced by using a higher HZ. Most of the time, using fewer cores achieves higher global efficiency of the system by freeing CPU cycles and reducing contention.

Example setup: a 6-core CPU, 2 ixgbe with 3 queues, 4 igb with 3 queues, and 3 cores used for polling (a minimal sketch of this mapping follows below):
CPU0 will handle ixgbe0 queue 0, ixgbe1 queue 0, igb0 queue 0, ...
CPU1 will handle ixgbe0 queue 1, ...
...

For those interested, a test branch based on 8.x with ixgbe, igb and em modifications can be found here:
http://www.gitorious.org/~fabient/freebsd/fabient-sandbox/commits/work/pollng_mq_stable_8

Extracted patchset here:
http://people.freebsd.org/~fabient/patch-poll_mq-20110202-stable_8
http://people.freebsd.org/~fabient/kern_poll.c-20110202 -> put to kern/kern_poll.c

--
Fabien Thomas
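The mapping in the example setup above can be expressed as a one-liner; this is only an illustration of the distribution scheme, not code from the patchset.

/* Queue i of every registered interface is served by polling CPU i (mod n). */
static int
poll_cpu_for_queue(int queue_id, int first_poll_cpu, int n_poll_cpus)
{
	return (first_poll_cpu + (queue_id % n_poll_cpus));
}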
Re: Polling with multiqueue support
On Feb 24, 2011, at 4:39 PM, Ryan Stone wrote:

> Ah, you've anticipated me. This is just the kind of thing that I had
> in mind. I have some comments:

Thanks for your feedback. You pushed me out of my laziness to explain the patchset on the mailing list. :)

> - Why allocate the poll_if_t in ether_poll_register_? If you let the
> driver allocate it you don't have to worry about failure. And the
> driver can embed it in its rx_ring so it doesn't have to worry about
> malloc'ing it anyway. We can also put one it struct ifnet to preserve
> the traditional ether_poll_register interface.

Good point. I'll add it to my TODO list.

> - I'd personally prefer it if ether_poll_register_mq didn't require a
> struct ifnet to be passed in. Nothing seems to use the ifnet anymore
> and here at $(WORK) we have some non-ifnets that actually register a
> polling handler. Currently they just allocate a struct ifnet to make
> the polling code happy but I don't see any reason to require one.

Just to be sure I understand: use the context + queue id only as the identifier for the multiqueue part, and grab the ifp from the context in the driver? That seems OK to me if it covers the case where you don't have an ifp.

> - Also, I don't quite understand why the separate TX step is necessary now.

It helps because TX is done only when every interface (on this taskqueue; cross-taskqueue would require a sync point) has processed its packets to completion. It also helps fairness between interfaces on the same taskqueue, by rotating faster to the next interface. It is not required and can be used or not on a per-driver basis (if not used, everything can be done in the RX pass).

There is also one fix pending for the compatibility interface: the packets-per-round value needs to be increased because there is no feedback loop in the old API.

Fabien
Re: Polling with multiqueue support
Just an update to point to another old patch that enables the flowtable on the forwarding path to increase performance (reduce contention) and be on par with Linux:

http://people.freebsd.org/~fabient/FreeBSDvsLinux10GB.png
(forwarding 256B packets, % of line rate on 2x10Gb 82599 interfaces with 1x Xeon W3680)
http://people.freebsd.org/~fabient/patch-flowtable-forward

Coupled with the polling code it performs quite well.

One last thing, a latency / polling overhead test result:
http://people.freebsd.org/~fabient/polllatency.png

"User app" is the time it takes to run a CPU-bound benchmark (lower is better); the network load is fixed at a high level but leaves some CPU available. "Freq" is the HZ value for polling, or the measured interrupt frequency for that load. Latency is measured with a Spirent STC.

Fabien
Hello
Hi Kip,

Feels good to see you again!

Fabien
Re: Hello
Sorry for the noise, I missed the destination...

> Hi Kip,
>
> Feels good to see you again!
>
> Fabien
m_getjcl and packet zone
Hi all,

While doing some 10Gb benchmarks I found that m_getjcl() does not benefit from the packet zone.

There is a ~80% increase in FPS when applying the following patch.
256B frames, driver to driver / stable_8:
- 3 765 066 FPS
- 6 868 153 FPS with the patch applied.

Is there a good reason not to commit this?

Fabien

diff --git a/sys/sys/mbuf.h b/sys/sys/mbuf.h
index 158edb4..95a44a4 100644
--- a/sys/sys/mbuf.h
+++ b/sys/sys/mbuf.h
@@ -523,6 +523,9 @@ m_getjcl(int how, short type, int flags, int size)
 	struct mbuf *m, *n;
 	uma_zone_t zone;
 
+	if (size == MCLBYTES)
+		return m_getcl(how, type, flags);
+
 	args.flags = flags;
 	args.type = type;
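For context, a typical driver call site that benefits from the change: with the patch, a 2KB request made through the jumbo API like the one below is served from the packet zone instead of the plain cluster zone (illustrative snippet, not taken from a specific driver).

	struct mbuf *m;

	/* RX buffer refill: a 2KB cluster requested through the jumbo API */
	m = m_getjcl(M_DONTWAIT, MT_DATA, M_PKTHDR, MCLBYTES);
	if (m == NULL)
		return (ENOBUFS);
	m->m_len = m->m_pkthdr.len = MCLBYTES;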
Re: TCP loopback socket fusing
Great, This will maybe kill the long time debate about "my loopback is slow vs linux" To have the best of both world what about a socket option to enable/disable fusing: can be useful when you need to see some connection "packetized". Fabien On 13 sept. 2010, at 13:33, Andre Oppermann wrote: > When a TCP connection via loopback back to localhost is made the whole > send, segmentation and receive path (with larger packets though) is still > executed. This has some considerable overhead. > > To short-circuit the send and receive sockets on localhost TCP connections > I've made a proof-of-concept patch that directly places the data in the > other side's socket buffer without doing any packetization and other protocol > overhead (like UNIX domain sockets). The connections setup (SYN, SYN-ACK, > ACK) and shutdown are still handled by normal TCP segments via loopback so > that firewalling stills works. The actual payload data during the session > won't be seen and the sequence numbers don't move other than for SYN and FIN. > The sequence are remain valid though. Obviously tcpdump won't see any data > transfers either if the connection has fused sockets. > > Preliminary testing (with WITNESS and INVARIANTS enabled) has shown stable > operation and a rough doubling of the throughput on loopback connections. > I've tested most socket teardown cases and it behaves fine. I'm not entirely > sure I've got all possible path's but the way it is integrated should properly > defuse the sockets in all situations. > > Testers and feedback wanted: > > http://people.freebsd.org/~andre/tcp_loopfuse-20100913.diff > > -- > Andre > > ___ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org" ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
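Purely as an illustration of the per-connection knob Fabien suggests: the option name and value below are invented for this sketch; they do not exist in the patch, which only provides a global sysctl.

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#define TCP_NOFUSE	0x400	/* hypothetical option, not a real FreeBSD constant */

/* Keep this one connection fully packetized even over loopback. */
static int
disable_fusing(int s)
{
	int on = 1;

	return (setsockopt(s, IPPROTO_TCP, TCP_NOFUSE, &on, sizeof(on)));
}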
Re: TCP loopback socket fusing
On 14 sept. 2010, at 17:41, Andre Oppermann wrote: > On 14.09.2010 11:18, Fabien Thomas wrote: >> Great, >> >> This will maybe kill the long time debate about "my loopback is slow vs >> linux" >> To have the best of both world what about a socket option to enable/disable >> fusing: >> can be useful when you need to see some connection "packetized". > > A sysctl to that effect is already in the patch. yes, i'm just wondering on a per connection setting. > > -- > Andre > >> Fabien >> >> On 13 sept. 2010, at 13:33, Andre Oppermann wrote: >> >>> When a TCP connection via loopback back to localhost is made the whole >>> send, segmentation and receive path (with larger packets though) is still >>> executed. This has some considerable overhead. >>> >>> To short-circuit the send and receive sockets on localhost TCP connections >>> I've made a proof-of-concept patch that directly places the data in the >>> other side's socket buffer without doing any packetization and other >>> protocol >>> overhead (like UNIX domain sockets). The connections setup (SYN, SYN-ACK, >>> ACK) and shutdown are still handled by normal TCP segments via loopback so >>> that firewalling stills works. The actual payload data during the session >>> won't be seen and the sequence numbers don't move other than for SYN and >>> FIN. >>> The sequence are remain valid though. Obviously tcpdump won't see any data >>> transfers either if the connection has fused sockets. >>> >>> Preliminary testing (with WITNESS and INVARIANTS enabled) has shown stable >>> operation and a rough doubling of the throughput on loopback connections. >>> I've tested most socket teardown cases and it behaves fine. I'm not >>> entirely >>> sure I've got all possible path's but the way it is integrated should >>> properly >>> defuse the sockets in all situations. >>> >>> Testers and feedback wanted: >>> >>> http://people.freebsd.org/~andre/tcp_loopfuse-20100913.diff >>> >>> -- >>> Andre >>> >>> ___ >>> freebsd-net@freebsd.org mailing list >>> http://lists.freebsd.org/mailman/listinfo/freebsd-net >>> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org" >> >> >> > ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
For your information, we have measured 730Kpps using pollng and fastforwarding with 64-byte frames without loss (<0.001% packet loss) on a Spirent SmartBits (Pentium D 2.8GHz + 8x gigabit em).

You can find the code and some performance reports at:
http://www.netasq.com/opensource/pollng-rev1-freebsd.tgz

The best performance / CPU cost ratio is to use 1 core only; the other cores are then free to do application processing.

Fabien
Missing fix for fxp driver (FreeBSD 6.x)
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/fxp/if_fxp.c.diff?r1=1.217.2.15;r2=1.217.2.16;f=h

This fix is really necessary (it resolves a deadlock of the interface in case of cluster shortage) and is not committed in 6.x (but is committed in RELENG_5, RELENG_7 and HEAD).

Regards,
Fabien
Re: Missing fix for fxp driver (FreeBSD 6.x)
Sorry for the noise... I made a mistake mixing up my local patch, cvsweb and the commit info, which seems to fix the same problem but with different code.

If I were to rewrite my initial mail: the fxp deadlock (fixed on HEAD by this commit: http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/fxp/if_fxp.c.diff?r1=1.266;r2=1.267) can be easily reproduced, so maybe it can be MFCed to 6.4 and 7.1?

When the interface is deadlocked, the only way to recover is to do an ifconfig up.

Fabien

> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/fxp/if_fxp.c.diff?r1=1.217.2.15;r2=1.217.2.16;f=h
> This fix is really necessary (deadlock of the interface in case of cluster shortage) and not committed in 6.x (but committed in RELENG_5, RELENG_7 and HEAD)
> Regards,
> Fabien
Re: Interrupts + Polling mode (similar to Linux's NAPI)
To share my results: at work I have made modifications to the polling code to do SMP polling (previously posted to this mailing list).

SMP polling (a dynamic group of interfaces bound to each CPU) does not significantly improve throughput (lock contention seems to be the cause here). The main advantage of polling with modern interfaces is not the PPS (which is nearly the same) but the global efficiency of the system when using multiple interfaces (which is the case for a firewall).

The best configuration we have found with FreeBSD 6.3 is to do polling on one CPU and keep the other CPUs free for other processing. In this configuration the whole system is more efficient than with interrupts, where all the CPUs are busy processing interrupt threads.

Regards,
Fabien
Re: Interrupts + Polling mode (similar to Linux's NAPI)
On 28 Apr 2009, at 11:04, Paolo Pisati wrote:

> Fabien Thomas wrote:
>> To share my results: I have done at work modification to the polling code
>> to do SMP polling (previously posted to this ml). SMP polling (dynamic
>> group of interface binded to CPU) does not significantly improve the
>> throughput (lock contention seems to be the cause here). The main
>> advantage of polling with modern interface is not the PPS (which is nearly
>> the same) but the global efficiency of the system when using multiple
>> interfaces (which is the case for Firewall). The best configuration we
>> have found with FreeBSD 6.3 is to do polling on one CPU and keep the other
>> CPU free for other processing. In this configuration the whole system is
>> more efficient than with interrupt where all the CPU are busy processing
>> interrupt thread.
>
> out of curiosity: did you try polling on 4.x? i know it doesn't "support"
> SMP over there, but last time i tried polling on 7.x (or was it 6.x? i
> don't remember...) i found it didn't gave any benefit, while switching
> the system to 4.x showed a huge improvement.

Yes. Rewriting the core polling code was half of the work, because the polling code on 6.x and up performs badly (in our environment). Today 4.x is unbeatable regarding network performance (compared to 6.2 -> 7.0 at least; I need to do more tests on 7-STABLE and 8). The other half of the work was to explore the SMP scaling of the polling code, to regain what we lose with the fine-grained SMP kernel.

> --
> bye,
> P.
Re: Interrupts + Polling mode (similar to Linux's NAPI)
I have done at work modification to the polling code to do SMP polling (previously posted to this ml). SMP polling (dynamic group of interface binded to CPU) does not significantly improve the throughput (lock contention seems to be the cause here). The main advantage of polling with modern interface is not the PPS (which is nearly the same) but the global efficiency of the system when using multiple interfaces (which is the case for Firewall). The best configuration we have found with FreeBSD 6.3 is to do polling on one CPU and keep the other CPU free for other processing. In this configuration the whole system is more efficient than with interrupt where all the CPU are busy processing interrupt thread. out of curiosity: did you try polling on 4.x? i know it doesn't "support" SMP over there, but last time i tried polling on 7.x (or was it 6.x? i don't remember...) i found it didn't gave any benefit, while switching the system to 4.x showed a huge improvement. yes rewriting the core polling code started at half because the polling code on 6.x and up perform badly (in our env) regarding performance. today 4.x is unbeatable regarding network perf (6.2 -> 7.0 at least, i need to do more test on 7_stable and 8). the other half of the work was to explore the SMP scaling of the polling code to gain what we loose with fine grained SMP kernel. The problem with all of this "analysis" is that it assumes that SMP coding scales intuitively; when the opposite is actually true. What you fail to address is the basic fact that moderated interrupts (ie holding off interrupts to a set number of ints/second) is exactly the same as polling, as on an active system you'll get exactly X interrupts per second at equal intervals. So all of this chatter about polling being more efficient is simply bunk. I agree with you with one interface. When you use ten interface it is not the case. The truth is that polling requires additional overhead to the system while interrupts do not. So if polling did better for you, its simply because either 1) The polling code in the driver is better or 2) You tuned polling better than you tuned interrupt moderation. Barney ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: pf and vimage
Thanks very useful! Do you have an "official" page to look for update. What do you think of putting it on the FreeBSD Wiki? Fabien Le 20 août 09 à 18:17, Julian Elischer a écrit : there were some people looking at adding vnet support to pf. Since we discussed it last, the rules of the game have significantly changed for the better. With the addition of some new facilitiesin FreeBSD, the work needed to virtualize a module has significantly decreased. The following doc gives the new rules.. August 17 2009 Julian Elischer === Vimage: what is it? === Vimage is a framework in the BSD kernel which allows a co-operating module to operate on multiple independent instances of its state so that it can participate in a virtual machine / virtual environment scenario. It refers to a part of the Jail infrastructure in FreeBSD. For historical reasons "Virtual network stack enabled jails"(1) are also known as "vimage enabled jails"(2) or "vnet enabled jails"(3). The currently correct term is the latter, which is a contraction of the first. In the future other parts of the system may be virtualized using the same technology and the term to cover all such components would be VIMAGE enhanced modules. The implementation approach taken by the vimage framework is a redefinition of selected global state variables to evaluate to constructs that allow for the virtualized state to be stored and resolved in appropriate instances of 'jail' specific container storage regions. The code operating on virtualized state has to conform to a set of rules described further below. Among other things in order to allow for all the changes to be conditionally compilable. i.e. permitting the virtualized code to fall back to operation on global state. The rest of this document will discuss NETWORK virtualization though the concepts may be true in the future for other parts of the system. The most visible change throughout the existing code is typically replacement of direct references to global variables with macros; foo_bar thus becomes V_foo_bar. V_foo_bar macros will resolve back to the foo_bar global in default kernel builds, and alternatively to the logical equivalent of some_base_pointer->_foo_bar for "options VIMAGE" kernel configs. Prepending of "V_" prefixes to variable references helps in visual discrimination between global and virtualized state. It is also possible to use an alternative syntax, of VNET(foo_bar) to achieve the same thing. The developers felt that V_foo_bar was less visually distracting while still providing enough clues to the reader that the variable is virtualized. In fact the V_foo_bar macro is locally defined near the definition of foo_bar to be an alias for VNET(foo_bar) so the two are not only equivalent, they are the same. The framework also extends the sysctl infrastructure to support access to virtualized state through introduction of the SYSCTL_VNET family of macros; those also automatically fall back to their standard SYSCTL counterparts in default kernel builds. Transparent libkvm(3) lookups are provided to virtualized variables which permits userland binaries such as netstat to operate unmodified on "options VIMAGE" kernels, though this may have some security implications. Vnets are associated with jails. In 8.0, every process is associated with a jail, usually the default (null) jail, and jails currently hang off of a processes ucred. This relationship defines a process's administrative affinity to a vnet and thus indirectly to all of its state. 
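As a hedged illustration of the convention described above, based on the vnet.h macros that shipped with FreeBSD 8 (exact macro spellings may differ between versions):

#include <net/vnet.h>

/* A former global becomes per-vnet storage; V_foo_bar is the usual alias. */
VNET_DEFINE(int, foo_bar);
#define	V_foo_bar	VNET(foo_bar)

SYSCTL_VNET_INT(_net, OID_AUTO, foo_bar, CTLFLAG_RW,
    &VNET_NAME(foo_bar), 0, "per-vnet example variable");

static void
foo_bar_bump(struct thread *td)
{
	/* set the vnet context before touching virtualized state */
	CURVNET_SET(TD_TO_VNET(td));
	V_foo_bar++;
	CURVNET_RESTORE();
}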
All network interfaces and sockets hold pointers back to their associated vnets. This relationship is obviously entirely independent from proc->ucred- >jail bindings. Hence, when a process opens a socket, the socket will get bound to a vnet instance hanging off of proc->ucred->jail->vnet, but once such a socket->vnet binding gets established, it cannot be changed for the entire socket lifetime. The mapping of a from a thread to a vnet should always be done via the TD_TO_VNET macro as the path may change in the future as we get more experience with using the system. Certain classes of network interfaces (Ethernet in particular) can be reassigned from one vnet to another at any time. By definition all vnets are independent and can communicate only if they are explicitly provided with communication paths. Currently mainly netgraph is used to establish inter-vnet datapaths, though other paths are being explored such as the 'epair' back-to-back virtual interface pair, in which the different sides may exist in different jails. In network traffic processing the vnet affinity is defined either by the inbound interface or by the socket / pcb -> vnet binding. However, there are many functions in the network stack that cannot implicitly fetch the vnet context from their standard arguments. Instead of explicitly extending argument lists of
new version of polling for FreeBSD 6.x
Hi,

After many years of good service we will stop using FreeBSD 4.x :)

During my performance regression tests under FreeBSD 6.2 I found that polling has lower performance than interrupts. To solve that issue I have rewritten the core of polling to be more SMP ready.

You can find a summary of all my tests and the source code at the following address:
http://www.netasq.com/opensource/pollng-rev1-freebsd.tgz

Feel free to ask for more detailed information if necessary, and to report any bugs / comments.

Fabien
Re: new version of polling for FreeBSD 6.x
> Hi,
> This is really interesting work! Reading the pdf file, it seems
> forwarding performance on 6 and 7 is still much lower than RELENG_4?
> Is that correct?
>
> ---Mike

Thanks,

Yes, it is still slower, but as you can see in the graph (programming cost), just adding a mutex drops the rate, and we have some on the forwarding path.

We have beaten FreeBSD 4.x with pollng on 2 cores, with the best throughput at 7089Mb/s, but only when the test lasts 10s => maybe a periodic task that gets some CPU time.

One really interesting thing is that FreeBSD 7.x can have great performance: it performs slower than FreeBSD 6.x when using one CPU (4437 vs 5017) but better when using 2 CPUs (5214 vs 5026).

While reading the pdf I discovered a mistake in the loss percentage: it is 0.001% and not 0.0001%.

Fabien

>> You can find a summary of all my tests and the source code at the following address:
>> http://www.netasq.com/opensource/pollng-rev1-freebsd.tgz
>> Feel free to ask more detailed information if necessary and report any bugs / comments.
>> Fabien
>
> Mike Tancsa, Sentex communications http://www.sentex.net
> Providing Internet Access since 1994 [EMAIL PROTECTED], (http://www.tancsa.com)
Re: new version of polling for FreeBSD 6.x
On 8 Sep 2007, at 01:05, Andre Oppermann wrote:

> Mike Tancsa wrote:
>> On Thu, 6 Sep 2007 15:12:06 +0200, in sentex.lists.freebsd.net you wrote:
>>> After many years of good services we will stop using FreeBSD 4.x :)
>>> During my performance regression tests under FreeBSD 6.2 i've found that
>>> polling has lower performance than interrupt. To solve that issue i've
>>> rewritten the core of polling to be more SMP ready.
>>
>> Hi,
>> This is really interesting work! Reading the pdf file, it seems
>> forwarding performance on 6 and 7 is still much lower than RELENG_4?
>> Is that correct?
>
> Haven't tested RELENG_4 performance in a controlled environment and thus
> can't answer the question directly. However using fastforward on 6 and 7
> is key to good performance. Without it you're stuck at some 150-200kpps,
> perhaps 300kpps. With it you get to 500-800kpps.

Using net.isr.direct is the key to success and gets a much better forwarding rate (the intermediate queue kills the performance).

I agree that using fastforwarding brings another big step, because there is a lot less code than in the full IP stack:

FreeBSD 6.2 using fastforward on 64-byte packets (L3 Mb/s):
pollng 1CPU:         156
pollng 2CPU:         123
intr:                144
pollng 1CPU fastfwd: 221
pollng 2CPU fastfwd: 270
intr fastfwd:        211

Fabien

> --
> Andre
>
>> ---Mike
>>
>>> You can find a summary of all my tests and the source code at the following address:
>>> http://www.netasq.com/opensource/pollng-rev1-freebsd.tgz
>>> Feel free to ask more detailed information if necessary and report any bugs / comments.
>>> Fabien
>>
>> Mike Tancsa, Sentex communications http://www.sentex.net
>> Providing Internet Access since 1994 [EMAIL PROTECTED], (http://www.tancsa.com)
Re: new version of polling for FreeBSD 6.x
> Haven't tested RELENG_4 performance in a controlled environment and thus
> can't answer the question directly. However using fastforward on 6 and 7
> is key to good performance. Without it you're stuck at some 150-200kpps,
> perhaps 300kpps. With it you get to 500-800kpps.

To show that pps is mainly related to CPU frequency (with high-end components): FreeBSD 6.2, packet size is 64 bytes, values are L3 Mb/s; between the two setups only the CPU changes.

Xeon Woodcrest 2.33GHz vs 3.00GHz (frequency ratio 1,287):
pollng 1CPU:            210    257    1,224
pollng 2CPU:            329    396    1,204
pollng 1CPU fastfwd:    291    364    1,251
pollng 2CPU fastfwd:    455    536    1,178

Warning: this is not the same hardware as in the pdf.
Re: new version of polling for FreeBSD 6.x
Hello Fabien, Hello :) 1- I have noticed you are not using GENERIC config file, can you provide us more information on how your KERNCONF differs from GENERIC ? I am pretty sure you have removed all the debug OPTIONs from the kernel, isn't it ? It's a GENERIC kernel conf with polling and SMP activated. With FreeBSD 7.x i've removed witness and invariant. 2- Did you get a chance to try Jumbo Frames and evaluate jumbo mbufs (I never success to make them work for me, did someone had more chance ?). In any cases, PPS values are important for such tests. Andre is right, with Fast Forward you get the best perfs for such test. No i havent done any jumbo frame test, maybe on next try. Yes fastforward is better but my goal was to stress the IP stack so i've not integrated the fastforward result in the pdf. (but you can find some results in my reply to Andre). 3- Did you monitor the DUT to see where CPU cycles were spend in each test ? Not during the real test. Profiling using hwpmc and LOCK_PROFILING have been done under the same condition but ignoring results. hwpmc use the callgraph patch published by Joseph Koshy. 4- Have you considered measuring the time it takes for an interrupt to be handled and processed by the kernel Bottom/Top Half and ISR ? [1] No 5- When I have performed some test using a Spirent SmartBits (SmartFlow) last summer I got the following results [2]. (For comparison purposes) It's really difficult to compare. For all my test i'm always using a reference hardware (not too powerful to be in the range of test tools). 6- In the test with Spirent Avalanche, you are using Lighttpd as webserver, did you enable kqueue ? how many workers ? You are using HTTP 1.0 wo Keep-Alive, what was your net.inet.tcp.msl MIB's value ? The goal of the application test was simple: i've pollng that works better than interrupt in all forwarding case but is my socket application will works better ? For that i've just installed the port with the default config (log disabled, default is one worker). The result on this test show that polling is a great benefit to network application vs interrupt (near than two times more connections per seconds). 7- Polling is known to introduce higher latency, I would expect its benefits to be less in 7-CURRENT compared to 6.x since (Scott ?) a FAST INTR Handler has been introduced around a year ago. Yes it cost more in term of packet latency (where FreeBSD 4.11 was better than 6.x / 7.x in all mode) but under high pps with interrupt the DUT is unresponsive (ithread, filter, em fastintr). Nonetheless, what you report sounds like a perf regression...Have you filled a PR ? Luigi might have a good explaination here. :-D For us polling have always worked better than interrupt under FreeBSD 4.x, under FreeBSD 6.x it is not the case and under one of my application benchmark you can see that it really have a problem to sustain the load. Behind the new model there is more than a regression fix: I think it will scale better to SMP and provide a good acceleration for packet inspection. 8- Lock profiling information were obtanied through KTR ? no: LOCK_PROFILING 9- I was wondering if you have explored Intr CPU affinity [3] and IRQ Swizzling [4] ? No, but one idea is the hybrid mode used under Solaris (intr when load is low and polling). 
Thanks for your efforts and your valuable contribution, best regards, /Olivier

Kind regards, Fabien

[1] http://xview.net/papers/interrupt_latency_scheduler/ (WARNING: I never found the time to finish and publish that doc) && http://xview.net/research/RTBench/
[2] http://xview.net/papers/jumbo_mbufs_freebsd/
[3] http://www.intel.com/cd/ids/developer/asmo-na/eng/188935.htm?prn=Y
[4] http://download.intel.com/design/chipsets/applnots/31433702.pdf

-- Olivier Warin
pollng: pcap bench
Result of the pcap benchmark requested by Vlad Galu: using polling is better.

Test setup:
---
netblast -- em|fxp -- pcap_bmark, under FreeBSD 6.2

Small product (fxp interface):
---
pollng:
Captured 30322.00 pps (total of 333542) and dropped 144
Captured 30358.45 pps (total of 333943) and dropped 219
Captured 30253.18 pps (total of 332785) and dropped 151
Captured 30276.82 pps (total of 333045) and dropped 88
Captured 30362.64 pps (total of 333989) and dropped 369

intr:
Captured 0.01 pps (total of 6877442) and dropped 6876215
Completely stuck in intr mode, so the period takes more than 10 s.

Large product (em interface):
---
pollng:
Captured 114669.09 pps (total of 1261360) and dropped 0
Captured 115263.18 pps (total of 1267895) and dropped 0
Captured 115226.45 pps (total of 1267491) and dropped 0
Captured 115003.64 pps (total of 1265040) and dropped 0

intr:
Captured 99091.91 pps (total of 1090011) and dropped 629467
Captured 105180.64 pps (total of 1156987) and dropped 617526
Captured 99722.36 pps (total of 1096946) and dropped 607367
Captured 104180.91 pps (total of 1145990) and dropped 626567
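For reference, here is a minimal sketch of a capture-rate benchmark in the spirit of the pcap_bmark tool above (its real source was not posted, so everything below is an assumed reconstruction): it counts packets over a 10-second window and reports pps together with the kernel drop counter from pcap_stats().

/* build with: cc -o pcap_bmark pcap_bmark.c -lpcap */
#include <pcap.h>
#include <stdio.h>
#include <time.h>

static unsigned long count;

static void
handler(u_char *user, const struct pcap_pkthdr *h, const u_char *bytes)
{
	count++;
}

int
main(int argc, char **argv)
{
	char errbuf[PCAP_ERRBUF_SIZE];
	struct pcap_stat ps;
	pcap_t *p;
	time_t start;

	if (argc < 2) {
		fprintf(stderr, "usage: %s ifname\n", argv[0]);
		return (1);
	}
	/* 68-byte snaplen, promiscuous, 100 ms read timeout */
	p = pcap_open_live(argv[1], 68, 1, 100, errbuf);
	if (p == NULL) {
		fprintf(stderr, "pcap_open_live: %s\n", errbuf);
		return (1);
	}
	start = time(NULL);
	while (time(NULL) - start < 10)		/* one 10-second period */
		pcap_dispatch(p, -1, handler, NULL);
	if (pcap_stats(p, &ps) == 0)
		printf("Captured %.2f pps (total of %lu) and dropped %u\n",
		    count / 10.0, count, ps.ps_drop);
	pcap_close(p);
	return (0);
}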
bge driver does not work with 3Com 3C996-SX / 3C996B-T
I have some problems with the bge driver and the 3Com 3C996-SX fiber card and 3C996B-T copper card under -stable:

The fiber card is detected correctly, but the link does not come up (I've tested the same card between two Win2K machines and it works fine). The copper card is detected, but the link goes up/down and sometimes locks the machine (a reboot is needed to recover) when I start a 'ping -i0 -q'.

Has anyone experienced the same problems?

For the missing splx: I think I've found a new one, in bge_init:

static void
bge_init(xsc)
	void *xsc;
{
	struct bge_softc *sc = xsc;
	struct ifnet *ifp;
	u_int16_t *m;
	int s;

	s = splimp();

	ifp = &sc->arpcom.ac_if;

	if (ifp->if_flags & IFF_RUNNING)	/* --> missing splx ? */
		return;

Fabien
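For what it's worth, the fix being hinted at would probably look like the sketch below (untested, just the obvious change): drop back to the previous spl before the early return, so bge_init() cannot leave the system at splimp().

	if (ifp->if_flags & IFF_RUNNING) {
		splx(s);
		return;
	}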
Re[2]: bge driver issue
I have the same problems, and I partially fixed them by bumping the return ring count:

#define BGE_RETURN_RING_CNT 1024 -> #define BGE_RETURN_RING_CNT 2048

I don't think it is THE solution, but it works better than before for me...

ppn> We have a Dell PowerEdge 2650 (successor to the 2550).
ppn> We also saw the same problem with 4.5. I tried the current bge driver from 4.6
ppn> without success. The problem seems to be a size problem. When we ftp a small
ppn> file, things work fine. However, when we try an 18-megabyte file, the ftp
ppn> hangs and we see the problem described below. The Linux system that came
ppn> with the hardware (from Dell) worked fine.
ppn> BTW, this was occurring with a 100 Mbit link.
ppn> I have not been able to get any resolution on this. The only replies seem to
ppn> indicate that something is seriously broken with the bge driver.
ppn> Paul Fronberg
ppn> [EMAIL PROTECTED]

>> I have a Dell PowerEdge 2550 with which I am having all sorts of
>> nasty network problems.
>> The network interface will just stop responding.
>> I get an error message like this:
>> Jun 18 08:19:38 shekondar /kernel: bge0: watchdog timeout -- resetting
>>
>> This is using the Broadcom 10/100/1000 NIC on the motherboard; the
>> Intel 10/100 has had similar issues but produces no log messages.
>>
>> Duplex and speed settings are forced on both the card and the switch.
>> Sometimes the kernel reset will clear the fault, but sometimes you
>> need to ifconfig down / up the interface to get it going again.
>>
>> This box has been running fine for several weeks; it is only as we have
>> started to shift production levels of traffic to it that it has started
>> this. Approx 30 Mbits/sec out and 12 Mbits/sec inbound.
>>
>> There was a simple ipfw ruleset on the box, but I have disabled that
>> just now to see if it helps.
>>
>> Googling has given me people who report similar problems but no
>> solutions / workarounds.
>>
>> Has anyone got any suggestions as to what to do next?
>>
>> Colin
>>
>> Here is the output of pciconf:
>>
>> bge0@pci1:8:0: class=0x02 card=0x00d11028 chip=0x164414e4 rev=0x12 hdr=0x00
>> vendor = 'Broadcom Corporation'
>> device = 'BCM5700/1 Gigabit Ethernet Controller'
>> class = network
>> subclass = ethernet

-- Regards, Fabien mailto:[EMAIL PROTECTED]
bpf_tap problem with PKTHDR
Hi,

It seems there is a problem in the bpf_mtap code: the code assumes, in the seesent case, that the mbuf has a pkthdr structure. There are two problems here:
+ the code does not check for that with (m_flags & M_PKTHDR)
+ at the upper level, the callers forge a fake mbuf that does not contain any pkthdr and does not initialize the m_flags field

What do you think about that?

if_ethersubr.c case:

	/* Check for a BPF tap */
	if (ifp->if_bpf != NULL) {
		struct m_hdr mh;

		/* This kludge is OK; BPF treats the "mbuf" as read-only */
		mh.mh_next = m;
		mh.mh_data = (char *)eh;
		mh.mh_len = ETHER_HDR_LEN;
		bpf_mtap(ifp, (struct mbuf *)&mh);
	}

bpf_mtap function:

/*
 * Incoming linkage from device drivers, when packet is in an mbuf chain.
 */
void
bpf_mtap(ifp, m)
	struct ifnet *ifp;
	struct mbuf *m;
{
	struct bpf_if *bp = ifp->if_bpf;
	struct bpf_d *d;
	u_int pktlen, slen;
	struct mbuf *m0;

	pktlen = 0;
	for (m0 = m; m0 != 0; m0 = m0->m_next)
		pktlen += m0->m_len;

	for (d = bp->bif_dlist; d != 0; d = d->bd_next) {
		if (!d->bd_seesent && (m->m_pkthdr.rcvif == NULL))
			continue;
		++d->bd_rcount;
		slen = bpf_filter(d->bd_filter, (u_char *)m, pktlen, 0);
		if (slen != 0)
			catchpacket(d, (u_char *)m, pktlen, slen, bpf_mcopy);
	}
}

fabien
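A possible shape of the fix (only a sketch, not committed code): have callers that build a fake m_hdr also clear mh_flags, and have bpf_mtap() treat a chain head without M_PKTHDR as a locally generated packet instead of reading an uninitialized m_pkthdr.rcvif.

	/* in the caller (e.g. if_ethersubr.c), next to the other mh_* fields */
	mh.mh_flags = 0;

	/* in bpf_mtap(), replacing the existing bd_seesent test */
	if (!d->bd_seesent &&
	    (!(m->m_flags & M_PKTHDR) || m->m_pkthdr.rcvif == NULL))
		continue;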
Re: bpf_tap problem with PKTHDR
MB> I found a similar problem with the bpf flag BIOCSSEESENT. Here is a simple
MB> workaround:

Yes, it's the same problem that I've found, but it is not limited to the Ethernet case: virtually every bpf_mtap call site must be modified to add support for a 'real' pkthdr.

fabien
Re: Recursive encapsulation could panic the Kernel
We could use a TTL associated with the mbuf that is decremented each time we reach a possible 'point of loop'. The bad point is that we need a new field in the mbuf...

fabien

VJ> Hi,
VJ> With FreeBSD, there are many ways to create a recursive local encapsulation
VJ> loop within the IPv4 and IPv6 stack. For example, this problem shows up when:
VJ> - Netgraph with pptp is used, or Netgraph with an ng_iface over UDP, or any
VJ>   more complex Netgraph topologies...
VJ> - gre interfaces
VJ> - gif tunnels
VJ> - ...
VJ> There is a simple local solution that is used by gif_output() that is not
VJ> protected by any mutex:
VJ>
VJ> 	/*
VJ> 	 * gif may cause infinite recursion calls when misconfigured.
VJ> 	 * We'll prevent this by introducing upper limit.
VJ> 	 * XXX: this mechanism may introduce another problem about
VJ> 	 * mutual exclusion of the variable CALLED, especially if we
VJ> 	 * use kernel thread.
VJ> 	 */
VJ> 	if (++called > max_gif_nesting) {
VJ> 		log(LOG_NOTICE,
VJ> 		    "gif_output: recursively called too many times(%d)\n",
VJ> 		    called);
VJ> 		m_freem(m);
VJ> 		error = EIO;	/* is there better errno? */
VJ> 		goto end;
VJ> 	}
VJ>
VJ> I am wondering if a more generic solution could be found, however I do not
VJ> have any idea yet ;-(
VJ> I mean, is it possible to protect the kernel against any panic that could
VJ> come from a mis-configuration of the routing tables?
VJ> Regards,
VJ> Vincent
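A rough sketch of that idea (the m_encap_ttl field is purely hypothetical, it does not exist in the stock mbuf header, and the error code is only illustrative): every encapsulation point checks and decrements the counter, so a misconfigured tunnel loop is bounded instead of recursing until the kernel panics.

	#define M_ENCAP_TTL_DEFAULT	8	/* assumed default depth, set when the packet enters the stack */

	/* at every possible 'point of loop' (gif, gre, ng_iface, ...) */
	if (m->m_encap_ttl == 0) {
		log(LOG_NOTICE, "encapsulation loop detected, dropping packet\n");
		m_freem(m);
		return (ELOOP);
	}
	m->m_encap_ttl--;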
rl driver MAC address problem
Hi,

I have a problem setting the MAC address with the rl driver (ifconfig rl0 ether xx:xx:xx:xx:xx:xx). The chip datasheet says the address can be read one byte at a time, but it must be written 4 bytes at a time. The patch below corrects the problem and was adapted from the Linux driver (but it has a 2-byte read overflow).

Could someone review & commit the correction?

fabien

--- if_rl.c.orig	Thu Mar 6 11:32:58 2003
+++ if_rl.c	Thu Mar 6 11:33:37 2003
@@ -1474,7 +1474,7 @@
 	struct rl_softc		*sc = xsc;
 	struct ifnet		*ifp = &sc->arpcom.ac_if;
 	struct mii_data		*mii;
-	int			s, i;
+	int			s;
 	u_int32_t		rxcfg = 0;
 
 	s = splimp();
@@ -1487,9 +1487,9 @@
 	rl_stop(sc);
 
 	/* Init our MAC address */
-	for (i = 0; i < ETHER_ADDR_LEN; i++) {
-		CSR_WRITE_1(sc, RL_IDR0 + i, sc->arpcom.ac_enaddr[i]);
-	}
+	CSR_WRITE_1(sc, RL_EECMD, RL_EEMODE_WRITECFG);
+	CSR_WRITE_4(sc, RL_IDR0, *(u_int32_t *)&sc->arpcom.ac_enaddr[0]);
+	CSR_WRITE_4(sc, RL_IDR4, *(u_int32_t *)&sc->arpcom.ac_enaddr[4]);
 
 	/* Init the RX buffer pointer register. */
 	CSR_WRITE_4(sc, RL_RXADDR, vtophys(sc->rl_cdata.rl_rx_buf));
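Incidentally, one way to avoid the 2-byte read overflow mentioned above would be to stage the address in a zero-padded buffer first (only a sketch, not tested on the hardware; the register and constant names are the ones already used in the patch):

	u_int32_t eaddr[2] = { 0, 0 };

	bcopy(sc->arpcom.ac_enaddr, eaddr, ETHER_ADDR_LEN);

	/* Init our MAC address: unlock the config registers, then two 32-bit writes */
	CSR_WRITE_1(sc, RL_EECMD, RL_EEMODE_WRITECFG);
	CSR_WRITE_4(sc, RL_IDR0, eaddr[0]);
	CSR_WRITE_4(sc, RL_IDR4, eaddr[1]);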
ALTQ integration
What is the status of the ALTQ framework integration into FreeBSD? OpenBSD has native support, but I think the merge with pf is a bad idea because it does not allow third-party classifiers.

fabien
em driver problem (system lock)
Hi,

We use a lot of Intel gigabit cards, and since we started using them we have experienced strange hard locks of the system (4.9 | FreeBSD-stable). We have tried several driver versions (it is not related to one version). We use the cards in polling mode, but it seems the problem can be triggered even in interrupt mode.

What I found while debugging on a fiber card:
1) the original driver did not lock, but when the other end is rebooted I see around 10 link up / link down transitions
2) after removing the link up / link down printf, the driver locks each time the other end system is rebooted!
3) removing E1000_IMC_RXSEQ from disable_intr fixes the lock, but I do not understand why:
   a) does E1000_IMC_RXSEQ need to be left enabled when disabling interrupts?
   b) can the system lock up completely (even under the debugger) because of a single enabled interrupt source?

static void
em_disable_intr(struct adapter *adapter)
{
	E1000_WRITE_REG(&adapter->hw, IMC, (0xffffffff));	/* & ~E1000_IMC_RXSEQ)); */
	return;
}

What do you think of that?

fabien