Hello list!

There are constantly recurring discussions about networking stack performance and possible changes.

I'll try to summarize the current problems and possible solutions from my point of view. (Generally this is one problem: the stack is slooooow, but we need to know why and what to do about it.)

Let's start with current IPv4 packet flow on a typical router:
http://static.ipfw.ru/images/freebsd_ipv4_flow.png

(I'm sorry I can't provide this as text since Visio doesn't have any 'ascii-art' exporter.)

Note that we are using a process-to-completion model, i.e. each packet is processed within the ISR until it is either
consumed by the L4+ stack, dropped, or put on an egress NIC queue.

(There is also a deferred ISR model implemented inside netisr, but it does not change much. It can help to do more fine-grained hashing (for GRE or other similar traffic), but:
1) it uses per-packet mutex locking, which kills performance;
2) it currently does not have _any_ hashing functions (note the absence of flags in `netstat -Q`). People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or a modified PPPoE/GRE version)
report some profit, but without fixing (1) it can't help much.
)

So, let's start:

1) ixgbe uses a mutex to protect each RX ring, which is perfectly fine since there is nearly no contention (the only thing that can interfere is driver reconfiguration, which is rare and, more significantly, we take the lock once for the whole batch of packets received in a given interrupt). However, due to some (im)possible deadlocks, the current code
does a per-packet ring unlock/lock (see ixgbe_rx_input()).
There was a discussion that ended with nothing: http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html

1*) Possible BPF users. Here we take one rlock if any readers are present (plus a mutex for any matching packets, but this is more or less OK; additionally, there is WIP to implement multiqueue BPF, and there is a chance we can reduce lock contention there). There is also an "optimize_writers" hack permitting applications like CDP to use BPF as writers without registering them as receivers (which would imply the rlock).

2/3) Virtual interfaces (laggs, vlans over lagg, and other similar constructions). Currently we simply take an rlock to do s/ix0/lagg0/ and, what is much funnier, we walk a complex vlan_hash under another rlock to
get the vlan interface from the underlying one.

This is definitely not how things should be done, and it can be changed more or less easily.

There are some useful terms/techniques in the world of software/hardware routing: a clear 'control plane' and 'data plane' separation. The former deals with control traffic (IGP, MLD, IGMP snooping, lagg hellos, ARP/NDP, etc.) and some data traffic (packets with TTL=1, with options, destined to hosts without an ARP/NDP record, and similar). The latter is done in hardware (or an efficient software implementation). The control plane is responsible for providing the data needed for efficient data plane operation. This is the point we are missing nearly everywhere.

What I want to say is: lagg is pure control-plane stuff, and vlan is nearly the same. We can't apply this approach to complex cases like lagg-over-vlans-over-vlans-over-(pppoe_ng0-and-wifi0), but we definitely can do it for the most common setups, like igb*/ix* in a lagg with or without vlans on top.

We already have some capabilities like VLANHWFILTER/VLANHWTAG, and we can add more. We even have per-driver hooks to program HW filtering.

One small step is to deliver the packet to the vlan interface directly (P1); proof of concept (working in production):
http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html

Another is to change lagg packet accounting: http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html Again, this is more like what HW boxes do (aggregate all counters, including errors) (and I can't imagine what real error we could get from _lagg_ itself).

4) If we are a router, we can either run the slooow ip_input() -> ip_forward() -> ip_output() cycle or use the optimized ip_fastfwd(), which falls back to the 'slow' path for multicast/options/local traffic (i.e. it works exactly like the 'data plane' part). (Btw, we could consider turning net.inet.ip.fastforwarding on by default, at least for non-IPSEC kernels.)

Here we have to determine whether a packet is local or not, i.e. an F(dst_ip) returning 1 or 0. Currently we simply use the standard rlock plus a hash of interface addresses.
(And some consumers like ipfw(4) do the same, but without the lock.)
We don't need to do this! We can build a sorted array of IPv4 addresses (or another efficient structure) on every address change and use it unlocked, with delayed garbage collection (proof of concept attached). (There is another thing to discuss: maybe we can do this once somewhere in ip_input() and mark the mbuf as 'local/non-local'?)
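A minimal userland sketch of that unlocked-lookup idea (names like `localip_tbl` and `localip_rebuild` are hypothetical; the attached patch actually uses a hash, and the real kernel version would publish the new table with an atomic pointer store and retire the old one via delayed GC, not free(3)):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Control path: on every address change, build a new sorted array of
 * local IPv4 addresses (host byte order) and publish it with a single
 * pointer swap.  Data path: unlocked binary search.
 */
struct localip_tbl {
	uint32_t	count;
	uint32_t	addrs[];	/* sorted ascending */
};

static struct localip_tbl *localip_ptr;	/* currently published table */

static int
cmp_u32(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;

	return ((x > y) - (x < y));
}

/* Control plane: build and publish a new table, retire the old one. */
static void
localip_rebuild(const uint32_t *addrs, uint32_t count)
{
	struct localip_tbl *new, *old;

	new = malloc(sizeof(*new) + count * sizeof(uint32_t));
	if (new == NULL)
		return;
	new->count = count;
	memcpy(new->addrs, addrs, count * sizeof(uint32_t));
	qsort(new->addrs, count, sizeof(uint32_t), cmp_u32);

	old = localip_ptr;
	localip_ptr = new;	/* atomic store in the kernel version */
	free(old);		/* delayed GC in the kernel version */
}

/* Data plane: lockless lookup, returns 1 if addr is a local address. */
static int
in_localip_fast(uint32_t addr)
{
	struct localip_tbl *t = localip_ptr;
	uint32_t lo, hi, mid;

	if (t == NULL)
		return (0);
	for (lo = 0, hi = t->count; lo < hi;) {
		mid = lo + (hi - lo) / 2;
		if (t->addrs[mid] < addr)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (lo < t->count && t->addrs[lo] == addr);
}
```

The point is that the data path never takes a lock; all the cost is paid on the (rare) address change.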

5, 9) Currently we have L3 ingress/egress PFIL hooks protected by rmlocks. This is OK.

However, 6) and 7) are not: the firewall can use the same pfil lock as reader protection instead of imposing its own lock. The pfil and ipfw code is currently ready to do this.

8) The radix/rt* API. This is probably the worst place in the entire stack. It is too generic, too slow, and buggy (do you use IPv6? then you definitely know what I'm talking about). A) It really is too generic, and the assumption that it can be used efficiently for every family is wrong. Two examples: we don't need to look up all 128 bits of an IPv6 address. Subnets with prefixes longer than /64 are not widely used (actually the only reason to use them is p2p links, due to potential ND problems). One common solution is to look up 64 bits and build another trie (or other structure) for the collision case. Another example is MPLS, where we can simply do a direct array lookup based on the ingress label.

B) It is terribly slow (AFAIR luigi@ did some performance measurements; numbers are available in one of the netmap PDFs). C) It is not multipath-capable. Stateful (and non-working) multipath is definitely not the right way.
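To illustrate point A, here is a toy userland sketch of the 64-bits-first IPv6 lookup (all names are hypothetical; a real implementation would use a trie keyed on the 64-bit prefix rather than this open-addressing hash, and would handle longest-prefix matching for the first stage too):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define	RT6_HSIZE	64		/* toy hash size, power of 2 */

struct rt6_sub {			/* prefixes longer than /64 */
	uint64_t	lo_bits;
	uint64_t	lo_mask;
	int		nexthop;
};

struct rt6_entry {
	uint64_t	prefix64;	/* upper 64 bits of the /64 */
	int		valid;
	int		nexthop;	/* result when no sub entry matches */
	struct rt6_sub	*sub;		/* optional second stage */
	size_t		nsub;
};

static struct rt6_entry rt6_tbl[RT6_HSIZE];

static size_t
rt6_hash(uint64_t prefix64)
{
	return ((prefix64 * 0x9e3779b97f4a7c15ULL) >> 58) & (RT6_HSIZE - 1);
}

static void
rt6_add(uint64_t prefix64, int nexthop, struct rt6_sub *sub, size_t nsub)
{
	size_t i = rt6_hash(prefix64);

	while (rt6_tbl[i].valid && rt6_tbl[i].prefix64 != prefix64)
		i = (i + 1) & (RT6_HSIZE - 1);
	rt6_tbl[i] = (struct rt6_entry){ prefix64, 1, nexthop, sub, nsub };
}

/* Returns a nexthop id, or -1 if there is no route. */
static int
rt6_lookup(uint64_t hi64, uint64_t lo64)
{
	size_t i = rt6_hash(hi64);
	struct rt6_entry *e;

	while (rt6_tbl[i].valid) {
		if (rt6_tbl[i].prefix64 == hi64)
			break;
		i = (i + 1) & (RT6_HSIZE - 1);
	}
	e = &rt6_tbl[i];
	if (!e->valid)
		return (-1);
	if (e->sub != NULL) {		/* rare: longer-than-/64 prefixes */
		for (size_t j = 0; j < e->nsub; j++)
			if ((lo64 & e->sub[j].lo_mask) == e->sub[j].lo_bits)
				return (e->sub[j].nexthop);
	}
	return (e->nexthop);
}
```

The common case (no prefixes longer than /64 under a given /64) resolves with a single 64-bit compare instead of a 128-bit trie walk.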

8*) rtentry
We are doing it wrong.
Currently _every_ lookup locks/unlocks a given rte twice.
The first lock is related to an old, old story of trusting IP redirects (and auto-adding host routes for them). Fortunately, it is now disabled automatically when you turn forwarding on. The second one is much more complicated: we assume that rtes with a non-zero refcount can stop the egress interface from being destroyed.
This is a wrong (but widely relied-upon) assumption.

We can use delayed GC instead of locking for rtes, and this won't break things more than they are broken now (patch attached).
We can't do the same for ifp structures since:
a) virtual ones can assume some state in the underlying physical NIC;
b) physical ones just _can_ be destroyed (regardless of whether the user wants it, e.g. an SFP being unplugged from the NIC), or can simply lead to a kernel crash due to SW/HW inconsistency.

One possible solution is to implement stable refcounts based on PCPU counters and apply those to ifp, but this seems non-trivial.


Another rtalloc(9) problem is the fact that radix is used as both the 'control plane' and the 'data plane' structure/API. Some users always want to put more information into an rte, while others
want to make rtes more compact. We simply need _different_ structures for that:
a feature-rich, data-heavy control-plane one (to store everything we want to store, including, for example, the PID of the process originating the route), which the current radix can be modified to provide, and another, address-family-dependent structure (array, trie, or anything else) which contains _only_ the data necessary to put the packet on the wire.
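A rough sketch of what such a split could look like (both structure names, `rtentry_ctl` and `nhop_data`, are hypothetical; the point is only that the data-plane nexthop stays within one cache line while the control-plane entry can grow freely):

```c
#include <assert.h>
#include <stdint.h>

#define	NHOP_MAX_L2LEN	20	/* ethernet header + vlan tag, padded */

/* Data plane: only what is needed to put a packet on the wire. */
struct nhop_data {
	uint16_t	ifindex;	/* egress interface */
	uint16_t	mtu;
	uint8_t		l2_len;		/* valid bytes in l2_data */
	uint8_t		flags;
	uint8_t		l2_data[NHOP_MAX_L2LEN]; /* prebuilt header */
};

/* Control plane: feature-rich, can grow without hurting the fast path. */
struct rtentry_ctl {
	uint8_t		dst[16];	/* prefix bits, any family */
	uint8_t		plen;
	uint32_t	flags;
	uint32_t	pid;		/* originating process, if desired */
	uint64_t	expire;
	char		ifname[16];
	struct nhop_data *nh;		/* 'compiled' data-plane result */
};
```

The control plane 'compiles' each route change into a fresh nhop_data; the forwarding path only ever touches the compact structure.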

11) arpresolve. Currently (this was decoupled in 8.x) we take:
a) an ifaddr rlock;
b) an lle rlock.

We don't need those locks.
We need to:
a) make the lle layer per-interface instead of global (this can also solve the issue of multiple FIBs having their L2 mappings done in fib 0);
b) use the rtalloc(9)-provided lock instead of separate locking;
c) actually, rewrite this layer entirely, because
d) lle is the natural place to do real multipath:

Briefly: an rte points to some special nexthop structure pointing to an lle, which holds the following data: num_of_egress_ifaces: [ifindex1, ifindex2, ifindex3] | L2 data to prepend to the header.
A separate post will follow.

With this, we can achieve lagg traffic distribution without actually using lagg_transmit and similar machinery (at least in the most common scenarios). For example, TCP output can definitely benefit, since we can compute the flowid once per TCP session and use it in every mbuf.
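A toy sketch of such a multipath nexthop (hypothetical names; the real structure would live behind the rte/lle machinery described above):

```c
#include <assert.h>
#include <stdint.h>

#define	MP_MAXPATHS	4
#define	MP_L2LEN	14	/* plain ethernet header */

/* One nexthop holding several egress ports plus prebuilt L2 headers. */
struct mp_nhop {
	uint8_t		num_ifaces;
	uint16_t	ifindex[MP_MAXPATHS];
	uint8_t		l2_data[MP_MAXPATHS][MP_L2LEN];
};

/*
 * Pick the egress interface from the mbuf flowid: the same flowid
 * always maps to the same port, so a TCP session hashes once and
 * every subsequent mbuf reuses the result.
 */
static uint16_t
mp_select_ifindex(const struct mp_nhop *nh, uint32_t flowid)
{
	return (nh->ifindex[flowid % nh->num_ifaces]);
}
```

This is exactly how lagg distribution could happen without ever entering lagg_transmit: the selection is a modulo on data already carried in the mbuf.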


So, imagine we have done all this. How can we estimate the difference?

There was a thread, started a year ago, describing 'stock' performance and the difference made by various modifications.
It was done on 8.x; however, I've got similar results on recent 9.x.

http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032680.html

Briefly:

2xE5645 @ Intel 82599 NIC.
Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE, no firewall.
Ixia XM2 (traffic generator) <> ix0 (FreeBSD). Ixia sends 64-byte IP packets from vlan10 (10.100.0.64 - 10.100.0.156) to destinations in vlan11 (10.100.1.128 - 10.100.1.192). Static ARP entries are configured for all destination addresses. The traffic level is slightly above or slightly below system performance.

We start from 1.4 MPPS (if we are using several routes to minimize mutex contention).

My 'current' result for the same test, on the same HW, with the following modifications:

* 1) ixgbe per-packet ring unlock removed
* P1) ixgbe is modified to do direct vlan input (so 2,3 are not used)
* 4) separate lockless in_localip() version
* 6) - using existing pfil lock
* 7) using lockless version
* 8) radix converted to use rmlock instead of rlock. Delayed GC is used instead of mutexes
* 10) - using existing pfil lock
* 11) using radix lock to do arpresolve(). Not using lle rlock

(so rmlocks are the only locks used on the data path).

Additionally:
* ipstat counters are converted to PCPU (no real performance implications);
* ixgbe does not do per-packet accounting (as in head);
* if_vlan counters are converted to PCPU;
* lagg is converted to rmlock, and per-packet accounting is removed (using stats from the underlying interfaces);
* lle hash size is bumped to 1024 instead of 32 (not applicable here, but the small size slows things down for large L2 domains).

The result is 5.6 MPPS for a single port (11 cores) and 6.5 MPPS for lagg (16 cores); nearly the same with HT on and 22 cores.

..
meanwhile, Intel DPDK claims 80 MPPS (and 6WINDGate talks about 160 or so) on the same class of hardware, with _userland_ forwarding.

One of the key features making all such products possible (DPDK, netmap, PacketShader, Cisco SW forwarding) is the use of batching instead of the process-to-completion model. Batching amortizes locking cost, does not wash out the CPU cache, and so on.

So maybe we should consider passing batches from the NIC at least up to the L2 layer, with netisr? Or even up to ip_input()?
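To make the batching idea concrete, here is a toy userland model (if_input_batch() is a hypothetical API, not an existing FreeBSD one; the mbuf here is a stub):

```c
#include <assert.h>
#include <stddef.h>

#define	RX_BATCH	32

struct mbuf { int len; };

static int packets_delivered;
static int upcalls;

/* Hypothetical upper-layer entry point taking a whole batch. */
static void
if_input_batch(struct mbuf **pkts, int count)
{
	upcalls++;
	packets_delivered += count;
}

/* Drain a ring of 'avail' packets in batches of RX_BATCH. */
static void
rxeof_batched(struct mbuf *ring, int avail)
{
	struct mbuf *batch[RX_BATCH];
	int n = 0;

	for (int i = 0; i < avail; i++) {
		/* ...ring lock held here, descriptors refilled, etc... */
		batch[n++] = &ring[i];
		if (n == RX_BATCH) {
			/* drop the ring lock once per batch, not per packet */
			if_input_batch(batch, n);
			n = 0;
		}
	}
	if (n > 0)
		if_input_batch(batch, n);
}
```

Instead of one lock/unlock pair and one upper-layer call per packet, the cost is paid once per RX_BATCH packets.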

Another question is about implementing some sort of reliable GC, like 'passive serialization' (or the other similar hard-to-pronounce terms the Linux world uses for lockless objects, e.g. RCU).
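A minimal userland model of such delayed GC (the rtgc_* names follow the attached patch; the epoch counter here is a stand-in for real quiescence detection, which in the kernel would track when all readers that could hold the pointer have finished):

```c
#include <assert.h>
#include <stdlib.h>

#define	GC_GRACE	2	/* ticks an object must age before free */

struct gc_item {
	struct gc_item	*next;
	void		*obj;
	unsigned	retired_at;
};

static struct gc_item *gc_head;
static unsigned gc_epoch;
static int gc_freed;

/* Retire an object: lockless readers may still hold pointers to it. */
static void
rtgc_free(void *obj)
{
	struct gc_item *it = malloc(sizeof(*it));

	if (it == NULL)
		return;
	it->obj = obj;
	it->retired_at = gc_epoch;
	it->next = gc_head;
	gc_head = it;
}

/* Periodic callout: free everything old enough that no reader remains. */
static void
rtgc_tick(void)
{
	struct gc_item **pp = &gc_head, *it;

	gc_epoch++;
	while ((it = *pp) != NULL) {
		if (gc_epoch - it->retired_at >= GC_GRACE) {
			*pp = it->next;
			free(it->obj);
			free(it);
			gc_freed++;
		} else
			pp = &it->next;
	}
}
```

The fast path never touches a refcount or lock; only the retire path and the periodic tick pay any cost.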


P.S. The attached patches are 1) for 8.x and 2) mostly 'hacks' showing roughly how this can be done and what benefit can be achieved.









commit 20a52503455c80cd149d2232bdc0d37e14381178
Author: Charlie Root <r...@test15.yandex.net>
Date:   Tue Oct 23 21:20:13 2012 +0000

    Remove RX ring unlock/lock before calling if_input() from ixgbe drivers.

diff --git a/sys/dev/ixgbe/ixgbe.c b/sys/dev/ixgbe/ixgbe.c
index 5d8752b..fc1491e 100644
--- a/sys/dev/ixgbe/ixgbe.c
+++ b/sys/dev/ixgbe/ixgbe.c
@@ -4171,9 +4171,7 @@ ixgbe_rx_input(struct rx_ring *rxr, struct ifnet *ifp, 
struct mbuf *m, u32 ptype
                         if (tcp_lro_rx(&rxr->lro, m, 0) == 0)
                                 return;
         }
-       IXGBE_RX_UNLOCK(rxr);
         (*ifp->if_input)(ifp, m);
-       IXGBE_RX_LOCK(rxr);
 }
 
 static __inline void
Index: sys/dev/ixgbe/ixgbe.c
===================================================================
--- sys/dev/ixgbe/ixgbe.c       (revision 248704)
+++ sys/dev/ixgbe/ixgbe.c       (working copy)
@@ -2880,6 +2880,14 @@ ixgbe_allocate_queues(struct adapter *adapter)
                        error = ENOMEM;
                        goto err_rx_desc;
                }
+
+               if ((rxr->vlans = malloc(sizeof(struct ifvlans), M_DEVBUF,
+                   M_NOWAIT | M_ZERO)) == NULL) {
+                       device_printf(dev,
+                           "Critical Failure setting up vlan index\n");
+                       error = ENOMEM;
+                       goto err_rx_desc;
+               }
        }
 
        /*
@@ -4271,6 +4279,11 @@ ixgbe_free_receive_buffers(struct rx_ring *rxr)
                rxr->ptag = NULL;
        }
 
+       if (rxr->vlans != NULL) {
+               free(rxr->vlans, M_DEVBUF);
+               rxr->vlans = NULL;
+       }
+
        return;
 }
 
@@ -4303,7 +4316,7 @@ ixgbe_rx_input(struct rx_ring *rxr, struct ifnet *
                                 return;
         }
        IXGBE_RX_UNLOCK(rxr);
-        (*ifp->if_input)(ifp, m);
+        (*ifp->if_input)(m->m_pkthdr.rcvif, m);
        IXGBE_RX_LOCK(rxr);
 }
 
@@ -4360,6 +4373,7 @@ ixgbe_rxeof(struct ix_queue *que)
        u16                     count = rxr->process_limit;
        union ixgbe_adv_rx_desc *cur;
        struct ixgbe_rx_buf     *rbuf, *nbuf;
+       struct ifnet            *ifp_dst;
 
        IXGBE_RX_LOCK(rxr);
 
@@ -4522,9 +4536,19 @@ ixgbe_rxeof(struct ix_queue *que)
                            (staterr & IXGBE_RXD_STAT_VP))
                                vtag = le16toh(cur->wb.upper.vlan);
                        if (vtag) {
-                               sendmp->m_pkthdr.ether_vtag = vtag;
-                               sendmp->m_flags |= M_VLANTAG;
-                       }
+                               ifp_dst = rxr->vlans->idx[EVL_VLANOFTAG(vtag)];
+
+                               if (ifp_dst != NULL) {
+                                       ifp_dst->if_ipackets++;
+                                       sendmp->m_pkthdr.rcvif = ifp_dst;
+                               } else {
+                                       sendmp->m_pkthdr.ether_vtag = vtag;
+                                       sendmp->m_flags |= M_VLANTAG;
+                                       sendmp->m_pkthdr.rcvif = ifp;
+                               }
+                       } else
+                               sendmp->m_pkthdr.rcvif = ifp;
+
                        if ((ifp->if_capenable & IFCAP_RXCSUM) != 0)
                                ixgbe_rx_checksum(staterr, sendmp, ptype);
 #if __FreeBSD_version >= 800000
@@ -4625,7 +4649,32 @@ ixgbe_rx_checksum(u32 staterr, struct mbuf * mp, u
        return;
 }
 
+/*
+ * This routine gets real vlan ifp based on
+ * underlying ifp and vlan tag.
+ */
+static struct ifnet *
+ixgbe_get_vlan(struct ifnet *ifp, uint16_t vtag)
+{
 
+       /* XXX: IFF_MONITOR */
+#if 0
+       struct lagg_port *lp = ifp->if_lagg;
+       struct lagg_softc *sc = lp->lp_softc;
+
+       /* Skip lagg nesting */
+       while (ifp->if_type == IFT_IEEE8023ADLAG) {
+               lp = ifp->if_lagg;
+               sc = lp->lp_softc;
+               ifp = sc->sc_ifp;
+       }
+#endif
+       /* Get vlan interface based on tag */
+       ifp = VLAN_DEVAT(ifp, vtag);
+
+       return (ifp);
+}
+
 /*
 ** This routine is run via an vlan config EVENT,
 ** it enables us to use the HW Filter table since
@@ -4637,7 +4686,9 @@ static void
 ixgbe_register_vlan(void *arg, struct ifnet *ifp, u16 vtag)
 {
        struct adapter  *adapter = ifp->if_softc;
-       u16             index, bit;
+       u16             index, bit, j;
+       struct rx_ring  *rxr;
+       struct ifnet    *ifv;
 
        if (ifp->if_softc !=  arg)   /* Not our event */
                return;
@@ -4645,7 +4696,20 @@ ixgbe_register_vlan(void *arg, struct ifnet *ifp,
        if ((vtag == 0) || (vtag > 4095))       /* Invalid */
                return;
 
+       ifv = ixgbe_get_vlan(ifp, vtag);
+
        IXGBE_CORE_LOCK(adapter);
+
+       if (ifp->if_capenable & IFCAP_VLAN_HWFILTER) {
+               rxr = adapter->rx_rings;
+
+               for (j = 0; j < adapter->num_queues; j++, rxr++) {
+                       IXGBE_RX_LOCK(rxr);
+                       rxr->vlans->idx[vtag] = ifv;
+                       IXGBE_RX_UNLOCK(rxr);
+               }
+       }
+
        index = (vtag >> 5) & 0x7F;
        bit = vtag & 0x1F;
        adapter->shadow_vfta[index] |= (1 << bit);
@@ -4663,7 +4727,8 @@ static void
 ixgbe_unregister_vlan(void *arg, struct ifnet *ifp, u16 vtag)
 {
        struct adapter  *adapter = ifp->if_softc;
-       u16             index, bit;
+       u16             index, bit, j;
+       struct rx_ring  *rxr;
 
        if (ifp->if_softc !=  arg)
                return;
@@ -4672,6 +4737,15 @@ ixgbe_unregister_vlan(void *arg, struct ifnet *ifp
                return;
 
        IXGBE_CORE_LOCK(adapter);
+
+       rxr = adapter->rx_rings;
+
+       for (j = 0; j < adapter->num_queues; j++, rxr++) {
+               IXGBE_RX_LOCK(rxr);
+               rxr->vlans->idx[vtag] = NULL;
+               IXGBE_RX_UNLOCK(rxr);
+       }
+
        index = (vtag >> 5) & 0x7F;
        bit = vtag & 0x1F;
        adapter->shadow_vfta[index] &= ~(1 << bit);
@@ -4686,8 +4760,8 @@ ixgbe_setup_vlan_hw_support(struct adapter *adapte
 {
        struct ifnet    *ifp = adapter->ifp;
        struct ixgbe_hw *hw = &adapter->hw;
+       u32             ctrl, j;
        struct rx_ring  *rxr;
-       u32             ctrl;
 
 
        /*
@@ -4713,6 +4787,15 @@ ixgbe_setup_vlan_hw_support(struct adapter *adapte
        if (ifp->if_capenable & IFCAP_VLAN_HWFILTER) {
                ctrl &= ~IXGBE_VLNCTRL_CFIEN;
                ctrl |= IXGBE_VLNCTRL_VFE;
+       } else {
+               /* Zero vlan table */
+               rxr = adapter->rx_rings;
+
+               for (j = 0; j < adapter->num_queues; j++, rxr++) {
+                       IXGBE_RX_LOCK(rxr);
+                       memset(rxr->vlans->idx, 0, sizeof(struct ifvlans));
+                       IXGBE_RX_UNLOCK(rxr);
+               }
        }
        if (hw->mac.type == ixgbe_mac_82598EB)
                ctrl |= IXGBE_VLNCTRL_VME;
Index: sys/dev/ixgbe/ixgbe.h
===================================================================
--- sys/dev/ixgbe/ixgbe.h       (revision 248704)
+++ sys/dev/ixgbe/ixgbe.h       (working copy)
@@ -284,6 +284,11 @@ struct ix_queue {
        u64                     irqs;
 };
 
+struct ifvlans {
+       struct ifnet            *idx[4096];
+};
+
+
 /*
  * The transmit ring, one per queue
  */
@@ -307,7 +312,6 @@ struct tx_ring {
        }                       queue_status;
        u32                     txd_cmd;
        bus_dma_tag_t           txtag;
-       char                    mtx_name[16];
 #ifndef IXGBE_LEGACY_TX
        struct buf_ring         *br;
        struct task             txq_task;
@@ -324,6 +328,7 @@ struct tx_ring {
        unsigned long           no_tx_dma_setup;
        u64                     no_desc_avail;
        u64                     total_packets;
+       char                    mtx_name[16];
 };
 
 
@@ -346,8 +351,8 @@ struct rx_ring {
        u16                     num_desc;
        u16                     mbuf_sz;
        u16                     process_limit;
-       char                    mtx_name[16];
        struct ixgbe_rx_buf     *rx_buffers;
+       struct ifvlans          *vlans;
        bus_dma_tag_t           ptag;
 
        u32                     bytes; /* Used for AIM calc */
@@ -363,6 +368,7 @@ struct rx_ring {
 #ifdef IXGBE_FDIR
        u64                     flm;
 #endif
+       char                    mtx_name[16];
 };
 
 /* Our adapter structure */
commit 7f1103ac622881182642b2d3ae17b6ff484c1293
Author: Charlie Root <r...@test15.yandex.net>
Date:   Sun Apr 7 23:50:26 2013 +0000

    Use lockless in_localip_fast() function.

diff --git a/sys/net/route.h b/sys/net/route.h
index 4d9371b..f588f03 100644
--- a/sys/net/route.h
+++ b/sys/net/route.h
@@ -365,6 +365,7 @@ void         rt_maskedcopy(struct sockaddr *, struct 
sockaddr *, struct sockaddr *);
  */
 #define RTGC_ROUTE     1
 #define RTGC_IF                3
+#define        RTGC_IFADDR     4
 
 
 int     rtexpunge(struct rtentry *);
diff --git a/sys/netinet/in.c b/sys/netinet/in.c
index 5341918..a83b8a9 100644
--- a/sys/netinet/in.c
+++ b/sys/netinet/in.c
@@ -93,6 +93,20 @@ VNET_DECLARE(struct inpcbinfo, ripcbinfo);
 VNET_DECLARE(struct arpstat, arpstat);  /* ARP statistics, see if_arp.h */
 #define        V_arpstat               VNET(arpstat)
 
+struct in_ifaddrf {
+       struct in_ifaddrf *next;
+       struct in_addr addr;
+};
+
+struct in_ifaddrhashf {
+       uint32_t hmask;
+       uint32_t count;
+       struct in_ifaddrf **hash;
+};
+
+VNET_DEFINE(struct in_ifaddrhashf *, in_ifaddrhashtblf) = NULL; /* inet addr 
fast hash table */
+#define        V_in_ifaddrhashtblf     VNET(in_ifaddrhashtblf)
+
 /*
  * Return 1 if an internet address is for a ``local'' host
  * (one to which we have a connection).  If subnetsarelocal
@@ -145,6 +159,120 @@ in_localip(struct in_addr in)
        return (0);
 }
 
+int
+in_localip_fast(struct in_addr in)
+{
+       struct in_ifaddrf *rec;
+       struct in_ifaddrhashf *f;
+
+       if ((f = V_in_ifaddrhashtblf) == NULL)
+               return (0);
+
+       rec = f->hash[INADDR_HASHVAL(in) & f->hmask];
+
+       while (rec != NULL && rec->addr.s_addr != in.s_addr)
+               rec = rec->next;
+
+       if (rec != NULL)
+               return (1);
+
+       return (0);
+}
+
+struct in_ifaddrhashf *
+in_hash_alloc(int additional)
+{
+       int count, hsize, i;
+       struct in_ifaddr *ia;
+       struct in_ifaddrhashf *new;
+
+       count = additional + 1;
+
+       IN_IFADDR_RLOCK();
+       for (i = 0; i < INADDR_NHASH; i++) {
+               LIST_FOREACH(ia, &V_in_ifaddrhashtbl[i], ia_hash)
+                       count++;
+       }
+       IN_IFADDR_RUNLOCK();
+
+       /* roundup to the next power of 2 */
+       hsize = (1UL << flsl(count - 1));
+
+       new = malloc(sizeof(struct in_ifaddrhashf) +
+           sizeof(void *) * hsize +
+           sizeof(struct in_ifaddrf) * count, M_IFADDR,
+           M_NOWAIT | M_ZERO);
+
+       if (new == NULL)
+               return (NULL);
+
+       new->count = count;
+       new->hmask = hsize - 1;
+       new->hash = (struct in_ifaddrf **)(new + 1);
+
+       return (new);
+}
+
+int
+in_hash_build(struct in_ifaddrhashf *new)
+{
+       struct in_ifaddr *ia;
+       int i, j, count, hsize, r;
+       struct in_ifaddrhashf *old;
+       struct in_ifaddrf *rec, *tmp;
+
+       count = new->count - 1;
+       hsize = new->hmask + 1;
+       rec = (struct in_ifaddrf *)&new->hash[hsize];
+
+       IN_IFADDR_RLOCK();
+       for (i = 0; i < INADDR_NHASH; i++) {
+               LIST_FOREACH(ia, &V_in_ifaddrhashtbl[i], ia_hash) {
+                       rec->addr.s_addr = IA_SIN(ia)->sin_addr.s_addr;
+
+                       j = INADDR_HASHVAL(rec->addr) & new->hmask;
+                       if ((tmp = new->hash[j]) == NULL)
+                               new->hash[j] = rec;
+                       else {
+                               while (tmp->next)
+                                       tmp = tmp->next;
+                               tmp->next = rec;
+                       }
+
+                       rec++;
+                       count--;
+
+                       /* End of memory */
+                       if (count < 0)
+                               break;
+               }
+
+               /* End of memory */
+               if (count < 0)
+                       break;
+       }
+       IN_IFADDR_RUNLOCK();
+
+       /* If count >0 then we succeeded in building hash. Stop cycle */
+
+       if (count >= 0) {
+               old = V_in_ifaddrhashtblf;
+               V_in_ifaddrhashtblf = new;
+
+               rtgc_free(RTGC_IFADDR, old, 0);
+
+               return (1);
+       }
+
+       /* Fail. */
+       if (new)
+               free(new, M_IFADDR);
+
+       return (0);
+}
+
+
+
 /*
  * Determine whether an IP address is in a reserved set of addresses
  * that may not be forwarded, or whether datagrams to that destination
@@ -239,6 +367,7 @@ in_control(struct socket *so, u_long cmd, caddr_t data, 
struct ifnet *ifp,
        struct sockaddr_in oldaddr;
        int error, hostIsNew, iaIsNew, maskIsNew;
        int iaIsFirst;
+       struct in_ifaddrhashf *new_hash;
 
        ia = NULL;
        iaIsFirst = 0;
@@ -405,6 +534,11 @@ in_control(struct socket *so, u_long cmd, caddr_t data, 
struct ifnet *ifp,
                                goto out;
                        }
 
+                       if ((new_hash = in_hash_alloc(1)) == NULL) {
+                               error = ENOBUFS;
+                               goto out;
+                       }
+
                        ifa = &ia->ia_ifa;
                        ifa_init(ifa);
                        ifa->ifa_addr = (struct sockaddr *)&ia->ia_addr;
@@ -427,6 +561,8 @@ in_control(struct socket *so, u_long cmd, caddr_t data, 
struct ifnet *ifp,
                        IN_IFADDR_WLOCK();
                        TAILQ_INSERT_TAIL(&V_in_ifaddrhead, ia, ia_link);
                        IN_IFADDR_WUNLOCK();
+
+                       in_hash_build(new_hash);
                        iaIsNew = 1;
                }
                break;
@@ -649,6 +785,8 @@ in_control(struct socket *so, u_long cmd, caddr_t data, 
struct ifnet *ifp,
                        ifa_free(&if_ia->ia_ifa);
        } else
                IN_IFADDR_WUNLOCK();
+       if ((new_hash = in_hash_alloc(0)) != NULL)
+               in_hash_build(new_hash);
        ifa_free(&ia->ia_ifa);                          /* in_ifaddrhead */
 out:
        if (ia != NULL)
@@ -852,6 +990,7 @@ in_ifinit(struct ifnet *ifp, struct in_ifaddr *ia, struct 
sockaddr_in *sin,
        register u_long i = ntohl(sin->sin_addr.s_addr);
        struct sockaddr_in oldaddr;
        int s = splimp(), flags = RTF_UP, error = 0;
+       struct in_ifaddrhashf *new_hash;
 
        oldaddr = ia->ia_addr;
        if (oldaddr.sin_family == AF_INET)
@@ -862,6 +1001,9 @@ in_ifinit(struct ifnet *ifp, struct in_ifaddr *ia, struct 
sockaddr_in *sin,
                LIST_INSERT_HEAD(INADDR_HASH(ia->ia_addr.sin_addr.s_addr),
                    ia, ia_hash);
                IN_IFADDR_WUNLOCK();
+
+               if ((new_hash = in_hash_alloc(1)) != NULL)
+                       in_hash_build(new_hash);
        }
        /*
         * Give the interface a chance to initialize
@@ -887,6 +1029,8 @@ in_ifinit(struct ifnet *ifp, struct in_ifaddr *ia, struct 
sockaddr_in *sin,
                                 */
                                LIST_REMOVE(ia, ia_hash);
                        IN_IFADDR_WUNLOCK();
+                       if ((new_hash = in_hash_alloc(1)) != NULL)
+                               in_hash_build(new_hash);
                        return (error);
                }
        }
diff --git a/sys/netinet/in.h b/sys/netinet/in.h
index b03e74c..948938a 100644
--- a/sys/netinet/in.h
+++ b/sys/netinet/in.h
@@ -741,6 +741,7 @@ int  in_broadcast(struct in_addr, struct ifnet *);
 int     in_canforward(struct in_addr);
 int     in_localaddr(struct in_addr);
 int     in_localip(struct in_addr);
+int     in_localip_fast(struct in_addr);
 int     inet_aton(const char *, struct in_addr *); /* in libkern */
 char   *inet_ntoa(struct in_addr); /* in libkern */
 char   *inet_ntoa_r(struct in_addr ina, char *buf); /* in libkern */
diff --git a/sys/netinet/ip_fastfwd.c b/sys/netinet/ip_fastfwd.c
index 692e3e5..f7734a9 100644
--- a/sys/netinet/ip_fastfwd.c
+++ b/sys/netinet/ip_fastfwd.c
@@ -347,7 +347,7 @@ ip_fastforward(struct mbuf *m)
        /*
         * Is it for a local address on this host?
         */
-       if (in_localip(ip->ip_dst))
+       if (in_localip_fast(ip->ip_dst))
                return m;
 
        //IPSTAT_INC(ips_total);
@@ -390,7 +390,7 @@ ip_fastforward(struct mbuf *m)
                /*
                 * Is it now for a local address on this host?
                 */
-               if (in_localip(dest))
+               if (in_localip_fast(dest))
                        goto forwardlocal;
                /*
                 * Go on with new destination address
@@ -479,7 +479,7 @@ passin:
                /*
                 * Is it now for a local address on this host?
                 */
-               if (m->m_flags & M_FASTFWD_OURS || in_localip(dest)) {
+               if (m->m_flags & M_FASTFWD_OURS || in_localip_fast(dest)) {
 forwardlocal:
                        /*
                         * Return packet for processing by ip_input().
diff --git a/sys/netinet/ipfw/ip_fw2.c b/sys/netinet/ipfw/ip_fw2.c
index b76a638..53f6e97 100644
--- a/sys/netinet/ipfw/ip_fw2.c
+++ b/sys/netinet/ipfw/ip_fw2.c
@@ -1450,10 +1450,7 @@ do {                                                     
        \
 
                        case O_IP_SRC_ME:
                                if (is_ipv4) {
-                                       struct ifnet *tif;
-
-                                       INADDR_TO_IFP(src_ip, tif);
-                                       match = (tif != NULL);
+                                       match = in_localip_fast(src_ip);
                                        break;
                                }
 #ifdef INET6
@@ -1490,10 +1487,7 @@ do {                                                     
        \
 
                        case O_IP_DST_ME:
                                if (is_ipv4) {
-                                       struct ifnet *tif;
-
-                                       INADDR_TO_IFP(dst_ip, tif);
-                                       match = (tif != NULL);
+                                       match = in_localip_fast(dst_ip);
                                        break;
                                }
 #ifdef INET6
diff --git a/sys/netinet/ipfw/ip_fw_pfil.c b/sys/netinet/ipfw/ip_fw_pfil.c
index a21f501..bdf8beb 100644
--- a/sys/netinet/ipfw/ip_fw_pfil.c
+++ b/sys/netinet/ipfw/ip_fw_pfil.c
@@ -184,7 +184,7 @@ again:
                bcopy(args.next_hop, (fwd_tag+1), sizeof(struct sockaddr_in));
                m_tag_prepend(*m0, fwd_tag);
 
-               if (in_localip(args.next_hop->sin_addr))
+               if (in_localip_fast(args.next_hop->sin_addr))
                        (*m0)->m_flags |= M_FASTFWD_OURS;
            }
 #endif /* INET || INET6 */
commit 67a74d91a7b4a47a83fcfa5e79a6c6f0b4b1122d
Author: Charlie Root <r...@test15.yandex.net>
Date:   Fri Oct 26 17:10:52 2012 +0000

    Remove rte locking for IPv4. Remove one of 2 locks from IPv6 rtes

diff --git a/sys/net/if.c b/sys/net/if.c
index a875326..eb6a723 100644
--- a/sys/net/if.c
+++ b/sys/net/if.c
@@ -487,6 +487,13 @@ if_alloc(u_char type)
        return (ifp);
 }
 
+
+void
+if_free_real(struct ifnet *ifp)
+{
+       free(ifp, M_IFNET);
+}
+
 /*
  * Do the actual work of freeing a struct ifnet, and layer 2 common
  * structure.  This call is made when the last reference to an
@@ -499,6 +506,15 @@ if_free_internal(struct ifnet *ifp)
        KASSERT((ifp->if_flags & IFF_DYING),
            ("if_free_internal: interface not dying"));
 
+       if (rtgc_is_enabled()) {
+               /* 
+                * FIXME: Sleep some time to permit packets
+                * using fastforwarding routine without locking
+                * die withour side effects.
+                */
+               pause("if_free_gc", hz / 20); /* Sleep 50 milliseconds */
+       }
+
        if (if_com_free[ifp->if_alloctype] != NULL)
                if_com_free[ifp->if_alloctype](ifp->if_l2com,
                    ifp->if_alloctype);
@@ -511,7 +527,10 @@ if_free_internal(struct ifnet *ifp)
        IF_AFDATA_DESTROY(ifp);
        IF_ADDR_LOCK_DESTROY(ifp);
        ifq_delete(&ifp->if_snd);
-       free(ifp, M_IFNET);
+       if (rtgc_is_enabled())
+               rtgc_free(RTGC_IF, ifp, 0);
+       else
+               if_free_real(ifp);
 }
 
 /*
diff --git a/sys/net/if_var.h b/sys/net/if_var.h
index 39c499f..5ef6264 100644
--- a/sys/net/if_var.h
+++ b/sys/net/if_var.h
@@ -857,6 +857,7 @@ void        if_down(struct ifnet *);
 struct ifmultiaddr *
        if_findmulti(struct ifnet *, struct sockaddr *);
 void   if_free(struct ifnet *);
+void   if_free_real(struct ifnet *);
 void   if_free_type(struct ifnet *, u_char);
 void   if_initname(struct ifnet *, const char *, int);
 void   if_link_state_change(struct ifnet *, int);
diff --git a/sys/net/route.c b/sys/net/route.c
index 3059f5a..97965b3 100644
--- a/sys/net/route.c
+++ b/sys/net/route.c
@@ -142,6 +142,175 @@ VNET_DEFINE(int, rttrash);                /* routes not in table but not freed */
 static VNET_DEFINE(uma_zone_t, rtzone);                /* Routing table UMA zone. */
 #define        V_rtzone        VNET(rtzone)
 
+SYSCTL_NODE(_net, OID_AUTO, gc, CTLFLAG_RW, 0, "Garbage collector");
+
+MALLOC_DEFINE(M_RTGC, "rtgc", "route GC");
+void rtgc_func(void *_unused);
+void rtfree_real(struct rtentry *rt);
+
+int _rtgc_default_enabled = 1;
+TUNABLE_INT("net.gc.enable", &_rtgc_default_enabled);
+
+#define        RTGC_CALLOUT_DELAY      1
+#define        RTGC_EXPIRE_DELAY       3
+
+VNET_DEFINE(struct mtx, rtgc_mtx);
+#define        V_rtgc_mtx      VNET(rtgc_mtx)
+VNET_DEFINE(struct callout, rtgc_callout);
+#define        V_rtgc_callout  VNET(rtgc_callout)
+VNET_DEFINE(int, rtgc_enabled);
+#define        V_rtgc_enabled  VNET(rtgc_enabled)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, enable, CTLFLAG_RW,
+       &VNET_NAME(rtgc_enabled), 1,
+       "Enable garbage collector");
+VNET_DEFINE(int, rtgc_expire_delay) = RTGC_EXPIRE_DELAY;
+#define        V_rtgc_expire_delay     VNET(rtgc_expire_delay)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, expire, CTLFLAG_RW,
+       &VNET_NAME(rtgc_expire_delay), 1,
+       "Object expiration delay");
+VNET_DEFINE(int, rtgc_numfailures);
+#define        V_rtgc_numfailures      VNET(rtgc_numfailures)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, failures, CTLFLAG_RD,
+       &VNET_NAME(rtgc_numfailures), 0,
+       "Number of objects leaked from route garbage collector");
+VNET_DEFINE(int, rtgc_numqueued);
+#define        V_rtgc_numqueued        VNET(rtgc_numqueued)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, queued, CTLFLAG_RD,
+       &VNET_NAME(rtgc_numqueued), 0,
+       "Number of objects queued for deletion");
+VNET_DEFINE(int, rtgc_numfreed);
+#define        V_rtgc_numfreed VNET(rtgc_numfreed)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, freed, CTLFLAG_RD,
+       &VNET_NAME(rtgc_numfreed), 0,
+       "Number of objects deleted");
+VNET_DEFINE(int, rtgc_numinvoked);
+#define        V_rtgc_numinvoked       VNET(rtgc_numinvoked)
+SYSCTL_VNET_INT(_net_gc, OID_AUTO, invoked, CTLFLAG_RD,
+       &VNET_NAME(rtgc_numinvoked), 0,
+       "Number of times GC was invoked");
+
+struct rtgc_item {
+       time_t  expire; /* When we can delete this entry */
+       int     etype;  /* Entry type */
+       void    *data;  /* data to free */
+       TAILQ_ENTRY(rtgc_item)  items;
+};
+
+VNET_DEFINE(TAILQ_HEAD(, rtgc_item), rtgc_queue);
+#define        V_rtgc_queue    VNET(rtgc_queue)
+
+int
+rtgc_is_enabled(void)
+{
+       return V_rtgc_enabled;
+}
+
+void
+rtgc_func(void *_unused)
+{
+       struct rtgc_item *item, *temp_item;
+       TAILQ_HEAD(, rtgc_item) rtgc_tq;
+       int empty, deleted;
+
+       CTR2(KTR_NET, "%s: started with %d objects", __func__, V_rtgc_numqueued);
+
+       TAILQ_INIT(&rtgc_tq);
+
+       /* Move all contents of current queue to new empty queue */
+       mtx_lock(&V_rtgc_mtx);
+       V_rtgc_numinvoked++;
+       TAILQ_SWAP(&V_rtgc_queue, &rtgc_tq, rtgc_item, items);
+       mtx_unlock(&V_rtgc_mtx);
+
+       deleted = 0;
+
+       /* Dispatch as much as we can */
+       TAILQ_FOREACH_SAFE(item, &rtgc_tq, items, temp_item) {
+               if (item->expire > time_uptime)
+                       break;
+
+               /* We can definitely delete this item */
+               TAILQ_REMOVE(&rtgc_tq, item, items);
+
+               switch (item->etype) {
+               case RTGC_ROUTE:
+                       CTR1(KTR_NET, "Freeing route structure %p", item->data);
+                       rtfree_real((struct rtentry *)item->data);
+                       break;
+               case RTGC_IF:
+                       CTR1(KTR_NET, "Freeing iface structure %p", item->data);
+                       if_free_real((struct ifnet *)item->data);
+                       break;
+               default:
+                       CTR2(KTR_NET, "Unknown type: %d %p", item->etype, item->data);
+                       break;
+               }
+
+               /* Remove item itself */
+               free(item, M_RTGC);
+               deleted++;
+       }
+
+       /*
+        * Add remaining data back to the main queue.
+        * Note items are still sorted by expire time after the merge.
+        */
+
+       mtx_lock(&V_rtgc_mtx);
+       /* Add new items to the end of our temporary queue */
+       TAILQ_CONCAT(&rtgc_tq, &V_rtgc_queue, items);
+       /* Move items back to stable storage */
+       TAILQ_SWAP(&V_rtgc_queue, &rtgc_tq, rtgc_item, items);
+       /* Check if we need to run callout another time */
+       empty = TAILQ_EMPTY(&V_rtgc_queue);
+       /* Update counters */
+       V_rtgc_numfreed += deleted;
+       V_rtgc_numqueued -= deleted;
+       mtx_unlock(&V_rtgc_mtx);
+
+       CTR4(KTR_NET, "%s: ended with %d object(s) (%d deleted), callout: %s",
+               __func__, V_rtgc_numqueued, deleted, empty ? "stopped" : "scheduled");
+       /* Schedule ourself iff there are items to delete */
+       if (!empty)
+               callout_reset(&V_rtgc_callout, hz * RTGC_CALLOUT_DELAY, rtgc_func, NULL);
+}
+
+void
+rtgc_free(int etype, void *data, int can_sleep)
+{
+       struct rtgc_item *item;
+
+       item = malloc(sizeof(struct rtgc_item), M_RTGC, (can_sleep ? M_WAITOK : M_NOWAIT) | M_ZERO);
+       if (item == NULL) {
+               V_rtgc_numfailures++; /* XXX: locking */
+               return; /* Skip route freeing. A memory leak is much better than a panic */
+       }
+
+       item->expire = time_uptime + V_rtgc_expire_delay;
+       item->etype = etype;
+       item->data = data;
+
+       if ((!can_sleep) && (mtx_trylock(&V_rtgc_mtx) == 0)) {
+               /* Failed to acquire the lock; record another leak */
+               free(item, M_RTGC);
+               V_rtgc_numfailures++; /* XXX: locking */
+               return;
+       }
+
+       if (can_sleep)
+               mtx_lock(&V_rtgc_mtx);
+
+       TAILQ_INSERT_TAIL(&V_rtgc_queue, item, items);
+       V_rtgc_numqueued++;
+
+       mtx_unlock(&V_rtgc_mtx);
+
+       /* Schedule callout if not running */
+       if (!callout_pending(&V_rtgc_callout))
+               callout_reset(&V_rtgc_callout, hz * RTGC_CALLOUT_DELAY, rtgc_func, NULL);
+}
+
+
 /*
  * handler for net.my_fibnum
  */
@@ -241,6 +410,17 @@ vnet_route_init(const void *unused __unused)
                        dom->dom_rtattach((void **)rnh, dom->dom_rtoffset);
                }
        }
+
+       /* Init garbage collector */
+       mtx_init(&V_rtgc_mtx, "routeGC", NULL, MTX_DEF);
+       /* Init queue */
+       TAILQ_INIT(&V_rtgc_queue);
+       /* Init garbage callout */
+       memset(&V_rtgc_callout, 0, sizeof(V_rtgc_callout));
+       callout_init(&V_rtgc_callout, 1);
+       /* Set default from loader tunable */
+       V_rtgc_enabled = _rtgc_default_enabled;
+       //callout_reset(&V_rtgc_callout, 3 * hz, &rtgc_func, NULL);
 }
 VNET_SYSINIT(vnet_route_init, SI_SUB_PROTO_DOMAIN, SI_ORDER_FOURTH,
     vnet_route_init, 0);
@@ -351,6 +531,74 @@ rtalloc1(struct sockaddr *dst, int report, u_long ignflags)
 }
 
 struct rtentry *
+rtalloc1_fib_nolock(struct sockaddr *dst, int report, u_long ignflags,
+                   u_int fibnum)
+{
+       struct radix_node_head *rnh;
+       struct radix_node *rn;
+       struct rtentry *newrt;
+       struct rt_addrinfo info;
+       int err = 0, msgtype = RTM_MISS;
+       int needlock;
+
+       KASSERT((fibnum < rt_numfibs), ("rtalloc1_fib_nolock: bad fibnum"));
+       switch (dst->sa_family) {
+       case AF_INET6:
+       case AF_INET:
+               /* We support multiple FIBs. */
+               break;
+       default:
+               fibnum = RT_DEFAULT_FIB;
+               break;
+       }
+       rnh = rt_tables_get_rnh(fibnum, dst->sa_family);
+       newrt = NULL;
+       if (rnh == NULL)
+               goto miss;
+
+       /*
+        * Look up the address in the table for that Address Family
+        */
+       needlock = !(ignflags & RTF_RNH_LOCKED);
+       if (needlock)
+               RADIX_NODE_HEAD_RLOCK(rnh);
+#ifdef INVARIANTS      
+       else
+               RADIX_NODE_HEAD_LOCK_ASSERT(rnh);
+#endif
+       rn = rnh->rnh_matchaddr(dst, rnh);
+       if (rn && ((rn->rn_flags & RNF_ROOT) == 0)) {
+               newrt = RNTORT(rn);
+               if (needlock)
+                       RADIX_NODE_HEAD_RUNLOCK(rnh);
+               goto done;
+
+       } else if (needlock)
+               RADIX_NODE_HEAD_RUNLOCK(rnh);
+       
+       /*
+        * Either we hit the root or couldn't find any match,
+        * Which basically means
+        * "caint get there frm here"
+        */
+miss:
+       V_rtstat.rts_unreach++;
+
+       if (report) {
+               /*
+                * If required, report the failure to the supervising
+                * Authorities.
+                * For a delete, this is not an error. (report == 0)
+                */
+               bzero(&info, sizeof(info));
+               info.rti_info[RTAX_DST] = dst;
+               rt_missmsg_fib(msgtype, &info, 0, err, fibnum);
+       }       
+done:
+       return (newrt);
+}
+
+struct rtentry *
 rtalloc1_fib(struct sockaddr *dst, int report, u_long ignflags,
                    u_int fibnum)
 {
@@ -422,6 +670,23 @@ done:
        return (newrt);
 }
 
+
+void
+rtfree_real(struct rtentry *rt)
+{
+       /*
+        * The key is separately alloc'd so free it (see rt_setgate()).
+        * This also frees the gateway, as they are always malloc'd
+        * together.
+        */
+       Free(rt_key(rt));
+       
+       /*
+        * and the rtentry itself of course
+        */
+       uma_zfree(V_rtzone, rt);
+}
+
 /*
  * Remove a reference count from an rtentry.
  * If the count gets low enough, take it out of the routing table
@@ -484,18 +749,13 @@ rtfree(struct rtentry *rt)
                 */
                if (rt->rt_ifa)
                        ifa_free(rt->rt_ifa);
-               /*
-                * The key is separatly alloc'd so free it (see rt_setgate()).
-                * This also frees the gateway, as they are always malloc'd
-                * together.
-                */
-               Free(rt_key(rt));
 
-               /*
-                * and the rtentry itself of course
-                */
                RT_LOCK_DESTROY(rt);
-               uma_zfree(V_rtzone, rt);
+
+               if (V_rtgc_enabled)
+                       rtgc_free(RTGC_ROUTE, rt, 0);
+               else
+                       rtfree_real(rt);
                return;
        }
 done:
diff --git a/sys/net/route.h b/sys/net/route.h
index b26ac44..3aa694d 100644
--- a/sys/net/route.h
+++ b/sys/net/route.h
@@ -363,9 +363,14 @@ void        rt_maskedcopy(struct sockaddr *, struct sockaddr *, struct sockaddr *);
  *
  *    RTFREE() uses an unlocked entry.
  */
+#define RTGC_ROUTE     1
+#define RTGC_IF                3
+
 
 int     rtexpunge(struct rtentry *);
 void    rtfree(struct rtentry *);
+void    rtgc_free(int etype, void *data, int can_sleep);
+int    rtgc_is_enabled(void);
 int     rt_check(struct rtentry **, struct rtentry **, struct sockaddr *);
 
 /* XXX MRT COMPAT VERSIONS THAT SET UNIVERSE to 0 */
@@ -394,6 +399,7 @@ int  rt_getifa_fib(struct rt_addrinfo *, u_int fibnum);
 void    rtalloc_ign_fib(struct route *ro, u_long ignflags, u_int fibnum);
 void    rtalloc_fib(struct route *ro, u_int fibnum);
 struct rtentry *rtalloc1_fib(struct sockaddr *, int, u_long, u_int);
+struct rtentry *rtalloc1_fib_nolock(struct sockaddr *, int, u_long, u_int);
 int     rtioctl_fib(u_long, caddr_t, u_int);
 void    rtredirect_fib(struct sockaddr *, struct sockaddr *,
            struct sockaddr *, int, struct sockaddr *, u_int);
diff --git a/sys/netinet/in_rmx.c b/sys/netinet/in_rmx.c
index 1389873..1c9d9db 100644
--- a/sys/netinet/in_rmx.c
+++ b/sys/netinet/in_rmx.c
@@ -122,12 +122,12 @@ in_matroute(void *v_arg, struct radix_node_head *head)
        struct rtentry *rt = (struct rtentry *)rn;
 
        if (rt) {
-               RT_LOCK(rt);
+//             RT_LOCK(rt);
                if (rt->rt_flags & RTPRF_OURS) {
                        rt->rt_flags &= ~RTPRF_OURS;
                        rt->rt_rmx.rmx_expire = 0;
                }
-               RT_UNLOCK(rt);
+//             RT_UNLOCK(rt);
        }
        return rn;
 }
@@ -365,7 +365,7 @@ in_inithead(void **head, int off)
 
        rnh = *head;
        rnh->rnh_addaddr = in_addroute;
-       rnh->rnh_matchaddr = in_matroute;
+       rnh->rnh_matchaddr = rn_match;
        rnh->rnh_close = in_clsroute;
        if (_in_rt_was_here == 0 ) {
                callout_init(&V_rtq_timer, CALLOUT_MPSAFE);
diff --git a/sys/netinet/ip_fastfwd.c b/sys/netinet/ip_fastfwd.c
index d7fe411..d2b98b3 100644
--- a/sys/netinet/ip_fastfwd.c
+++ b/sys/netinet/ip_fastfwd.c
@@ -112,6 +112,22 @@ static VNET_DEFINE(int, ipfastforward_active);
 SYSCTL_VNET_INT(_net_inet_ip, OID_AUTO, fastforwarding, CTLFLAG_RW,
     &VNET_NAME(ipfastforward_active), 0, "Enable fast IP forwarding");
 
+void
+rtalloc_ign_fib_nolock(struct route *ro, u_long ignore, u_int fibnum);
+
+void
+rtalloc_ign_fib_nolock(struct route *ro, u_long ignore, u_int fibnum)
+{
+       struct rtentry *rt;
+
+       if ((rt = ro->ro_rt) != NULL) {
+               if (rt->rt_ifp != NULL && rt->rt_flags & RTF_UP)
+                       return;
+               ro->ro_rt = NULL;
+       }
+       ro->ro_rt = rtalloc1_fib_nolock(&ro->ro_dst, 1, ignore, fibnum);
+}
+
 static struct sockaddr_in *
 ip_findroute(struct route *ro, struct in_addr dest, struct mbuf *m)
 {
@@ -126,7 +142,7 @@ ip_findroute(struct route *ro, struct in_addr dest, struct mbuf *m)
        dst->sin_family = AF_INET;
        dst->sin_len = sizeof(*dst);
        dst->sin_addr.s_addr = dest.s_addr;
-       in_rtalloc_ign(ro, 0, M_GETFIB(m));
+       rtalloc_ign_fib_nolock(ro, 0, M_GETFIB(m));
 
        /*
         * Route there and interface still up?
@@ -140,8 +156,10 @@ ip_findroute(struct route *ro, struct in_addr dest, struct mbuf *m)
        } else {
                IPSTAT_INC(ips_noroute);
                IPSTAT_INC(ips_cantforward);
+#if 0
                if (rt)
                        RTFREE(rt);
+#endif
                icmp_error(m, ICMP_UNREACH, ICMP_UNREACH_HOST, 0, 0);
                return NULL;
        }
@@ -334,10 +352,11 @@ ip_fastforward(struct mbuf *m)
        if (in_localip(ip->ip_dst))
                return m;
 
-       IPSTAT_INC(ips_total);
+       //IPSTAT_INC(ips_total);
 
        /*
         * Step 3: incoming packet firewall processing
+       in_rtalloc_ign(ro, 0, M_GETFIB(m));
         */
 
        /*
@@ -476,8 +495,10 @@ forwardlocal:
                         * "ours"-label.
                         */
                        m->m_flags |= M_FASTFWD_OURS;
+/*
                        if (ro.ro_rt)
                                RTFREE(ro.ro_rt);
+*/                             
                        return m;
                }
                /*
@@ -490,7 +511,7 @@ forwardlocal:
                        m_tag_delete(m, fwd_tag);
                }
 #endif /* IPFIREWALL_FORWARD */
-               RTFREE(ro.ro_rt);
+//             RTFREE(ro.ro_rt);
                if ((dst = ip_findroute(&ro, dest, m)) == NULL)
                        return NULL;    /* icmp unreach already sent */
                ifp = ro.ro_rt->rt_ifp;
@@ -601,17 +622,21 @@ passout:
        if (error != 0)
                IPSTAT_INC(ips_odropped);
        else {
+#if 0
                ro.ro_rt->rt_rmx.rmx_pksent++;
                IPSTAT_INC(ips_forward);
                IPSTAT_INC(ips_fastforward);
+#endif
        }
 consumed:
-       RTFREE(ro.ro_rt);
+//     RTFREE(ro.ro_rt);
        return NULL;
 drop:
        if (m)
                m_freem(m);
+/*
        if (ro.ro_rt)
                RTFREE(ro.ro_rt);
+*/             
        return NULL;
 }
diff --git a/sys/netinet6/in6_rmx.c b/sys/netinet6/in6_rmx.c
index b526030..9aabe63 100644
--- a/sys/netinet6/in6_rmx.c
+++ b/sys/netinet6/in6_rmx.c
@@ -195,12 +195,12 @@ in6_matroute(void *v_arg, struct radix_node_head *head)
        struct rtentry *rt = (struct rtentry *)rn;
 
        if (rt) {
-               RT_LOCK(rt);
+               //RT_LOCK(rt);
                if (rt->rt_flags & RTPRF_OURS) {
                        rt->rt_flags &= ~RTPRF_OURS;
                        rt->rt_rmx.rmx_expire = 0;
                }
-               RT_UNLOCK(rt);
+               //RT_UNLOCK(rt);
        }
        return rn;
 }
@@ -440,7 +440,7 @@ in6_inithead(void **head, int off)
 
        rnh = *head;
        rnh->rnh_addaddr = in6_addroute;
-       rnh->rnh_matchaddr = in6_matroute;
+       rnh->rnh_matchaddr = rn_match;
 
        if (V__in6_rt_was_here == 0) {
                callout_init(&V_rtq_timer6, CALLOUT_MPSAFE);
commit 0e7cebd1753c3b77bdc00d728fbd5910c2d2afec
Author: Charlie Root <r...@test15.yandex.net>
Date:   Mon Apr 8 15:35:00 2013 +0000

    Make radix use rmlock.

diff --git a/sys/contrib/ipfilter/netinet/ip_compat.h b/sys/contrib/ipfilter/netinet/ip_compat.h
index 31e5b11..5e74da4 100644
--- a/sys/contrib/ipfilter/netinet/ip_compat.h
+++ b/sys/contrib/ipfilter/netinet/ip_compat.h
@@ -870,6 +870,7 @@ typedef     u_int32_t       u_32_t;
 # if (__FreeBSD_version >= 500043)
 #  include <sys/mutex.h>
 #  if (__FreeBSD_version > 700014)
+#   include <sys/rmlock.h>
 #   include <sys/rwlock.h>
 #    define    KRWLOCK_T               struct rwlock
 #    ifdef _KERNEL
diff --git a/sys/contrib/pf/net/pf_table.c b/sys/contrib/pf/net/pf_table.c
index 40c9f67..b1dd703 100644
--- a/sys/contrib/pf/net/pf_table.c
+++ b/sys/contrib/pf/net/pf_table.c
@@ -44,6 +44,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/mbuf.h>
 #include <sys/kernel.h>
 #include <sys/lock.h>
+#include <sys/rmlock.h>
 #include <sys/rwlock.h>
 #ifdef __FreeBSD__
 #include <sys/malloc.h>
diff --git a/sys/kern/subr_witness.c b/sys/kern/subr_witness.c
index e565d01..f913d27 100644
--- a/sys/kern/subr_witness.c
+++ b/sys/kern/subr_witness.c
@@ -508,7 +508,7 @@ static struct witness_order_list_entry order_lists[] = {
         * Routing
         */
        { "so_rcv", &lock_class_mtx_sleep },
-       { "radix node head", &lock_class_rw },
+       { "radix node head", &lock_class_rm },
        { "rtentry", &lock_class_mtx_sleep },
        { "ifaddr", &lock_class_mtx_sleep },
        { NULL, NULL },
diff --git a/sys/kern/sys_socket.c b/sys/kern/sys_socket.c
index 4cbae74..fea12d0 100644
--- a/sys/kern/sys_socket.c
+++ b/sys/kern/sys_socket.c
@@ -50,6 +50,8 @@ __FBSDID("$FreeBSD$");
 #include <sys/ucred.h>
 
 #include <net/if.h>
+#include <sys/lock.h>
+#include <sys/rmlock.h>
 #include <net/route.h>
 #include <net/vnet.h>
 
diff --git a/sys/kern/vfs_export.c b/sys/kern/vfs_export.c
index 4185211..848c232 100644
--- a/sys/kern/vfs_export.c
+++ b/sys/kern/vfs_export.c
@@ -47,7 +47,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/mbuf.h>
 #include <sys/mount.h>
 #include <sys/mutex.h>
-#include <sys/rwlock.h>
+#include <sys/rmlock.h>
 #include <sys/refcount.h>
 #include <sys/socket.h>
 #include <sys/systm.h>
@@ -427,6 +427,7 @@ vfs_export_lookup(struct mount *mp, struct sockaddr *nam)
        register struct netcred *np;
        register struct radix_node_head *rnh;
        struct sockaddr *saddr;
+       RADIX_NODE_HEAD_READER;
 
        nep = mp->mnt_export;
        if (nep == NULL)
diff --git a/sys/net/if.c b/sys/net/if.c
index 5ecde8c..351e046 100644
--- a/sys/net/if.c
+++ b/sys/net/if.c
@@ -51,6 +51,7 @@
 #include <sys/lock.h>
 #include <sys/refcount.h>
 #include <sys/module.h>
+#include <sys/rmlock.h>
 #include <sys/rwlock.h>
 #include <sys/sockio.h>
 #include <sys/syslog.h>
diff --git a/sys/net/radix.c b/sys/net/radix.c
index 33fcf82..d8d1e8b 100644
--- a/sys/net/radix.c
+++ b/sys/net/radix.c
@@ -37,7 +37,7 @@
 #ifdef _KERNEL
 #include <sys/lock.h>
 #include <sys/mutex.h>
-#include <sys/rwlock.h>
+#include <sys/rmlock.h>
 #include <sys/systm.h>
 #include <sys/malloc.h>
 #include <sys/syslog.h>
diff --git a/sys/net/radix.h b/sys/net/radix.h
index 29659b5..2d130f0 100644
--- a/sys/net/radix.h
+++ b/sys/net/radix.h
@@ -36,7 +36,7 @@
 #ifdef _KERNEL
 #include <sys/_lock.h>
 #include <sys/_mutex.h>
-#include <sys/_rwlock.h>
+#include <sys/_rmlock.h>
 #endif
 
 #ifdef MALLOC_DECLARE
@@ -133,7 +133,7 @@ struct radix_node_head {
        struct  radix_node rnh_nodes[3];        /* empty tree for common case */
        int     rnh_multipath;                  /* multipath capable ? */
 #ifdef _KERNEL
-       struct  rwlock rnh_lock;                /* locks entire radix tree */
+       struct  rmlock rnh_lock;                /* locks entire radix tree */
 #endif
 };
 
@@ -146,18 +146,21 @@ struct radix_node_head {
 #define R_Zalloc(p, t, n) (p = (t) malloc((unsigned long)(n), M_RTABLE, M_NOWAIT | M_ZERO))
 #define Free(p) free((caddr_t)p, M_RTABLE);
 
+#define        RADIX_NODE_HEAD_READER          struct rm_priotracker tracker
 #define        RADIX_NODE_HEAD_LOCK_INIT(rnh)  \
-    rw_init_flags(&(rnh)->rnh_lock, "radix node head", 0)
-#define        RADIX_NODE_HEAD_LOCK(rnh)       rw_wlock(&(rnh)->rnh_lock)
-#define        RADIX_NODE_HEAD_UNLOCK(rnh)     rw_wunlock(&(rnh)->rnh_lock)
-#define        RADIX_NODE_HEAD_RLOCK(rnh)      rw_rlock(&(rnh)->rnh_lock)
-#define        RADIX_NODE_HEAD_RUNLOCK(rnh)    rw_runlock(&(rnh)->rnh_lock)
-#define        RADIX_NODE_HEAD_LOCK_TRY_UPGRADE(rnh)   rw_try_upgrade(&(rnh)->rnh_lock)
-
-
-#define        RADIX_NODE_HEAD_DESTROY(rnh)    rw_destroy(&(rnh)->rnh_lock)
-#define        RADIX_NODE_HEAD_LOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_LOCKED)
-#define        RADIX_NODE_HEAD_WLOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_WLOCKED)
+    rm_init(&(rnh)->rnh_lock, "radix node head")
+#define        RADIX_NODE_HEAD_LOCK(rnh)       rm_wlock(&(rnh)->rnh_lock)
+#define        RADIX_NODE_HEAD_UNLOCK(rnh)     rm_wunlock(&(rnh)->rnh_lock)
+#define        RADIX_NODE_HEAD_RLOCK(rnh)      rm_rlock(&(rnh)->rnh_lock, &tracker)
+#define        RADIX_NODE_HEAD_RUNLOCK(rnh)    rm_runlock(&(rnh)->rnh_lock, &tracker)
+//#define      RADIX_NODE_HEAD_LOCK_TRY_UPGRADE(rnh)   rw_try_upgrade(&(rnh)->rnh_lock)
+
+
+#define        RADIX_NODE_HEAD_DESTROY(rnh)    rm_destroy(&(rnh)->rnh_lock)
+#define        RADIX_NODE_HEAD_LOCK_ASSERT(rnh)
+#define        RADIX_NODE_HEAD_WLOCK_ASSERT(rnh)
+//#define      RADIX_NODE_HEAD_LOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_LOCKED)
+//#define      RADIX_NODE_HEAD_WLOCK_ASSERT(rnh) rw_assert(&(rnh)->rnh_lock, RA_WLOCKED)
 #endif /* _KERNEL */
 
 void    rn_init(int);
diff --git a/sys/net/radix_mpath.c b/sys/net/radix_mpath.c
index ee7826f..c69888e 100644
--- a/sys/net/radix_mpath.c
+++ b/sys/net/radix_mpath.c
@@ -45,6 +45,8 @@ __FBSDID("$FreeBSD$");
 #include <sys/socket.h>
 #include <sys/domain.h>
 #include <sys/syslog.h>
+#include <sys/lock.h>
+#include <sys/rmlock.h>
 #include <net/radix.h>
 #include <net/radix_mpath.h>
 #include <net/route.h>
diff --git a/sys/net/route.c b/sys/net/route.c
index 5d56688..2cf6ea5 100644
--- a/sys/net/route.c
+++ b/sys/net/route.c
@@ -52,6 +52,8 @@
 #include <sys/proc.h>
 #include <sys/domain.h>
 #include <sys/kernel.h>
+#include <sys/lock.h>
+#include <sys/rmlock.h>
 
 #include <net/if.h>
 #include <net/if_dl.h>
@@ -544,6 +546,7 @@ rtalloc1_fib_nolock(struct sockaddr *dst, int report, u_long ignflags,
        struct rtentry *newrt;
        struct rt_addrinfo info;
        int err = 0, msgtype = RTM_MISS;
+       RADIX_NODE_HEAD_READER;
        int needlock;
 
        KASSERT((fibnum < rt_numfibs), ("rtalloc1_fib: bad fibnum"));
@@ -612,6 +615,7 @@ rtalloc1_fib(struct sockaddr *dst, int report, u_long ignflags,
        struct rtentry *newrt;
        struct rt_addrinfo info;
        int err = 0, msgtype = RTM_MISS;
+       RADIX_NODE_HEAD_READER;
        int needlock;
 
        KASSERT((fibnum < rt_numfibs), ("rtalloc1_fib: bad fibnum"));
@@ -799,6 +803,7 @@ rtredirect_fib(struct sockaddr *dst,
        struct rt_addrinfo info;
        struct ifaddr *ifa;
        struct radix_node_head *rnh;
+       RADIX_NODE_HEAD_READER;
 
        ifa = NULL;
        rnh = rt_tables_get_rnh(fibnum, dst->sa_family);
diff --git a/sys/net/rtsock.c b/sys/net/rtsock.c
index 58c46a6..18d3e06 100644
--- a/sys/net/rtsock.c
+++ b/sys/net/rtsock.c
@@ -45,6 +45,7 @@
 #include <sys/priv.h>
 #include <sys/proc.h>
 #include <sys/protosw.h>
+#include <sys/rmlock.h>
 #include <sys/rwlock.h>
 #include <sys/signalvar.h>
 #include <sys/socket.h>
@@ -577,6 +578,7 @@ route_output(struct mbuf *m, struct socket *so)
        struct ifnet *ifp = NULL;
        union sockaddr_union saun;
        sa_family_t saf = AF_UNSPEC;
+       RADIX_NODE_HEAD_READER;
 
 #define senderr(e) { error = e; goto flush;}
        if (m == NULL || ((m->m_len < sizeof(long)) &&
@@ -1818,6 +1820,7 @@ sysctl_rtsock(SYSCTL_HANDLER_ARGS)
        int     i, lim, error = EINVAL;
        u_char  af;
        struct  walkarg w;
+       RADIX_NODE_HEAD_READER;
 
        name ++;
        namelen--;
diff --git a/sys/netinet/in_rmx.c b/sys/netinet/in_rmx.c
index 1c9d9db..775ba5a 100644
--- a/sys/netinet/in_rmx.c
+++ b/sys/netinet/in_rmx.c
@@ -53,6 +53,8 @@ __FBSDID("$FreeBSD$");
 #include <sys/callout.h>
 
 #include <net/if.h>
+#include <sys/lock.h>
+#include <sys/rmlock.h>
 #include <net/route.h>
 #include <net/vnet.h>
 
diff --git a/sys/netinet6/in6_ifattach.c b/sys/netinet6/in6_ifattach.c
index 80eb022..cbfe1d8 100644
--- a/sys/netinet6/in6_ifattach.c
+++ b/sys/netinet6/in6_ifattach.c
@@ -42,6 +42,8 @@ __FBSDID("$FreeBSD$");
 #include <sys/proc.h>
 #include <sys/syslog.h>
 #include <sys/md5.h>
+#include <sys/lock.h>
+#include <sys/rmlock.h>
 
 #include <net/if.h>
 #include <net/if_dl.h>
diff --git a/sys/netinet6/in6_rmx.c b/sys/netinet6/in6_rmx.c
index 9aabe63..a291db2 100644
--- a/sys/netinet6/in6_rmx.c
+++ b/sys/netinet6/in6_rmx.c
@@ -84,6 +84,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/socket.h>
 #include <sys/socketvar.h>
 #include <sys/mbuf.h>
+#include <sys/rmlock.h>
 #include <sys/rwlock.h>
 #include <sys/syslog.h>
 #include <sys/callout.h>
diff --git a/sys/netinet6/nd6_rtr.c b/sys/netinet6/nd6_rtr.c
index 687d84d..7737d47 100644
--- a/sys/netinet6/nd6_rtr.c
+++ b/sys/netinet6/nd6_rtr.c
@@ -45,6 +45,7 @@ __FBSDID("$FreeBSD: stable/8/sys/netinet6/nd6_rtr.c 233201 2012-03-19 20:49:42Z
 #include <sys/kernel.h>
 #include <sys/lock.h>
 #include <sys/errno.h>
+#include <sys/rmlock.h>
 #include <sys/rwlock.h>
 #include <sys/syslog.h>
 #include <sys/queue.h>
commit 963196095589c03880ddd13a5c16f9e50cf6d7ce
Author: Charlie Root <r...@test15.yandex.net>
Date:   Sun Nov 4 15:52:50 2012 +0000

    Do not require locking arp lle

diff --git a/sys/net/if_llatbl.h b/sys/net/if_llatbl.h
index 9f6531b..c1b2af9 100644
--- a/sys/net/if_llatbl.h
+++ b/sys/net/if_llatbl.h
@@ -169,6 +169,7 @@ MALLOC_DECLARE(M_LLTABLE);
 #define        LLE_PUB         0x0020  /* publish entry ??? */
 #define        LLE_DELETE      0x4000  /* delete on a lookup - match LLE_IFADDR */
 #define        LLE_CREATE      0x8000  /* create on a lookup miss */
+#define        LLE_UNLOCKED    0x1000  /* return lle unlocked */
 #define        LLE_EXCLUSIVE   0x2000  /* return lle xlocked  */
 
 #define LLATBL_HASH(key, mask) \
diff --git a/sys/netinet/if_ether.c b/sys/netinet/if_ether.c
index f61b803..ecb9b8e 100644
--- a/sys/netinet/if_ether.c
+++ b/sys/netinet/if_ether.c
@@ -283,10 +283,10 @@ arpresolve(struct ifnet *ifp, struct rtentry *rt0, struct mbuf *m,
        struct sockaddr *dst, u_char *desten, struct llentry **lle)
 {
        struct llentry *la = 0;
-       u_int flags = 0;
+       u_int flags = LLE_UNLOCKED;
        struct mbuf *curr = NULL;
        struct mbuf *next = NULL;
-       int error, renew;
+       int error, renew = 0;
 
        *lle = NULL;
        if (m != NULL) {
@@ -307,7 +307,41 @@ arpresolve(struct ifnet *ifp, struct rtentry *rt0, struct mbuf *m,
 retry:
        IF_AFDATA_RLOCK(ifp);   
        la = lla_lookup(LLTABLE(ifp), flags, dst);
+
+       /*
+        * Fast path. Do not require rlock on llentry.
+        */
+       if ((la != NULL) && (flags & LLE_UNLOCKED)) {
+               if ((la->la_flags & LLE_VALID) &&
+                   ((la->la_flags & LLE_STATIC) || la->la_expire > time_uptime)) {
+                       bcopy(&la->ll_addr, desten, ifp->if_addrlen);
+                       /*
+                        * If entry has an expiry time and it is approaching,
+                        * see if we need to send an ARP request within this
+                        * arpt_down interval.
+                        */
+                       if (!(la->la_flags & LLE_STATIC) &&
+                           time_uptime + la->la_preempt > la->la_expire) {
+                               renew = 1;
+                               la->la_preempt--;
+                       }
+
+                       IF_AFDATA_RUNLOCK(ifp);
+                       if (renew != 0)
+                               arprequest(ifp, NULL, &SIN(dst)->sin_addr, NULL);
+
+                       return (0);
+               }
+
+               /* Revert to normal path for other cases */
+               *lle = la;
+               LLE_RLOCK(la);
+       }
+
+       flags &= ~LLE_UNLOCKED;
+
        IF_AFDATA_RUNLOCK(ifp); 
+
        if ((la == NULL) && ((flags & LLE_EXCLUSIVE) == 0)
            && ((ifp->if_flags & (IFF_NOARP | IFF_STATICARP)) == 0)) {          
                flags |= (LLE_CREATE | LLE_EXCLUSIVE);
@@ -324,27 +358,6 @@ retry:
                return (EINVAL);
        } 
 
-       if ((la->la_flags & LLE_VALID) &&
-           ((la->la_flags & LLE_STATIC) || la->la_expire > time_second)) {
-               bcopy(&la->ll_addr, desten, ifp->if_addrlen);
-               /*
-                * If entry has an expiry time and it is approaching,
-                * see if we need to send an ARP request within this
-                * arpt_down interval.
-                */
-               if (!(la->la_flags & LLE_STATIC) &&
-                   time_second + la->la_preempt > la->la_expire) {
-                       arprequest(ifp, NULL,
-                           &SIN(dst)->sin_addr, IF_LLADDR(ifp));
-
-                       la->la_preempt--;
-               }
-               
-               *lle = la;
-               error = 0;
-               goto done;
-       } 
-                           
        if (la->la_flags & LLE_STATIC) {   /* should not happen! */
                log(LOG_DEBUG, "arpresolve: ouch, empty static llinfo for %s\n",
                    inet_ntoa(SIN(dst)->sin_addr));
diff --git a/sys/netinet/in.c b/sys/netinet/in.c
index eaba4e5..5341918 100644
--- a/sys/netinet/in.c
+++ b/sys/netinet/in.c
@@ -1561,7 +1561,7 @@ in_lltable_lookup(struct lltable *llt, u_int flags, const struct sockaddr *l3add
        if (LLE_IS_VALID(lle)) {
                if (flags & LLE_EXCLUSIVE)
                        LLE_WLOCK(lle);
-               else
+               else if (!(flags & LLE_UNLOCKED))
                        LLE_RLOCK(lle);
        }
 done: