On Tue, Jul 30, 2019 at 01:36:59PM +1000, David Gwynne wrote:
> a Two-Port MAC Relay is basically a cut down bridge(4). it only supports
> two ports, and unconditionally relays packets between those ports
> instead of doing learning or anything like that.
> 
> i've been trying to get a redundant pair of bridges set up between two
> datacenters here to help me while i migrate between them. so far all my
> efforts to make it redundant have mostly worked, until they introduced
> loops in the layer 2 topology, which generates a broadcast storm, which
> basically takes the net down for a few minutes at a time. it feels
> very betraying.
> 
> my frustration is that switches plugged together have mechanisms to
> prevent loops like that, more specifically they use spanning tree or
> lacp to make appropriate use of redundant links. i got to a point where
> i just wanted the switches to talk to each other and do their own thing
> to negotiate use of the redundant links.
> 
> unfortunately the only way to get ethernet packets off a physical
> wire and onto a tunnel over an ip network is bridge(4), and bridge(4)
> tries to be a compliant switch from a standards point of view. this
> means it intercepts packets that are meant to be processed by bridges,
> because it is a bridge. these types of packets include spanning tree and
> lacp, which means i couldn't get the physical switches at each site to
> talk to each other. sadface.
> 
> so to solve my problem i hacked up a small driver that did less than
> bridge(4). however, it turns out that what i hacked up is an actual
> thing that already exists as something done in the real world. IEEE
> 802.1Q describes TPMR, which is defined as intercepting far less
> than a real bridge does. one of the appendices specifically describes
> lacp going through one, which is exactly what i wanted. cisco does
> something like this with their layer 2 cross-connects (search for cisco
> xconnect for examples), juniper has l2circuits, and so on.
> 
> the way i'm using this is like below. i have a pair of bridges in each
> datacenter, so 4 boxes in total. they peer directly with the ip network
> that sits between the datacenters. each box has 4 physical network
> ports. 2 of those ports are configured with aggr(4) and talk IP into the
> core network. the other two ports are connected to the switches at
> each site for use with tpmr. there's 2 etherip interfaces configured on
> each physical box, each of which is connected to the tpmr.
> 
> all that together looks a bit like the following:
> 
>  +-+ +--------------------------+      +---------------------------+ +-+
>  |d|-|ix2 <-> tpmr0 <-> etherip0|------|etherip0 <-> tpmr0 <-> ixl0|-|d|
>  |c| |                          |      |                           | |c|
>  |0|-|ix3 <-> tpmr1 <-> etherip1|-    -|etherip1 <-> tpmr1 <-> ixl1|-|1|
>  ||| +--------------------------+ \  / +---------------------------+ |||
>  |s|         dc0-bridge0           \/          dc1-bridge0           |s|
>  |w|                               /\                                |w|
>  |i| +--------------------------+ /  \ +---------------------------+ |i|
>  |t|-|ix2 <-> tpmr0 <-> etherip0|-    -|etherip0 <-> tpmr0 <-> ixl0|-|t|
>  |c| |                          |      |                           | |c|
>  |h|-|ix3 <-> tpmr1 <-> etherip1|------|etherip1 <-> tpmr1 <-> ixl1|-|h|
>  +-+ +--------------------------+      +---------------------------+ +-+
>              dc0-bridge1                       dc1-bridge1
> 
> each switch has a 4 port port-channel (lacp aggregation) set up. because
> each physical interface on the bridges is tied to a single tunnel, the
> packets effectively traverse a point-to-point link, ie, a really
> complicated wire. because lacp makes it from each point to the other
> point, the switches make sure only active lacp ports are used, which
> avoids layer 2 loops. lacp also means i get to use all the links when
> they're available.
> 
> with the topology above i can lose a bridge at each site and should
> still have a working link to the other side, so i get my redundancy. the
> use of the extra links with lacp is a bonus. at this point i would have
> been happy for spanning tree to shut links down.
> 
> anyway, here's the code.
> 
> it was originally called xcon(4) since it provides a software
> cross-connect, but i changed my mind after looking at 802.1Q. it might
> be unfair to refer to 802.1Q because tpmr(4) does none of the filtering
> that the spec says it should. i just needed it to work though.
> 
> the guts of it is tpmr_input(). it basically gets the rxed packet from
> one port and enqueues it for transmission immediately on the other port.
> it does run bpf though, and supports filtering on bpf, which has been
> handy for us when we needed to test taking bpdus off the wire for a bit.
> 
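
To make the relay idea above concrete, here is a minimal user-space
sketch in C. It is not the tpmr(4) driver and does not use any kernel
API: struct relay, relay_init() and relay_input() are hypothetical
names made up for illustration, and the fixed-size queues stand in for
the real interface send queues. It only shows the core behaviour, a
frame received on one port is queued unconditionally on the other, with
no learning and no address lookup.

```c
/*
 * Toy model of a two-port relay. All names here are hypothetical;
 * this is an illustration, not code from the tpmr(4) diff.
 */
#include <stddef.h>
#include <string.h>

#define RELAY_NPORTS	2
#define RELAY_QLEN	8	/* assumed queue depth per port */
#define FRAME_MAX	64	/* assumed maximum frame size */

struct port {
	char	txq[RELAY_QLEN][FRAME_MAX];	/* frames waiting to go out */
	size_t	txlen[RELAY_QLEN];		/* length of each queued frame */
	size_t	count;				/* frames currently queued */
	size_t	drops;				/* frames dropped on this port */
};

struct relay {
	struct port	ports[RELAY_NPORTS];
};

void
relay_init(struct relay *r)
{
	memset(r, 0, sizeof(*r));
}

/*
 * A frame received on port `in` is queued for transmission on the
 * other port. There is no MAC table: the output port is always
 * "the one that is not the input port".
 * Returns 0 on success, -1 if the frame was rejected or dropped.
 */
int
relay_input(struct relay *r, int in, const void *frame, size_t len)
{
	struct port *out;

	if (in < 0 || in >= RELAY_NPORTS)
		return (-1);

	out = &r->ports[1 - in];
	if (len > FRAME_MAX || out->count == RELAY_QLEN) {
		out->drops++;
		return (-1);
	}

	memcpy(out->txq[out->count], frame, len);
	out->txlen[out->count] = len;
	out->count++;
	return (0);
}
```

In this model, calling relay_input(&r, 0, frame, len) leaves the frame
on port 1's queue and the other way around, regardless of what the
frame contains. That is why link-local protocols such as lacp pass
straight through in the sketch, where a learning bridge would consume
them.
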
> because it does such a small amount of work, it is relatively fast.
> hrvoje popovski has given it a quick spin and seen the following
> results on a fast box with a pair of ix(4) interfaces:
> 
> plain ip forwarding: 1.5Mpps
> bridge(4) under load from 14Mpps: 500Kpps
> bridge(4) under load from 1Mpps: 800Kpps
> tpmr(4): 1.75Mpps
> 
> 1.75Mpps was lower than I was expecting, but it turns out he was hitting
> limits in other parts of the system. with some tuning we got it up to
> 2.25Mpps. the softnet taskq was only at about 66% cpu time, but we
> couldn't see any other obvious places that we were dropping load.
> 
> on a slower box that can do IP forwarding at 1Mpps, tpmr(4) can do
> 1.6Mpps. it's worth noting that the boxes were extremely responsive (ie,
> ssh feels fine) when tpmr is under load, which is not the case when ip
> forwarding or bridge are being hammered.
> 
> my point is that it might be useful having tpmr(4) just to be able to
> test network driver performance improvements independently of the stack.
> i'm probably going to be using it to monitor links as a "bump in the
> wire" too.
> 
> lastly regarding the code. i made this use the trunk(4) ioctls instead
> of the bridge ones, mostly because i had to fake less stuff to make
> ifconfig output look ok.
> 
> ifconfig output looks like this:
> 
> xdlg@dc3-bridge1:~$ ifconfig tpmr
>      
> tpmr0: flags=51<UP,POINTOPOINT,RUNNING>
>       description: xconnect
>       index 15 priority 0 llprio 7
>       trunk: trunkproto none
>               ix2 port active,collecting,distributing
>               etherip10 port active,collecting,distributing
>       groups: tpmr
>       status: active
> 
> anyway. thoughts? ok?

Have you tried to use bridge with STP enabled in your setup? Just curious.
I understand that with STP on the OpenBSD box you could not use all links
and forwarding performance would not be as good.

Anyway, I think tpmr would be a nice addition!

Remi
