On Tue, Jul 30, 2019 at 01:36:59PM +1000, David Gwynne wrote:
> a Two-Port MAC Relay is basically a cut down bridge(4). it only supports
> two ports, and unconditionally relays packets between those ports
> instead of doing learning or anything like that.
>
> i've been trying to get a redundant pair of bridges set up between two
> datacenters here to help me while i migrate between them. so far all my
> efforts to make it redundant have mostly worked, until they introduced
> loops in the layer 2 topology, which generates a broadcast storm, which
> basically takes the net down for a few minutes at a time. it feels
> very betraying.
>
> my frustration is that switches plugged together have mechanisms to
> prevent loops like that, more specifically they use spanning tree or
> lacp to make appropriate use of redundant links. i got to a point where
> i just wanted the switches to talk to each other and do their own thing
> to negotiate use of the redundant links.
>
> unfortunately the only way to get ethernet packets off a physical
> wire and onto a tunnel over an ip network is bridge(4), and bridge(4)
> tries to be a compliant switch from a standards point of view. this
> means it intercepts packets that are meant to be processed by bridges,
> because it is a bridge. these types of packets include spanning tree and
> lacp, which means i couldnt get the physical switches at each site to
> talk to each other. sadface.
>
> so to solve my problem i hacked up a small driver that did less than
> bridge(4). however, it turns out that what i hacked up is an actual
> thing that already exists as something done in the real world. IEEE
> 802.1Q describes TPMR, which is defined as intercepting far less
> than a real bridge does. one of the appendices specifically describes
> lacp going through one, which is exactly what i wanted. cisco does
> something like this with their layer 2 cross-connects (search for cisco
> xconnect for examples), juniper has l2circuits, and so on.
>
> the way i'm using this is like below. i have a pair of bridges in each
> datacenter, so 4 boxes in total. they peer directly with the ip network
> that sits between the datacenters. each box has 4 physical network
> ports. 2 of those ports are configured with aggr(4) and talk IP into the
> core network. the other two ports are connected to the switches at
> each site for use with tpmr. there's 2 etherip interfaces configured on
> each physical box, each of which is connected to the tpmr.
>
> all that together looks a bit like the following:
>
> +-+ +--------------------------+      +---------------------------+ +-+
> |d|-|ix2 <-> tpmr0 <-> etherip0|------|etherip0 <-> tpmr0 <-> ixl0|-|d|
> |c| |                          |      |                           | |c|
> |0|-|ix3 <-> tpmr1 <-> etherip1|-    -|etherip1 <-> tpmr1 <-> ixl1|-|1|
> ||| +--------------------------+ \  / +---------------------------+ |||
> |s|         dc0-bridge0           \/           dc1-bridge0          |s|
> |w|                               /\                                |w|
> |i| +--------------------------+ /  \ +---------------------------+ |i|
> |t|-|ix2 <-> tpmr0 <-> etherip0|-    -|etherip0 <-> tpmr0 <-> ixl0|-|t|
> |c| |                          |      |                           | |c|
> |h|-|ix3 <-> tpmr1 <-> etherip1|------|etherip1 <-> tpmr1 <-> ixl1|-|h|
> +-+ +--------------------------+      +---------------------------+ +-+
>             dc0-bridge1                        dc1-bridge1
>
> each switch has a 4 port port-channel (lacp aggregation) set up. because
> each physical interface on the bridges is tied to a single tunnel, the
> packets effectively traverse a point-to-point link, ie, a really
> complicated wire. because lacp makes it from each point to the other
> point, the switches make sure only active lacp ports are used, which
> avoids layer 2 loops. lacp also means i get to use all the links when
> theyre available.
>
> with the topology above i can lose a bridge at each site and should
> still have a working link to the other side, so i get my redundancy. the
> use of the extra links with lacp is a bonus. at this point i would have
> been happy for spanning tree to shut links down.
>
> anyway, here's the code.
>
> it was originally called xcon(4) since it provides a software
> cross-connect, but i changed my mind after looking at 802.1Q. it might
> be unfair to refer to 802.1Q because tpmr(4) does none of the filtering
> that the spec says it should. i just needed it to work though.
>
> the guts of it is tpmr_input(). it basically gets the rxed packet from
> one port and enqueues it for transmission immediately on the other port.
> it does run bpf though, and supports filtering on bpf, which has been
> handy for us when we needed to test taking bpdus off the wire for a bit.
>
> because it does such a small amount of work, it is relatively fast.
> hrvoje popovski has given it a quick spin and seen the following
> results on a fast box with a pair of ix(4) interfaces:
>
> plain ip forwarding:                1.5Mpps
> bridge(4) under load from 14Mpps:   500Kpps
> bridge(4) under load from 1Mpps:    800Kpps
> tpmr(4):                            1.75Mpps
>
> 1.75Mpps was lower than I was expecting, but it turns out he was hitting
> limits in other parts of the system. with some tuning we got it up to
> 2.25Mpps. the softnet taskq was only at about 66% cpu time, but we
> couldnt see any other obvious places that we were dropping load.
>
> on a slower box that can do IP forwarding at 1Mpps, tpmr(4) can do
> 1.6Mpps. it's worth noting that the boxes were extremely responsive (ie,
> ssh feels fine) when tpmr is under load, which is not the case when ip
> forwarding or bridge are being hammered.
>
> my point is that it might be useful having tpmr(4) just to be able to
> test network driver performance improvements independently of the stack.
> im probably going to be using it to monitor links as a "bump in the
> wire" too.
>
> lastly regarding the code. i made this use the trunk(4) ioctls instead
> of the bridge ones, mostly because i had to fake less stuff to make
> ifconfig output look ok.
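
For reference, the relay step described for tpmr_input() can be pictured with a small userland sketch in C. It is not code from the diff: struct pkt, struct port, tpmr_relay() and drop_bpdu() are made-up names, the fixed-size array only stands in for an interface send queue, and the filter callback loosely models the bpf filtering mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TPMR_NPORTS	2	/* a two-port relay, by definition */
#define TPMR_QLEN	8	/* arbitrary queue depth for the sketch */

/* Hypothetical stand-in for a packet (the kernel would use an mbuf). */
struct pkt {
	unsigned char	data[64];
	size_t		len;
};

/* Hypothetical stand-in for an interface and its transmit queue. */
struct port {
	struct pkt	txq[TPMR_QLEN];
	size_t		txlen;
	unsigned long	drops;
};

struct tpmr {
	struct port	ports[TPMR_NPORTS];
};

/* Optional filter hook (assumption): nonzero means "drop this packet". */
typedef int (*tpmr_filter)(const struct pkt *);

/*
 * Relay a packet received on port `in` to the other port.
 * There is no learning and no address lookup: the output port is
 * always the peer of the input port.
 * Returns 1 if the packet was queued, 0 if it was filtered or dropped.
 */
static int
tpmr_relay(struct tpmr *t, int in, const struct pkt *p, tpmr_filter filt)
{
	struct port *out;

	assert(in == 0 || in == 1);

	if (filt != NULL && filt(p))
		return (0);

	out = &t->ports[1 - in];
	if (out->txlen == TPMR_QLEN) {
		out->drops++;	/* queue full: count the drop */
		return (0);
	}

	out->txq[out->txlen++] = *p;
	return (1);
}

/* Example filter: drop frames sent to the STP group address. */
static int
drop_bpdu(const struct pkt *p)
{
	static const unsigned char stp[6] =
	    { 0x01, 0x80, 0xc2, 0x00, 0x00, 0x00 };

	return (p->len >= 6 && memcmp(p->data, stp, 6) == 0);
}
```

With a zeroed struct tpmr, tpmr_relay(&t, 0, &frame, NULL) queues the frame on port 1, and a frame arriving on port 1 ends up on port 0. Passing drop_bpdu as the filter drops frames addressed to 01:80:c2:00:00:00, while LACP frames (01:80:c2:00:00:02) still pass.
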
>
> ifconfig output looks like this:
>
> xdlg@dc3-bridge1:~$ ifconfig tpmr
>
> tpmr0: flags=51<UP,POINTOPOINT,RUNNING>
>         description: xconnect
>         index 15 priority 0 llprio 7
>         trunk: trunkproto none
>                 ix2 port active,collecting,distributing
>                 etherip10 port active,collecting,distributing
>         groups: tpmr
>         status: active
>
> anyway. thoughts? ok?

Have you tried to use bridge with STP enabled in your setup? Just
curious. I understand that with STP on the OpenBSD box you could not
use all links and forwarding performance would not be as good.

Anyway, I think tpmr would be a nice addition!

Remi