On Fri, 20 Dec 2019 13:09:52 -0500 Nick Wolff <darkfiber...@gmail.com> wrote:
> Marko,
>
> Are you aware of any write ups for using ng_eiface and ng_bridge
> instead of if_bridge?

It is not that complex at all:

# kldload ng_ether
# ifconfig em0 promisc
# ngctl mkpeer em0: bridge lower link0
# ngctl name em0:lower b0
# ngctl connect em0: b0: upper link1
# ngctl mkpeer b0: eiface link2 ether
# ngctl mkpeer b0: eiface link3 ether
# ngctl mkpeer b0: eiface link4 ether

Done - this should create interfaces ngeth0, ngeth1 and ngeth2, which
one can assign to vnet jails.  Note that unlike epair, ngeth instances
do not automatically get a MAC address assigned, at least not on
11.x / 12.x, so this is an extra step one has to perform on one's own.

In our setup, we actually use https://github.com/imunes/imunes to set
up the (netgraph-based) virtual network and nodes (vnet jails, aka
vimages).  Works reasonably well, bearing in mind that the thing was
devised as a network emulation tool, not a virtual host provisioning
framework.

Marko

>
> Thanks,
>
> Nick Wolff
>
> On Fri, Dec 20, 2019 at 6:22 AM Marko Zec <z...@fer.hr> wrote:
>
> > Perhaps you could ditch if_bridge(4) and epair(4), and try
> > ng_eiface(4) with ng_bridge(4) instead?  Works rock-solid 24/7 here
> > on 11.2 / 11.3.
> >
> > Marko
> >
> > On Fri, 20 Dec 2019 11:19:24 +0100
> > "Patrick M. Hausen" <hau...@punkt.de> wrote:
> >
> > > Hi all,
> > >
> > > we still experience occasional network outages in production,
> > > yet have not been able to find the root cause.
> > >
> > > We run around 50 servers with VNET jails, some of them with
> > > a handful, the busiest ones with 50 or more jails each.
> > >
> > > Every now and then the jails are not reachable over the net
> > > anymore. The server itself is up and running, all jails are
> > > up and running, one can ssh to the server but none of the
> > > jails can communicate over the network.
> > >
> > > There seems to be no pattern to the time of occurrence except
> > > that more jails on one system make it "more likely".
> > > Also having more than one bridge, e.g. for private networks
> > > between jails, seems to increase the probability.
> > > When a server shows the problem it tends to get into the state
> > > rather frequently, with a couple of hours in between. Then again
> > > most servers run for weeks without exhibiting the problem.
> > > That's what makes it so hard to reproduce. The last couple of
> > > days one system was failing regularly until we reduced the number
> > > of jails from around 80 to around 50. Now it seems stable again.
> > >
> > > I have a test system with lots of jails that I work with gatling
> > > that did not show a single failure so far :-(
> > >
> > >
> > > Setup:
> > >
> > > All jails are iocage jails with VNET interfaces. They are
> > > connected to at least one bridge that starts with the
> > > physical external interface as a member and gets jails'
> > > epair interfaces added as they start up. All jails are managed
> > > by iocage.
> > >
> > > ifconfig_igb0="-rxcsum -rxcsum6 -txcsum -txcsum6 -vlanhwtag -vlanhwtso up"
> > > cloned_interfaces="bridge0"
> > > ifconfig_bridge0_name="inet0"
> > > ifconfig_inet0="addm igb0 up"
> > > ifconfig_inet0_ipv6="inet6 <host-address>/64 auto_linklocal"
> > >
> > > $ iocage get interfaces vpro0087
> > > vnet0:inet0
> > >
> > > $ ifconfig inet0
> > > inet0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
> > >         ether 90:1b:0e:63:ef:51
> > >         inet6 fe80::921b:eff:fe63:ef51%inet0 prefixlen 64 scopeid 0x4
> > >         inet6 <host-address> prefixlen 64
> > >         nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
> > >         groups: bridge
> > >         id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
> > >         maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
> > >         root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
> > >         member: vnet0.4 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
> > >                 ifmaxaddr 0 port 7 priority 128 path cost 2000
> > >         member: vnet0.1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
> > >                 ifmaxaddr 0 port 6 priority 128 path cost 2000
> > >         member: igb0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
> > >                 ifmaxaddr 0 port 1 priority 128 path cost 2000000
> > >
> > >
> > > What we tried:
> > >
> > > At first we suspected the bridge to become "wedged" somehow. This
> > > was corroborated by talking to various people at devsummits and
> > > EuroBSDCon, with Kristof Provost specifically suggesting that
> > > if_bridge was still under giant lock and there might be a problem
> > > here that the lock is not released under some race condition and
> > > then the entire bridge subsystem would be stalled. That sounds
> > > plausible given the random occurrence.
> > >
> > > But I think we can rule out that one, because:
> > >
> > > - ifconfig up/down does not help
> > > - the host is still communicating fine over the same bridge
> > >   interface
> > > - tearing down the bridge, kldunload (!) of if_bridge.ko followed
> > >   by a new kldload and reconstructing the members with `ifconfig
> > >   addm` does not help, either
> > > - only a host reboot restores function
> > >
> > > Finally I created a jail that is not managed by iocage on the
> > > problem host. Please ignore the `iocage` in the path, I used it
> > > to populate the root directory. But it is not started by iocage
> > > at boot time and the manual config is this:
> > >
> > > testjail {
> > >         host.hostname = "testjail";             # hostname
> > >         path = "/iocage/jails/testjail/root";   # root directory
> > >         exec.clean;
> > >         exec.system_user = "root";
> > >         exec.jail_user = "root";
> > >         vnet;
> > >         vnet.interface = "epair999b";
> > >         exec.prestart += "ifconfig epair999 create; ifconfig epair999a inet6 2A00:B580:8000:8000::1/64 auto_linklocal";
> > >         exec.poststop += "sleep 2; ifconfig epair999a destroy; sleep 2";
> > >         # Standard stuff
> > >         exec.start += "/bin/sh /etc/rc";
> > >         exec.stop = "/bin/sh /etc/rc.shutdown";
> > >         exec.consolelog = "/var/log/jail_testjail_console.log";
> > >         mount.devfs;          #mount devfs
> > >         allow.raw_sockets;    #allow ping-pong
> > >         devfs_ruleset="4";    #devfs ruleset for this jail
> > > }
> > >
> > > $ cat /iocage/jails/testjail/root/etc/rc.conf
> > > hostname="testjail"
> > >
> > > ifconfig_epair999b_ipv6="inet6 2A00:B580:8000:8000::2/64 auto_linklocal"
> > >
> > > When I do `service jail onestart testjail` I can then ping6 the
> > > jail from the host and the host from the jail. As you can see the
> > > if_bridge is not involved in this traffic.
> > >
> > > When the host is in the wedged state and I start this testjail the
> > > same way, no communication across the epair interface is possible.
> > >
> > > To me this seems to indicate that not the bridge but all epair
> > > interfaces stop working at the very same time.
> > >
> > >
> > > OS is RELENG_11_3, hardware and specifically network adapters
> > > vary, we have igb, ix, ixl, bnxt ...
> > >
> > >
> > > Does anyone have a suggestion what diagnostic measures could help
> > > to pinpoint the culprit? The random occurrence and the fact that
> > > the problem seems to prefer the production environment only make
> > > this a real pain ...
> > >
> > >
> > > Thanks and kind regards,
> > > Patrick
> >
> > _______________________________________________
> > freebsd-net@freebsd.org mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-net
> > To unsubscribe, send any mail to
> > "freebsd-net-unsubscr...@freebsd.org"
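
Regarding the note in the reply that ngeth instances do not get a MAC
address on their own: the following is only a minimal sketch of that
extra step, not something taken from the thread.  The helper name
ngeth_mac and the address scheme (02:00 followed by a CRC of the
interface name) are assumptions made for illustration; any scheme that
yields unique, locally administered addresses would do.  The loop only
prints the ifconfig commands so they can be reviewed before being run
as root.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): derive a stable MAC address
# from an interface name, so every ngethN gets a distinct address.
ngeth_mac() {
    # cksum prints "<crc32> <byte count>"; keep only the 32-bit CRC.
    crc=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
    # First octet 02 = locally administered, unicast.
    # The remaining octets: 00 followed by the four CRC bytes.
    printf '02:00:%02x:%02x:%02x:%02x\n' \
        $(( (crc >> 24) & 255 )) $(( (crc >> 16) & 255 )) \
        $(( (crc >> 8) & 255 ))  $(( crc & 255 ))
}

# Dry run: print one ifconfig command per interface created above.
for ifc in ngeth0 ngeth1 ngeth2; do
    echo "ifconfig $ifc ether $(ngeth_mac "$ifc")"
done
```

Piping the printed lines to sh on the host, before the interfaces are
handed to their jails, would apply the addresses.  Deriving the address
from the name keeps it the same across reboots, as long as the ngeth
numbering itself is stable.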