On December 20, 2019 11:22:01 AM UTC, Marko Zec <z...@fer.hr> wrote:
>Perhaps you could ditch if_bridge(4) and epair(4), and try ng_eiface(4)
>with ng_bridge(4) instead? Works rock-solid 24/7 here on 11.2 / 11.3.
>
>Marko
>
>On Fri, 20 Dec 2019 11:19:24 +0100
>"Patrick M. Hausen" <hau...@punkt.de> wrote:
>
>> Hi all,
>>
>> we still experience occasional network outages in production,
>> yet have not been able to find the root cause.
>>
>> We run around 50 servers with VNET jails. Some of them with
>> a handful, the busiest ones with 50 or more jails each.
>>
>> Every now and then the jails are not reachable over the net,
>> anymore. The server itself is up and running, all jails are
>> up and running, one can ssh to the server but none of the
>> jails can communicate over the network.
>>
>> There seems to be no pattern to the time of occurrence except
>> that more jails on one system make it "more likely".
>> Also having more than one bridge, e.g. for private networks
>> between jails seems to increase the probability.
>> When a server shows the problem it tends to get into the state
>> rather frequently, a couple of hours in between. Then again
>> most servers run for weeks without exhibiting the problem.
>> That's what makes it so hard to reproduce. The last couple of
>> days one system was failing regularly until we reduced the number
>> of jails from around 80 to around 50. Now it seems stable again.
>>
>> I have a test system with lots of jails that I work with gatling
>> that did not show a single failure so far :-(
>>
>>
>> Setup:
>>
>> All jails are iocage jails with VNET interfaces. They are
>> connected to at least one bridge that starts with the
>> physical external interface as a member and gets jails'
>> epair interfaces added as they start up. All jails are managed
>> by iocage.
>>
>> ifconfig_igb0="-rxcsum -rxcsum6 -txcsum -txcsum6 -vlanhwtag -vlanhwtso up"
>> cloned_interfaces="bridge0"
>> ifconfig_bridge0_name="inet0"
>> ifconfig_inet0="addm igb0 up"
>> ifconfig_inet0_ipv6="inet6 <host-address>/64 auto_linklocal"
>>
>> $ iocage get interfaces vpro0087
>> vnet0:inet0
>>
>> $ ifconfig inet0
>> inet0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>     ether 90:1b:0e:63:ef:51
>>     inet6 fe80::921b:eff:fe63:ef51%inet0 prefixlen 64 scopeid 0x4
>>     inet6 <host-address> prefixlen 64
>>     nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
>>     groups: bridge
>>     id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
>>     maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
>>     root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
>>     member: vnet0.4 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
>>             ifmaxaddr 0 port 7 priority 128 path cost 2000
>>     member: vnet0.1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
>>             ifmaxaddr 0 port 6 priority 128 path cost 2000
>>     member: igb0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
>>             ifmaxaddr 0 port 1 priority 128 path cost 2000000
>>
>>
>> What we tried:
>>
>> At first we suspected the bridge to become "wedged" somehow. This was
>> corroborated by talking to various people at devsummits and EuroBSDCon
>> with Kristof Provost specifically suggesting that if_bridge was
>> still under giant lock and there might be a problem here that the
>> lock is not released under some race condition and then the entire
>> bridge subsystem would be stalled. That sounds plausible given the
>> random occurrence.
>>
>> But I think we can rule out that one, because:
>>
>> - ifconfig up/down does not help
>> - the host is still communicating fine over the same bridge interface
>> - tearing down the bridge, kldunload (!) of if_bridge.ko followed by
>>   a new kldload and reconstructing the members with `ifconfig addm`
>>   does not help, either
>> - only a host reboot restores function
>>
>> Finally I created a jail not managed by iocage on the problem host.
>> Please ignore the `iocage` in the path, I used it to populate the
>> root directory. But it is not started by iocage at boot time and
>> the manual config is this:
>>
>> testjail {
>>     host.hostname = "testjail";            # hostname
>>     path = "/iocage/jails/testjail/root";  # root directory
>>     exec.clean;
>>     exec.system_user = "root";
>>     exec.jail_user = "root";
>>     vnet;
>>     vnet.interface = "epair999b";
>>     exec.prestart += "ifconfig epair999 create; ifconfig epair999a inet6 2A00:B580:8000:8000::1/64 auto_linklocal";
>>     exec.poststop += "sleep 2; ifconfig epair999a destroy; sleep 2";
>>     # Standard stuff
>>     exec.start += "/bin/sh /etc/rc";
>>     exec.stop = "/bin/sh /etc/rc.shutdown";
>>     exec.consolelog = "/var/log/jail_testjail_console.log";
>>     mount.devfs;          #mount devfs
>>     allow.raw_sockets;    #allow ping-pong
>>     devfs_ruleset="4";    #devfs ruleset for this jail
>> }
>>
>> $ cat /iocage/jails/testjail/root/etc/rc.conf
>> hostname="testjail"
>>
>> ifconfig_epair999b_ipv6="inet6 2A00:B580:8000:8000::2/64 auto_linklocal"
>>
>> When I do `service jail onestart testjail` I can then ping6 the jail
>> from the host and the host from the jail. As you can see the
>> if_bridge is not involved in this traffic.
>>
>> When the host is in the wedged state and I start this testjail the
>> same way, no communication across the epair interface is possible.
>>
>> To me this seems to indicate that not the bridge but all epair
>> interfaces stop working at the very same time.
>>
>>
>> OS is RELENG_11_3, hardware and specifically network adapters vary,
>> we have igb, ix, ixl, bnxt ...
>>
>>
>> Does anyone have a suggestion what diagnostic measures could help to
>> pinpoint the culprit? The random occurrence and the fact that the
>> problem seems to prefer the production environment only makes this a
>> real pain ...
>>
>>
>> Thanks and kind regards,
>> Patrick
>
>_______________________________________________
>freebsd-net@freebsd.org mailing list
>https://lists.freebsd.org/mailman/listinfo/freebsd-net
>To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Does it work with pf?

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.