Hi,

I would like to know if somebody is already working on nftables?

Recently, I had a scalability problem on big hosts with a lot of VM interfaces.
This was a host with 500 VMs with 3 interfaces per VM (so 1500 tap interfaces + 1500 fwbr + 1500 ).

The problems:

- ebtables-restore-legacy is not able to import a big ruleset (it seems to work with ebtables-restore-nft).
  https://bugzilla.proxmox.com/show_bug.cgi?id=3909

- pve-firewall rule generation takes 100% CPU for 5s (on a new 3GHz EPYC server), and iptables-save/restore is slow too (but working). With the 10s interval of pve-firewall, the pve-firewall process is running at 100% almost all the time.

- With the current 1 fwbr per interface, when a broadcast (like ARP) goes to the main bridge, the packet is duplicated/forwarded to each fwbr. The current ARP forwarding only uses 1 ksoftirqd with a slow CPU path (I have checked with "perf record"). With a lot of fwbr, I had a ksoftirqd at 100% with packet loss.
  (200 original ARP requests/s * 500 fwbr => 100000 ARP requests/s to handle)

I have looked at nftables, and I think that everything is ready in the kernel now (the last missing part was bridge conntrack, available from kernel 5.3).

Here is a basic example, with conntrack at the bridge level and the vmap feature to match on the interface:
#!/usr/sbin/nft -f

flush ruleset

table inet filter {
        chain input {
                type filter hook input priority 0; policy accept;
                log flags all prefix "host in"
        }
        chain forward {
                type filter hook forward priority 0; policy accept;
                log flags all prefix "host forward (routing)"
        }
        chain output {
                type filter hook output priority 0; policy accept;
                log flags all prefix "host output"
        }
}

table bridge filter {
        chain forward {
                type filter hook forward priority 0; policy accept;
                ct state established,related accept
                log flags all prefix "bridge forward"
                iifname vmap { tap100i0 : jump tap100i0-out , tap105i0 : jump tap105i0-out }
                oifname vmap { tap100i0 : jump tap100i0-in , tap105i0 : jump tap105i0-in }
        }
        chain tap100i0-in {
                log flags all prefix "tap100i0-in"
                ether type arp accept
                drop
        }
        chain tap100i0-out {
                log flags all prefix "tap100i0-out"
                ether type arp accept
                return
        }
        chain tap105i0-in {
                log flags all prefix "tap105i0-in"
                ether type arp accept
        }
        chain tap105i0-out {
                log flags all prefix "tap105i0-out"
                ether type arp accept
                return
        }
}

Also, I think we could avoid the use of the fwbr in some cases.

AFAIK, the fwbr is only needed for host->vm, because without fwbr, we only see the packet in the host output chain (or forward for a routed setup), without the destination tap interface (only the destination bridge and destination IP).

ex: routed setup:

10.3.94.11----->10.3.94.1(vmbr0)--- (vmbr1)192.168.0.1-----vm(192.168.0.10)

kernel: [28341.361776] forward hostIN=eth0 OUT=vmbr1 MACSRC=f2:42:cf:23:12:88 MACDST=24:8a:07:9a:2a:f2 MACPROTO=0800 SRC=10.3.94.11 DST=192.168.0.10 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=39355 DF PROTO=ICMP TYPE=8 CODE=0 ID=48423 SEQ=1

With the fwbr, we can match the packet twice: in the host output/forward, and in the bridge forward.

I'm not able to reproduce this with nftables :(

I see 1 possible clean workaround:

- Don't set up the IP address on the bridge directly, but on a veth pair instead. This way, we see veth source and tap destination in bridge forward.
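To make the veth idea a bit more concrete, here is a rough sketch (not tested on a real host) of the iproute2 commands it would need, generated from Python so they can be reviewed before being applied. The names vmbr1, 192.168.0.1 and veth_host come from the routed example in this mail; the peer name veth_br, the /24 prefix and the helper function itself are only assumptions for illustration. It also assumes that the bridge itself no longer carries the IP address.

```python
def veth_setup_commands(bridge, host_cidr, host_if="veth_host", bridge_if="veth_br"):
    """Return iproute2 commands that put the host IP on a veth pair
    attached to `bridge`, instead of on the bridge itself.

    `bridge_if` (the bridge-side end of the pair) is a hypothetical name.
    Nothing is executed here; the caller decides whether to run the commands.
    """
    return [
        # Create the veth pair: one end for the host, one end for the bridge.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", bridge_if],
        # Enslave the bridge-side end to the bridge, like a tap interface.
        ["ip", "link", "set", bridge_if, "master", bridge],
        # The host IP lives on the host-side end, not on the bridge.
        ["ip", "addr", "add", host_cidr, "dev", host_if],
        # Bring both ends up.
        ["ip", "link", "set", bridge_if, "up"],
        ["ip", "link", "set", host_if, "up"],
    ]


if __name__ == "__main__":
    # Print the commands for the routed example (prefix length is assumed).
    for cmd in veth_setup_commands("vmbr1", "192.168.0.1/24"):
        print(" ".join(cmd))
```

With this layout, host traffic enters the bridge through the bridge-side veth end, so the bridge forward chain should see a real input port and the tap as output port, which is what the vmap lookup needs.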
(Some users had problems at Hetzner with the fwbr bridge sending packets with its own MAC; putting the host IP on a veth pair should avoid this bug.)
But for users, that means a manual network config change, or maybe some ifupdown2 tricks or an automatic config rewrite.

ex: routed setup

bridge forward IN=veth_host OUT=tap100i0 MACSRC=9a:cd:90:f8:f5:3e MACDST=04:05:df:12:85:55 MACPROTO=0800 SRC=10.3.94.11 DST=192.168.0.10 LEN=84 TOS=0x00 PREC=0x00 TTL=63 ID=10333 DF PROTO=ICMP TYPE=8 CODE=0 ID=46306 SEQ=1

I don't know if it's possible to get the fwbr trick working, but in this case:

- Keep a fwbr bridge, but only 1 per vmbr where an IP is set up (or for Open vSwitch too). This is more transparent to implement at VM start/stop (but we still need to match the packet twice).

For other cases (pure bridging), I think we don't need fwbr at all. This should avoid extra CPU cycles and make network throughput faster too.

Any opinion about this? Has somebody already done tests with nftables?

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel