Hi,

We are running several OpenStack/OVN regions of different sizes. All of them have external networks connected to the Internet. We receive a lot of packets destined to unused (not yet provisioned) IP addresses, I guess from bots scanning the Internet. This generates a lot of ARP requests which cannot be answered, because those IP addresses are not configured anywhere yet.
A few days ago we upgraded one of our regions from OVN 22.09 to OVN 24.03, and we suddenly started having critical issues with DNS resolution on VMs running in OpenStack. Generally, none of the DNS requests were successful; some replies came back after 5 minutes, sometimes even after 30 minutes. Yes, minutes, not seconds.

After some debugging we identified the problematic OpenFlow flows, which send ARP request packets to the ovn-controllers. Those flows are created because we have around 400 ports in the external network, so the packet flooding flow has to be split. They are installed at the beginning of OpenFlow table 39 with priority 110. The first one includes 170 resubmits:

  cookie=0x28ef9c32, duration=829.596s, table=39, n_packets=117482, n_bytes=4947460, idle_age=0, hard_age=58, priority=110,reg6=0x9001,reg15=0x8000,metadata=0xba actions=load:0->NXM_NX_REG6[],load:0x5a3->NXM_NX_REG15[],resubmit(,41),load:0x21af->NXM_NX_REG15[],resubmit(,41),load:0x8f->NXM_NX_REG15[],resubmit(,41),load:0x1374->NXM_NX_REG15[],resubmit(,41),load:0x5f->NXM_NX_REG15[],resubmit(,41),load:0x10b->NXM_NX_REG15[],resubmit(,41),load:0x106->NXM_NX_REG15[],resubmit(,41),load:0x13d9->NXM_NX_REG15[],resubmit(,41),load:0x4d->NXM_NX_REG15[],resubmit(,41),load:0x2202->NXM_NX_REG15[],resubmit(,41),load:0xb4->NXM_NX_REG15[],resubmit(,41),load:0x25ed->NXM_NX_REG15[],resubmit(,41),load:0x1b59->NXM_NX_REG15[],resubmit(,41),load:0x26b2->NXM_NX_REG15[],resubmit(,41),load:0x6a->NXM_NX_REG15[],resubmit(,41) <<< CUT >>> load:0x169a->NXM_NX_REG15[],resubmit(,41),controller(userdata=00.00.00.1b.00.00.00.00.00.00.90.01.00.00.80.00.27)

There is also a second rule with 170 resubmits and controller() at the end:

  controller(userdata=00.00.00.1b.00.00.00.00.00.00.90.02.00.00.80.00.27)

and a third rule with a smaller number of resubmits and no controller() action. In total we have around 400 resubmits.
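To illustrate the splitting described above, here is a toy model in Python. It is only a sketch under assumptions and not OVN code: the limit of 170 resubmits per flow is simply what we observe in our dump, the port keys are placeholders, and split_flood and packet_ins_per_flooded_packet are hypothetical helper names.

```python
# Toy model of a flood flow that is split into several flows; NOT OVN code.
# Assumption: at most 170 resubmits per flow, as observed in our flow dump.
MAX_RESUBMITS_PER_FLOW = 170


def split_flood(port_keys, chunk=MAX_RESUBMITS_PER_FLOW):
    """Split the port keys into per-flow chunks.

    Every chunk except the last one ends with a controller() action,
    modelling the hand-off of the packet to ovn-controller before the
    next chunk is processed.
    """
    flows = []
    for start in range(0, len(port_keys), chunk):
        part = port_keys[start:start + chunk]
        is_last = start + chunk >= len(port_keys)
        flows.append({"resubmits": len(part), "controller": not is_last})
    return flows


def packet_ins_per_flooded_packet(flows):
    """Each flow that ends with controller() costs one packet-in."""
    return sum(1 for flow in flows if flow["controller"])


# Around 400 ports in the external network, as in our region.
flows = split_flood(list(range(400)))
for flow in flows:
    print(flow)
print("packet-ins per flooded packet:", packet_ins_per_flooded_packet(flows))
```

With 400 ports the model gives three flows (170, 170 and 60 resubmits) and two packet-ins for every flooded packet, which is consistent with the three rules we see in table 39.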
This flow splitting was introduced in version 24.03 by this commit:

  https://github.com/ovn-org/ovn/commit/325c7b203d8bfd12bc1285ad11390c1a55cd6717

This is what we see in the ovn-controller logs:

  2025-02-12T20:35:41.490Z|10791|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=unrecognized(27)| OF_Table_ID=39| OF_Cookie_ID=0x28ef9c32| in-port=60| src-mac=4e:15:bc:ac:36:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=A.A.A.A, dst-ip=B.B.B.B
  2025-02-12T20:35:41.500Z|10792|pinctrl(ovn_pinctrl0)|DBG|pinctrl received packet-in | opcode=unrecognized(27)| OF_Table_ID=39| OF_Cookie_ID=0x28ef9c32| in-port=65533| src-mac=4e:15:bc:ac:36:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=A.A.A.A, dst-ip=B.B.B.B

As you can see, the same packet is looped through the ovn-controller twice, because we have 400 ports and they are covered by three OpenFlow flows. The funny thing is that those packets are dropped at the end of the OpenFlow table chain in the datapath. So they kill the performance of our ovn-controllers only to be dropped in the end. I'm including a small part of the packet trace result here:

  39. reg15=0x8000,metadata=0xba, priority 100, cookie 0x28ef9c32
      set_field:0->reg6
      set_field:0xe8->reg15
      resubmit(,41)
  41. priority 0
      set_field:0->reg0
      set_field:0->reg1
      set_field:0->reg2
      set_field:0->reg3
      set_field:0->reg4
      set_field:0->reg5
      set_field:0->reg6
      set_field:0->reg7
      set_field:0->reg8
      set_field:0->reg9
      resubmit(,42)
  42. metadata=0xba, priority 0, cookie 0x3372823b
      resubmit(,43)
  43. metadata=0xba,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00, priority 110, cookie 0xaabcf4fa
      resubmit(,44)
  44. metadata=0xba, priority 0, cookie 0x9b7d541f
      resubmit(,45)
  45. metadata=0xba, priority 65535, cookie 0xedb6d3de
      resubmit(,46)
  46. metadata=0xba, priority 65535, cookie 0x1dbceae
      resubmit(,47)
  47. metadata=0xba, priority 0, cookie 0xc1c2a264
      resubmit(,48)
  48. metadata=0xba, priority 0, cookie 0x640d65ba
      resubmit(,49)
  49. metadata=0xba, priority 0, cookie 0x78f2abc0
      resubmit(,50)
  50. metadata=0xba, priority 0, cookie 0x7b63c11c
      resubmit(,51)
  51. metadata=0xba,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00, priority 100, cookie 0xb055fd1c
      set_field:0/0x8000000000000000000000000000->xxreg0
      resubmit(,52)
  52. metadata=0xba, priority 0, cookie 0x4dd5d603
      resubmit(,64)
  64. priority 0
      resubmit(,65)
  65. reg15=0xe8,metadata=0xba, priority 100, cookie 0xfab6eb
      clone(ct_clear,set_field:0->reg11,set_field:0->reg12,set_field:0/0xffff->reg13,set_field:0x25b->reg11,set_field:0x30a->reg12,set_field:0x252->metadata,set_field:0x1->reg14,set_field:0->reg10,set_field:0->reg15,set_field:0->reg0,set_field:0->reg1,set_field:0->reg2,set_field:0->reg3,set_field:0->reg4,set_field:0->reg5,set_field:0->reg6,set_field:0->reg7,set_field:0->reg8,set_field:0->reg9,resubmit(,8))
      ct_clear
      set_field:0->reg11
      set_field:0->reg12
      set_field:0/0xffff->reg13
      set_field:0x25b->reg11
      set_field:0x30a->reg12
      set_field:0x252->metadata
      set_field:0x1->reg14
      set_field:0->reg10
      set_field:0->reg15
      set_field:0->reg0
      set_field:0->reg1
      set_field:0->reg2
      set_field:0->reg3
      set_field:0->reg4
      set_field:0->reg5
      set_field:0->reg6
      set_field:0->reg7
      set_field:0->reg8
      set_field:0->reg9
      resubmit(,8)
   8. reg14=0x1,metadata=0x252,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00, priority 50, cookie 0x33587607
      set_field:0xfa163e9f2f460000000000000000/0xffffffffffff0000000000000000->xxreg0
      resubmit(,9)
   9. metadata=0x252, priority 0, cookie 0x671d3d97
      set_field:0x4/0x4->xreg4
      resubmit(,10)
  10. reg9=0x4/0x4,metadata=0x252, priority 100, cookie 0xd21e0659
      resubmit(,79)
      79. reg0=0x2, priority 0
          drop
      resubmit(,11)
  11. arp,metadata=0x252, priority 85, cookie 0xb5758416
      drop

What can we do to improve the handling of those ARP packets so that they are not sent to the ovn-controllers? Maybe they can be dropped somewhere earlier in the table chain? They are requesting a MAC address which OVN doesn't know. Why does it try to flood them to all router ports in the external network? Maybe we can implement this "too big" OpenFlow rule in a different way and loop it inside the fast datapath, if possible?
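To see how much of the packet-in load comes from the split flood flows, the pinctrl DBG lines quoted above can be tallied per opcode and OpenFlow table. Below is a minimal Python sketch, assuming the log format is exactly the one in our excerpt (other OVN versions may print these lines differently); count_packet_ins is a hypothetical helper, not part of OVN.

```python
import re
from collections import Counter

# Matches the pinctrl DBG line format from our log excerpt.
# Assumption: other OVN versions may format these lines differently.
PACKET_IN_RE = re.compile(
    r"pinctrl received packet-in\s*\|\s*opcode=(?P<opcode>[^|]+?)\s*\|"
    r"\s*OF_Table_ID=(?P<table>\d+)"
)


def count_packet_ins(lines):
    """Count packet-ins per (opcode, OpenFlow table) pair."""
    counts = Counter()
    for line in lines:
        match = PACKET_IN_RE.search(line)
        if match:
            key = (match.group("opcode"), int(match.group("table")))
            counts[key] += 1
    return counts


# The two log lines from our ovn-controller log.
sample = [
    "2025-02-12T20:35:41.490Z|10791|pinctrl(ovn_pinctrl0)|DBG|"
    "pinctrl received packet-in | opcode=unrecognized(27)| OF_Table_ID=39| "
    "OF_Cookie_ID=0x28ef9c32| in-port=60| src-mac=4e:15:bc:ac:36:45, "
    "dst-mac=ff:ff:ff:ff:ff:ff| src-ip=A.A.A.A, dst-ip=B.B.B.B",
    "2025-02-12T20:35:41.500Z|10792|pinctrl(ovn_pinctrl0)|DBG|"
    "pinctrl received packet-in | opcode=unrecognized(27)| OF_Table_ID=39| "
    "OF_Cookie_ID=0x28ef9c32| in-port=65533| src-mac=4e:15:bc:ac:36:45, "
    "dst-mac=ff:ff:ff:ff:ff:ff| src-ip=A.A.A.A, dst-ip=B.B.B.B",
]

counts = count_packet_ins(sample)
for (opcode, table), n in counts.items():
    print(opcode, table, n)
```

For the two sample lines this prints "unrecognized(27) 39 2", i.e. both packet-ins come from table 39 with opcode 27.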
I also noticed that IPv6 NS packets are processed via ovn-controller. Why can't OVS create responses inside the fast datapath, in a similar way to how it creates responses to ARP requests for known MACs?

This issue had a big impact on our cloud, because the same ovn-controller thread is responsible for DHCP, DNS interception and IPv6 NS packets, and when those threads were overloaded all of those services stopped working.

Another thing, quite misleading, is the "opcode=unrecognized(27)" in the ovn-controller log. I guess the opcode is unrecognized only because the mentioned commit did not add the new action name mapping somewhere here:

  https://github.com/ovn-org/ovn/blob/ed2790153c07a376890f28b0a16bc321e3af016b/lib/actions.c#L5977

To recover our region we disabled DNS interception and lowered the number of ARP requests by increasing "net.ipv4.neigh.default.retrans_time_ms" on our upstream gateways. Those changes lowered the number of packets sent to the ovn-controllers from around 500 p/s to 200 p/s and stabilized our region. Nevertheless, this OVN performance issue is still there.

Thanks for your attention,
Piotr Misiak