[ovs-discuss] ARP request packets put high pressure on the pinctrl thread in ovn-controller

Piotr Misiak via discuss Fri, 14 Feb 2025 06:57:41 -0800

Hi,

We are running several OpenStack/OVN regions of different sizes.
All of them have external networks connected to the Internet.
We receive a lot of packets destined to unused (unprovisioned)
IP addresses, presumably from bots scanning the Internet.
This generates a lot of ARP requests that can never be answered,
because those IP addresses are not configured anywhere yet.


A few days ago we upgraded one of our regions from OVN 22.09 to OVN
24.03 and suddenly started having critical issues with DNS
resolution on VMs running in OpenStack.
Practically none of the DNS requests succeeded; some replies came
back after 5 minutes, sometimes even after 30 minutes. Yes, minutes,
not seconds.

After some debugging we identified the problematic OpenFlow flows,
which send ARP request packets to ovn-controller.
These flows are created because we have around 400 ports in the
external network, so the packet flooding flow has to be split.
They are installed at the top of OpenFlow table 39 with priority 110,
and the one shown below includes 170 resubmits:

cookie=0x28ef9c32, duration=829.596s, table=39, n_packets=117482,
n_bytes=4947460, idle_age=0, hard_age=58,
priority=110,reg6=0x9001,reg15=0x8000,metadata=0xba
actions=load:0->NXM_NX_REG6[],load:0x5a3->NXM_NX_REG15[],resubmit(,41),load:0x21af->NXM_NX_REG15[],resubmit(,41),load:0x8f->NXM_NX_REG15[],resubmit(,41),load:0x1374->NXM_NX_REG15[],resubmit(,41),load:0x5f->NXM_NX_REG15[],resubmit(,41),load:0x10b->NXM_NX_REG15[],resubmit(,41),load:0x106->NXM_NX_REG15[],resubmit(,41),load:0x13d9->NXM_NX_REG15[],resubmit(,41),load:0x4d->NXM_NX_REG15[],resubmit(,41),load:0x2202->NXM_NX_REG15[],resubmit(,41),load:0xb4->NXM_NX_REG15[],resubmit(,41),load:0x25ed->NXM_NX_REG15[],resubmit(,41),load:0x1b59->NXM_NX_REG15[],resubmit(,41),load:0x26b2->NXM_NX_REG15[],resubmit(,41),load:0x6a->NXM_NX_REG15[],resubmit(,41)
<<< CUT >>>
load:0x169a->NXM_NX_REG15[],resubmit(,41),controller(userdata=00.00.00.1b.00.00.00.00.00.00.90.01.00.00.80.00.27)

There is also a second rule with 170 resubmits and controller() at the end:
controller(userdata=00.00.00.1b.00.00.00.00.00.00.90.02.00.00.80.00.27)

and a third rule with a smaller number of resubmits and no
controller() action. In total we have around 400 resubmits.
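
The controller() userdata can be unpacked to see what is handed to
ovn-controller. Below is a minimal Python sketch. The field layout
(32-bit opcode, 32-bit padding, two 32-bit values, one trailing byte)
is an assumption inferred only from the values that also appear in the
flow match and in the logs; it is not taken from the OVN source, and
the function name is hypothetical.

```python
import struct

def decode_userdata(dotted_hex):
    """Decode controller(userdata=...) bytes under an ASSUMED layout."""
    raw = bytes.fromhex(dotted_hex.replace(".", ""))
    # Assumed big-endian layout: opcode, padding, reg6 value,
    # reg15 value, OpenFlow table id (17 bytes in total).
    opcode, _pad, reg6, reg15, table = struct.unpack(">IIIIB", raw)
    return opcode, reg6, reg15, table

opcode, reg6, reg15, table = decode_userdata(
    "00.00.00.1b.00.00.00.00.00.00.90.01.00.00.80.00.27")
print(opcode, hex(reg6), hex(reg15), table)  # 27 0x9001 0x8000 39
```

Under this reading the opcode is 27, the two values equal the
reg6=0x9001 and reg15=0x8000 seen in the flow match, and the last byte
is table 39.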

This was introduced in version 24.03 by this commit:
https://github.com/ovn-org/ovn/commit/325c7b203d8bfd12bc1285ad11390c1a55cd6717

What we see in the ovn-controller logs:

2025-02-12T20:35:41.490Z|10791|pinctrl(ovn_pinctrl0)|DBG|pinctrl received  packet-in | opcode=unrecognized(27)| OF_Table_ID=39| OF_Cookie_ID=0x28ef9c32| in-port=60| src-mac=4e:15:bc:ac:36:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=A.A.A.A, dst-ip=B.B.B.B
2025-02-12T20:35:41.500Z|10792|pinctrl(ovn_pinctrl0)|DBG|pinctrl received  packet-in | opcode=unrecognized(27)| OF_Table_ID=39| OF_Cookie_ID=0x28ef9c32| in-port=65533| src-mac=4e:15:bc:ac:36:45, dst-mac=ff:ff:ff:ff:ff:ff| src-ip=A.A.A.A, dst-ip=B.B.B.B

As you can see, the same packet is looped through ovn-controller
twice. That is because we have 400 ports, which are covered by three
OpenFlow flows.
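
The arithmetic behind "three flows, two controller round trips" can be
sketched as follows. This is only an illustration: the chunk size of
170 is the number of resubmits observed above, the port list is made
up, and it assumes that every flow except the last one ends with a
controller() action, as in the three rules described earlier.

```python
def split_into_flows(ports, chunk_size=170):
    """Split a port list into per-flow chunks of at most chunk_size."""
    return [ports[i:i + chunk_size]
            for i in range(0, len(ports), chunk_size)]

ports = list(range(400))          # roughly 400 ports, as in our network
flows = split_into_flows(ports)
# Every flow but the last hands the packet back to ovn-controller.
controller_round_trips = len(flows) - 1
print([len(f) for f in flows], controller_round_trips)  # [170, 170, 60] 2
```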

The funny thing is that those packets are dropped at the end of the
OpenFlow table chain in the datapath. So they kill the performance of
our ovn-controllers only to be dropped in the end.
Here is a small part of the packet trace result:

39. reg15=0x8000,metadata=0xba, priority 100, cookie 0x28ef9c32
    set_field:0->reg6
    set_field:0xe8->reg15
    resubmit(,41)
    41. priority 0
            set_field:0->reg0
            set_field:0->reg1
            set_field:0->reg2
            set_field:0->reg3
            set_field:0->reg4
            set_field:0->reg5
            set_field:0->reg6
            set_field:0->reg7
            set_field:0->reg8
            set_field:0->reg9
            resubmit(,42)
        42. metadata=0xba, priority 0, cookie 0x3372823b
            resubmit(,43)
        43. metadata=0xba,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00, priority 110, cookie 0xaabcf4fa
            resubmit(,44)
        44. metadata=0xba, priority 0, cookie 0x9b7d541f
            resubmit(,45)
        45. metadata=0xba, priority 65535, cookie 0xedb6d3de
            resubmit(,46)
        46. metadata=0xba, priority 65535, cookie 0x1dbceae
            resubmit(,47)
        47. metadata=0xba, priority 0, cookie 0xc1c2a264
            resubmit(,48)
        48. metadata=0xba, priority 0, cookie 0x640d65ba
            resubmit(,49)
        49. metadata=0xba, priority 0, cookie 0x78f2abc0
            resubmit(,50)
        50. metadata=0xba, priority 0, cookie 0x7b63c11c
            resubmit(,51)
        51. metadata=0xba,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00, priority 100, cookie 0xb055fd1c
            set_field:0/0x8000000000000000000000000000->xxreg0
            resubmit(,52)
        52. metadata=0xba, priority 0, cookie 0x4dd5d603
            resubmit(,64)
        64. priority 0
            resubmit(,65)
        65. reg15=0xe8,metadata=0xba, priority 100, cookie 0xfab6eb
            clone(ct_clear,set_field:0->reg11,set_field:0->reg12,set_field:0/0xffff->reg13,set_field:0x25b->reg11,set_field:0x30a->reg12,set_field:0x252->metadata,set_field:0x1->reg14,set_field:0->reg10,set_field:0->reg15,set_field:0->reg0,set_field:0->reg1,set_field:0->reg2,set_field:0->reg3,set_field:0->reg4,set_field:0->reg5,set_field:0->reg6,set_field:0->reg7,set_field:0->reg8,set_field:0->reg9,resubmit(,8))
            ct_clear
            set_field:0->reg11
            set_field:0->reg12
            set_field:0/0xffff->reg13
            set_field:0x25b->reg11
            set_field:0x30a->reg12
            set_field:0x252->metadata
            set_field:0x1->reg14
            set_field:0->reg10
            set_field:0->reg15
            set_field:0->reg0
            set_field:0->reg1
            set_field:0->reg2
            set_field:0->reg3
            set_field:0->reg4
            set_field:0->reg5
            set_field:0->reg6
            set_field:0->reg7
            set_field:0->reg8
            set_field:0->reg9
            resubmit(,8)
         8. reg14=0x1,metadata=0x252,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00, priority 50, cookie 0x33587607
            set_field:0xfa163e9f2f460000000000000000/0xffffffffffff0000000000000000->xxreg0
            resubmit(,9)
         9. metadata=0x252, priority 0, cookie 0x671d3d97
            set_field:0x4/0x4->xreg4
            resubmit(,10)
        10. reg9=0x4/0x4,metadata=0x252, priority 100, cookie 0xd21e0659
            resubmit(,79)
            79. reg0=0x2, priority 0
                    drop
            resubmit(,11)
        11. arp,metadata=0x252, priority 85, cookie 0xb5758416
            drop


What can we do to improve the handling of those ARP packets so that
they are not sent to ovn-controller?
Could they be dropped somewhere earlier in the table chain? They
request a MAC address that OVN doesn't know. Why does OVN try to
flood them to all router ports in the external network?
Could this "too big" OpenFlow rule be implemented in a different way,
looping inside the fast datapath, if possible?
I also noticed that IPv6 NS packets are processed via ovn-controller.
Why can't OVS create the responses inside the fast datapath, similar
to the way it creates responses to ARP requests for known MACs?

This issue had a big impact on our cloud, because the same
ovn-controller thread is responsible for DHCP, DNS interception and
IPv6 NS packets, and when that thread was overloaded none of those
services were working.

Another thing, quite misleading, is the "opcode=unrecognized(27)"
in the ovn-controller log. I guess it is unrecognized only because
the mentioned commit did not add the new action name mapping
somewhere here:
https://github.com/ovn-org/ovn/blob/ed2790153c07a376890f28b0a16bc321e3af016b/lib/actions.c#L5977

To recover our region we disabled DNS interception and lowered the
number of ARP requests by increasing
"net.ipv4.neigh.default.retrans_time_ms" on our upstream gateways.
These changes lowered the number of packets sent to ovn-controller
from around 500 p/s to 200 p/s and stabilized our region.
Nevertheless, this OVN performance issue is still there.
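
For anyone who wants to measure the packet-in rate on their own
nodes, here is a minimal Python sketch that counts pinctrl packet-in
log lines per second and per opcode. It assumes debug logging is
enabled so that lines like the ones quoted above are written, and that
each log entry is on a single line; the function name is hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the timestamp and the opcode of a pinctrl packet-in line.
PACKET_IN = re.compile(
    r"^(?P<ts>[^|]+)\|.*packet-in \| opcode=(?P<opcode>[^|]+)\|")

def packet_ins_per_second(lines):
    """Count packet-in log lines keyed by (HH:MM:SS, opcode)."""
    counts = Counter()
    for line in lines:
        m = PACKET_IN.search(line)
        if not m:
            continue  # not a packet-in line
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ")
        counts[(ts.strftime("%H:%M:%S"), m.group("opcode").strip())] += 1
    return counts

sample = [
    "2025-02-12T20:35:41.490Z|10791|pinctrl(ovn_pinctrl0)|DBG|pinctrl "
    "received  packet-in | opcode=unrecognized(27)| OF_Table_ID=39| in-port=60",
    "2025-02-12T20:35:41.500Z|10792|pinctrl(ovn_pinctrl0)|DBG|pinctrl "
    "received  packet-in | opcode=unrecognized(27)| OF_Table_ID=39| in-port=65533",
]
for key, n in packet_ins_per_second(sample).items():
    print(key, n)  # ('20:35:41', 'unrecognized(27)') 2
```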

Thanks for your attention,
Piotr Misiak