After investigating further, I believe I am hitting the following issue: https://bugs.launchpad.net/neutron/+bug/1995078
Essentially, the external port and the LRP are scheduled separately and without coordination. Because of this, if these ports are scheduled on different chassis, the ARP request is dropped. I will need to build and test this fix and will follow up with a conclusion.

From: Austin Cormier <acorm...@juniper.net>
Date: Thursday, February 1, 2024 at 5:39 PM
To: ovs-discuss@openvswitch.org <ovs-discuss@openvswitch.org>
Subject: OVN/OVS no ARP response for internal router interface from external port

Looking for some help troubleshooting why OVS is not generating a response for my internal router port on a VLAN tenant network. I’ve dug down as far as I am reasonably able to but need a quick boost here.

The ARP request is coming from an external system which is on the appropriate VLAN for the tenant network. East/West traffic is working as expected, as I am able to communicate successfully with another VM on that VLAN tenant network. It APPEARS that the flow never gets generated on br-int in the appropriate controller.

I am going to walk through my debugging steps, starting from the NB database and ending at OVS on the controller that should be generating the ARP response.

In OVN NB, I have a router with a port. This is my internal gateway interface of 192.168.5.1:

-----
router 21cd6ac3-4804-4c68-a683-9bba07d97967 (neutron-5d87debf-cf0b-4fac-ba49-01b7680368aa) (aka vlan_test)
    port lrp-8c020ac1-ae54-4aa7-a143-4440067e9f42
        mac: "fa:16:3e:37:43:b3"
        networks: ["192.168.5.1/24"]
    port lrp-e18d09db-19d8-4362-8252-751e6974ef5e
        mac: "fa:16:3e:37:67:a3"
        networks: ["10.27.14.50/23"]
        gateway chassis: [infra-prod-controller-02 infra-prod-controller-01 infra-prod-controller-03]
    nat 40605ce2-3f93-4877-ac26-47e4b257fa5f
        external ip: "10.27.14.50"
        logical ip: "192.168.5.0/24"
        type: "snat"
----

The logical switch for the tenant network is connected to a localnet with tag 1106 and has the external port for my baremetal device and the appropriate router port.
----
switch be7c870d-6d9c-471f-8996-e48a551068a0 (neutron-fcc39c5e-d3d8-4400-b5cf-b10f76d0112b) (aka vlan_test)
    port provnet-5ddf8b20-c849-484a-ad8f-a86a9856c2b2
        type: localnet
        tag: 1106
        addresses: ["unknown"]
    port 1f01c94e-f32f-4e94-b02a-813bb1ad4a47
        addresses: ["unknown"]
    port 85aa9a1e-cb84-4137-97ce-85958a948390
        addresses: ["fa:16:3e:ca:6e:3b 192.168.5.188"]
    port 8c020ac1-ae54-4aa7-a143-4440067e9f42
        type: router
        router-port: lrp-8c020ac1-ae54-4aa7-a143-4440067e9f42
    port 19279d0c-7c9a-498e-a3e5-269933c49df6
        type: localport
        addresses: ["fa:16:3e:4a:14:28 192.168.5.2"]
    port 4c9047c4-a6c7-4b27-9cfa-58b5d30ce964
        type: external
        addresses: ["90:ec:77:32:e6:6e 192.168.5.56"]
----

In OVN SB, I can see that the external port (4c9047c4-a6c7-4b27-9cfa-58b5d30ce964) has been scheduled on infra-prod-controller-02. This is important because the ARP response would only get generated from a single HA Chassis.

---
Chassis infra-prod-controller-02
    hostname: infra-prod-controller-02
    Encap geneve
        ip: "10.27.12.24"
        options: {csum="true"}
    Port_Binding cr-lrp-20ba6028-7220-4c8d-a20f-9e4c416da3f7
    Port_Binding "71e436bb-7121-473e-a024-e34d4d7f4a4f"
    Port_Binding cr-lrp-c03b5dd9-92e1-4046-be1c-a953c0fab238
    Port_Binding "f8eb9e30-e65f-44c4-94b6-a67700790880"
    Port_Binding cr-lrp-ae2f5dbb-2cd0-44d0-9061-71c8186440be
    Port_Binding "4c9047c4-a6c7-4b27-9cfa-58b5d30ce964"
    Port_Binding "eb4435ad-37f2-44f9-a786-470b18bb9f0d"
    Port_Binding cr-lrp-950eec85-b785-474b-837b-4ecbbcf080c9
    Port_Binding "f3bbbe9a-a1a7-44b5-b6bc-b00a351ca1a5"
    Port_Binding "79430028-7ae3-448c-bc12-c9d7d44d218b"
---

In OVN SB again, I can issue a trace command to verify that the Logical Flow exists to generate the ARP response:

---
# ovn-trace neutron-fcc39c5e-d3d8-4400-b5cf-b10f76d0112b 'inport == "provnet-5ddf8b20-c849-484a-ad8f-a86a9856c2b2" && eth.src == 90:ec:77:32:e6:6e && eth.dst == ff:ff:ff:ff:ff:ff && arp.tpa == 192.168.5.1 && arp.spa == 192.168.5.178 && arp.op == 1 && arp.tha == ff:ff:ff:ff:ff:ff && arp.sha == 90:ec:77:32:e6:6e'
…
ingress(dp="vlan_test", inport="lrp-8c020a")
--------------------------------------------
 0. lr_in_admission (northd.c:12885): eth.mcast && inport == "lrp-8c020a", priority 50, uuid faac4787
    xreg0[0..47] = fa:16:3e:37:43:b3;
    next;
 1. lr_in_lookup_neighbor (northd.c:13142): inport == "lrp-8c020a" && arp.spa == 192.168.5.0/24 && arp.tpa == 192.168.5.1 && arp.op == 1, priority 110, uuid d447a962
    reg9[2] = lookup_arp(inport, arp.spa, arp.sha);
    /* MAC binding to ff:ff:ff:ff:ff:ff found. */
    reg9[3] = 1;
    next;
 2. lr_in_learn_neighbor (northd.c:13078): reg9[2] == 1 || reg9[3] == 0, priority 100, uuid 6aad6f8d
    mac_cache_use;
    next;
 3. lr_in_ip_input (northd.c:12440): inport == "lrp-8c020a" && arp.op == 1 && arp.tpa == 192.168.5.1 && arp.spa == 192.168.5.0/24 && is_chassis_resident("cr-lrp-e18d09"), priority 90, uuid 6187e537
    eth.dst = eth.src;
    eth.src = xreg0[0..47];
    arp.op = 2;
    arp.tha = arp.sha;
    arp.sha = xreg0[0..47];
    arp.tpa <-> arp.spa;
    outport = inport;
    flags.loopback = 1;
    output;
---

Moving into OVS land on infra-prod-controller-02, where the external port is scheduled, I am able to see the ARP requests entering br-ex. However, there is no response!

---
ovs-tcpdump -i br-ex -nn -e -v
tcpdump: listening on mibr-ex, link-type EN10MB (Ethernet), snapshot length 262144 bytes
22:05:48.581916 90:ec:77:32:e6:6e > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 64: vlan 1106, p 0, ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.5.1 (ff:ff:ff:ff:ff:ff) tell 192.168.5.56, length 46
---

The flow exists on br-ex (with no reply).
---
# ovs-appctl dpif/dump-flows --names br-ex
recirc_id(0),in_port(bond0),ct_state(-new-est-rel-rpl-inv-trk),ct_mark(0/0x1),eth(src=90:ec:77:32:e6:6e,dst=ff:ff:ff:ff:ff:ff),eth_type(0x8100),vlan(vid=1106,pcp=0),encap(eth_type(0x0806),arp(sip=192.168.5.56,tip=192.168.5.1,op=1/0xff,sha=90:ec:77:32:e6:6e)), packets:2425, bytes:155200, used:0.633s, actions:mibond0,br-ex,pop_vlan,tapfcc39c5e-d0
---

I do NOT see the flow in `br-int`.

---
(openvswitch-vswitchd)[root@infra-prod-controller-02 /]# ovs-appctl dpif/dump-flows --names br-int
recirc_id(0),tunnel(tun_id=0x0,src=10.27.12.21,dst=10.27.12.24,flags(-df+csum+key)),in_port(genev_sys_6081),eth(src=1e:4d:dd:80:c3:ed),eth_type(0x0800),ipv4(proto=17,frag=no),udp(dst=3784), packets:6468, bytes:426888, used:0.068s, actions:userspace(pid=4294967295,slow_path(bfd))
recirc_id(0),tunnel(tun_id=0x0,src=10.27.12.13,dst=10.27.12.24,flags(-df+csum+key)),in_port(genev_sys_6081),eth(src=9a:b2:a9:fd:a6:5b),eth_type(0x0800),ipv4(proto=17,frag=no),udp(dst=3784), packets:6456, bytes:426096, used:0.860s, actions:userspace(pid=4294967295,slow_path(bfd))
recirc_id(0),tunnel(tun_id=0x0,src=10.27.12.22,dst=10.27.12.24,flags(-df+csum+key)),in_port(genev_sys_6081),eth(src=66:90:cc:1e:a5:b6),eth_type(0x0800),ipv4(proto=17,frag=no),udp(dst=3784), packets:6452, bytes:425832, used:0.716s, actions:userspace(pid=4294967295,slow_path(bfd))
recirc_id(0),tunnel(tun_id=0x0,src=10.27.12.12,dst=10.27.12.24,flags(-df+csum+key)),in_port(genev_sys_6081),eth(src=96:54:aa:d3:29:b9),eth_type(0x0800),ipv4(proto=17,frag=no),udp(dst=3784), packets:6453, bytes:425898, used:0.729s, actions:userspace(pid=4294967295,slow_path(bfd))
recirc_id(0),tunnel(tun_id=0x0,src=10.27.12.15,dst=10.27.12.24,flags(-df+csum+key)),in_port(genev_sys_6081),eth(src=8e:0b:17:26:df:ef),eth_type(0x0800),ipv4(proto=17,frag=no),udp(dst=3784), packets:6469, bytes:426954, used:0.609s, actions:userspace(pid=4294967295,slow_path(bfd))
recirc_id(0),tunnel(tun_id=0x0,src=10.27.12.19,dst=10.27.12.24,flags(-df+csum+key)),in_port(genev_sys_6081),eth(src=5a:56:62:47:6f:21),eth_type(0x0800),ipv4(proto=17,frag=no),udp(dst=3784), packets:6457, bytes:426162, used:0.265s, actions:userspace(pid=4294967295,slow_path(bfd))
recirc_id(0),tunnel(tun_id=0x0,src=10.27.12.11,dst=10.27.12.24,flags(-df+csum+key)),in_port(genev_sys_6081),eth(src=ae:36:b4:02:1c:67),eth_type(0x0800),ipv4(proto=17,frag=no),udp(dst=3784), packets:6462, bytes:426492, used:0.253s, actions:userspace(pid=4294967295,slow_path(bfd))
---

I confirmed that tapfcc39c5e-d0 is indeed in the `br-int` Bridge. It is not marked internal for some reason, but it is unclear whether that is by design.

---
# ovs-vsctl show
678a6be7-51d4-44d4-9366-67311227cdbb
    Bridge br-int
        fail_mode: secure
        datapath_type: system
        …
        Port tapfcc39c5e-d0
            Interface tapfcc39c5e-d0
---

When I issue the ofproto/trace commands, it appears that OVS IS generating a response, but it is definitely not being sent back out of the system:

---
Trace ARP Request entering br-ex
# ovs-appctl ofproto/trace --names br-ex in_port=1,dl_vlan=1106,dl_src=90:ec:77:32:e6:6e,dl_dst=ff:ff:ff:ff:ff:ff,dl_type=0x0806,arp_spa=192.168.5.56,arp_tpa=192.168.5.1,arp_op=1,arp_sha=90:ec:77:32:e6:6e,arp_tha=ff:ff:ff:ff:ff:ff
…
Final flow: unchanged
Megaflow: recirc_id=0,ct_state=-new-est-rel-rpl-inv-trk,ct_mark=0/0x1,eth,arp,in_port=bond0,dl_vlan=1106,dl_vlan_pcp=0,dl_src=90:ec:77:32:e6:6e,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=192.168.5.56,arp_tpa=192.168.5.1,arp_op=1,arp_sha=90:ec:77:32:e6:6e
Datapath actions: mibond0,br-ex,pop_vlan,tapfcc39c5e-d0
---

---
Trace ARP Request entering br-int
# ovs-appctl ofproto/trace --names br-int in_port=tapfcc39c5e-d0,dl_src=90:ec:77:32:e6:6e,dl_dst=ff:ff:ff:ff:ff:ff,dl_type=0x0806,arp_spa=192.168.5.56,arp_tpa=192.168.5.1,arp_op=1,arp_sha=90:ec:77:32:e6:6e,arp_tha=ff:ff:ff:ff:ff:ff
…
Final flow: arp,reg0=0x300,reg10=0x401,reg11=0x2d,reg12=0x2c,reg13=0x28,reg14=0x2,reg15=0x2,metadata=0x16,in_port="tapfcc39c5e-d0",vlan_tci=0x0000,dl_src=fa:16:3e:37:43:b3,dl_dst=90:ec:77:32:e6:6e,arp_spa=192.168.5.1,arp_tpa=192.168.5.56,arp_op=2,arp_sha=fa:16:3e:37:43:b3,arp_tha=90:ec:77:32:e6:6e
Megaflow: recirc_id=0,ct_state=-new-est-rel-rpl-inv-trk,ct_mark=0/0x1,eth,arp,in_port="tapfcc39c5e-d0",dl_src=90:ec:77:32:e6:6e,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=192.168.5.56,arp_tpa=192.168.5.1,arp_op=1,arp_sha=90:ec:77:32:e6:6e,arp_tha=ff:ff:ff:ff:ff:ff
Datapath actions: set(eth(src=fa:16:3e:37:43:b3,dst=90:ec:77:32:e6:6e)),set(arp(sip=192.168.5.1,tip=192.168.5.56,op=2,sha=fa:16:3e:37:43:b3,tha=90:ec:77:32:e6:6e)),tapfcc39c5e-d0
This flow is handled by the userspace slow path because it:
  - Uses action(s) not supported by datapath.
---

At this point, I would expect to see these responses actually hitting tapfcc39c5e-d0, but they do not seem to be present. It’s not clear yet whether the above “Uses action(s) not supported by datapath” is an issue.
---
# ovs-tcpdump -i tapfcc39c5e-d0 -nn -v -e
tcpdump: listening on ovsmi100537, link-type EN10MB (Ethernet), snapshot length 262144 bytes
22:25:06.677218 90:ec:77:32:e6:6e > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.5.1 (ff:ff:ff:ff:ff:ff) tell 192.168.5.56, length 46
---

If the packets WERE to hit tapfcc39c5e-d0, it appears that the system would send the response out bond0 as expected:

---
# ovs-appctl ofproto/trace --names br-int in_port=tapfcc39c5e-d0,dl_type=0x0806,dl_src=fa:16:3e:37:43:b3,dl_dst=90:ec:77:32:e6:6e,arp_spa=192.168.5.1,arp_tpa=192.168.5.56,arp_op=2,arp_sha=fa:16:3e:37:43:b3,arp_tha=90:ec:77:32:e6:6e
…
Final flow: arp,reg0=0x300,reg10=0x400,reg11=0x2d,reg12=0x2c,reg13=0x1b,reg14=0x2,reg15=0x8001,metadata=0x16,in_port="tapfcc39c5e-d0",vlan_tci=0x0000,dl_src=fa:16:3e:37:43:b3,dl_dst=90:ec:77:32:e6:6e,arp_spa=192.168.5.1,arp_tpa=192.168.5.56,arp_op=2,arp_sha=fa:16:3e:37:43:b3,arp_tha=90:ec:77:32:e6:6e
Megaflow: recirc_id=0,ct_state=-new-est-rel-rpl-inv-trk,ct_mark=0/0x1,eth,arp,in_port="tapfcc39c5e-d0",dl_src=fa:16:3e:37:43:b3,dl_dst=90:ec:77:32:e6:6e,arp_tpa=192.168.5.56,arp_op=2
Datapath actions: push_vlan(vid=1106,pcp=0),bond0,mibond0
---

Any insight on what could be happening here or how to debug further would be GREATLY appreciated.

-Austin
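
For anyone wanting to repeat the chassis comparison from the Southbound step above, here is a minimal sketch in Python. It assumes the listing is available as text in the same layout as the `Chassis` block quoted earlier (for example, saved from `ovn-sbctl show`, which is an assumption about where that output came from). The function names are hypothetical, the sample text is only an excerpt of that block, and the full chassis-redirect port name is an assumption built from the `cr-lrp-e18d09` abbreviation in the ovn-trace output and the `lrp-e18d09db-...` port in the NB listing.

```python
# Excerpt of the Southbound listing quoted in this thread (not the full output).
SB_SHOW_EXCERPT = '''\
Chassis infra-prod-controller-02
    hostname: infra-prod-controller-02
    Encap geneve
        ip: "10.27.12.24"
        options: {csum="true"}
    Port_Binding cr-lrp-20ba6028-7220-4c8d-a20f-9e4c416da3f7
    Port_Binding "4c9047c4-a6c7-4b27-9cfa-58b5d30ce964"
    Port_Binding cr-lrp-950eec85-b785-474b-837b-4ecbbcf080c9
'''


def bindings_by_chassis(sb_show_text):
    """Map each Port_Binding name to the Chassis it is listed under."""
    chassis = None
    bindings = {}
    for raw_line in sb_show_text.splitlines():
        line = raw_line.strip()
        if line.startswith("Chassis "):
            chassis = line.split(None, 1)[1].strip('"')
        elif line.startswith("Port_Binding ") and chassis is not None:
            port = line.split(None, 1)[1].strip('"')
            bindings[port] = chassis
    return bindings


def co_located(bindings, port_a, port_b):
    """True only if both ports are found and are bound to the same chassis."""
    chassis_a = bindings.get(port_a)
    chassis_b = bindings.get(port_b)
    return chassis_a is not None and chassis_a == chassis_b


bindings = bindings_by_chassis(SB_SHOW_EXCERPT)
external_port = "4c9047c4-a6c7-4b27-9cfa-58b5d30ce964"
# Assumed full name of the port abbreviated as "cr-lrp-e18d09" by ovn-trace.
cr_lrp = "cr-lrp-e18d09db-19d8-4362-8252-751e6974ef5e"

print("external port chassis:", bindings.get(external_port))
print("cr-lrp chassis:", bindings.get(cr_lrp))
print("co-located:", co_located(bindings, external_port, cr_lrp))
```

With only this excerpt as input, the external port resolves to infra-prod-controller-02 and the chassis-redirect port is not found, so the check reports False. A listing covering every chassis would be needed to say where that port is actually bound.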
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss