[Yahoo-eng-team] [Bug 1952907] [NEW] Gratuitous ARPs are not sent during master transition
Public bug reported:

* High level description:
When a router transitions to the MASTER state, keepalived should send GARPs, but this fails because the qg-* interface is down (it comes up about 1 second later, so it might be a race condition). Keepalived should also send another round of GARPs after 60 seconds (garp_master_delay), but it doesn't (probably because the first ones fail, though I'm not 100% sure). When I add a random port to the router to trigger keepalived's reload, all GARPs are sent properly (because the netns is already configured and the qg-* interface is up the whole time).

* Pre-conditions:
Operating System: Ubuntu 20.04
Keepalived version: 2.0.19
Affected neutron releases:
- my AIO env: Xena (master/106fa3e6d3f0b1c32ef28fe9dd6b125b9317e9cf # HEAD as of 29.09.2021)
- my prod env: Victoria
- most likely all versions after this change: https://review.opendev.org/c/openstack/neutron/+/707406

* Step-by-step reproduction:
Simply perform a failover on an HA router. The same result may also be achieved by removing all l3 agents from the router and then adding one back:

# openstack router create neutron-bug
# openstack router set --external-gateway public neutron-bug
# neutron l3-agent-list-hosting-router neutron-bug
# (for all l3 agents): neutron l3-agent-router-remove L3_AGENT_ID neutron-bug
# (for a single l3 agent): neutron l3-agent-router-add L3_AGENT_ID neutron-bug
(GARPs are not sent)
# openstack router add port neutron-bug test-port
(GARPs are sent properly)

* Expected output:
Gratuitous ARPs should be sent from the router's namespace during the MASTER transition.

* Actual output:
Gratuitous ARPs are not sent. Keepalived complains:
Error 100 (Network is down) sending gratuitous ARP on qg-4a2f0239-5c for 172.29.249.194
The qg-* interface comes up about 1 second after keepalived tries to send the GARPs.

* Attachments:
Keepalived logs: https://paste.openstack.org/raw/811372/
Interfaces inside the router's netns + tcpdump from the master transition: https://paste.openstack.org/raw/811373/

** Affects: neutron
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1952907
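As a rough way to confirm the race (not part of the original report): once the qg-* interface is up, a GARP can be sent by hand from the router namespace with arping. The interface name and IP below come from the logs above; the router ID placeholder and the ha_confs path are illustrative and depend on the deployment:

```
# inspect the garp settings keepalived was given (path assumes the default neutron state_path)
sudo grep garp /var/lib/neutron/ha_confs/<router-id>/keepalived.conf
# verify the gateway interface is finally UP inside the router namespace
sudo ip netns exec qrouter-<router-id> ip -o link show qg-4a2f0239-5c
# send gratuitous ARPs manually (-A = gratuitous ARP reply mode)
sudo ip netns exec qrouter-<router-id> arping -A -I qg-4a2f0239-5c -c 3 172.29.249.194
```

If the manual arping succeeds seconds after keepalived's own GARPs failed, that points at the ordering between namespace/interface setup and keepalived's MASTER transition.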
[Yahoo-eng-team] [Bug 2047182] [NEW] BFV VM may be unexpectedly moved to different AZ
Public bug reported:

In cases where:
- each availability zone has a separate storage cluster (the [cinder]/cross_az_attach option helps to achieve that), and
- there is no default_schedule_zone,
a VM may be unexpectedly moved to a different AZ.

When a VM is created from a pre-existing volume, nova places the specific availability zone in request_specs, which prevents the VM from being moved to a different AZ during resize/migrate[1]. In this case, everything works fine. Unfortunately, problems start in the following cases:

a) the VM is created with the --boot-from-volume argument, which dynamically creates a volume for the VM
b) the VM has only an ephemeral volume

Let's focus on case a), because option b) may not be working "by design". The _get_volume_from_bdms() method considers only pre-existing volumes[2]. The volume that will be created later on with `--boot-from-volume` does not exist yet, so nova cannot fetch its availability zone. As a result, request_specs contains '"availability_zone": null' and the VM can be moved to a different AZ during resize/migrate. Because storage is not shared between AZs, this breaks the VM.

It's not easy to fix because:
- the nova API is not aware of the designated AZ at the time of placing request_specs in the DB
- looking at the schedule_and_build_instances method[3], we do not create the cinder volumes before downcalling to the compute agent, and we do not allow upcalls from the compute agent to the API DB in general, so it's hard to update request_specs after the volume is created.

Unfortunately, at this point I don't see any easy way to fix this issue.

[1] https://github.com/openstack/nova/blob/d28a55959e50b472e181809b919e11a896f989e3/nova/compute/api.py#L1268C19
[2] https://github.com/openstack/nova/blob/d28a55959e50b472e181809b919e11a896f989e3/nova/compute/api.py#L1247
[3] https://github.com/openstack/nova/blob/master/nova/conductor/manager.py#L1646

** Affects: nova
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/2047182
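A minimal reproduction sketch of case a), assuming two AZs (az1, az2) backed by separate cinder clusters, [cinder]cross_az_attach=False, and no default_schedule_zone; the flavor, image, and server name are hypothetical:

```
# no --availability-zone is passed, so request_specs records "availability_zone": null
openstack server create --flavor m1.small --image cirros \
    --boot-from-volume 10 bfv-vm
# the scheduler picks an AZ (say az1) and the root volume is created in az1's cluster,
# but since no AZ is pinned in request_specs, a later migration may pick az2:
openstack server migrate bfv-vm
# if the target host is in az2, the VM ends up separated from its volume's storage
```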
[Yahoo-eng-team] [Bug 2077879] [NEW] [OVN] Router gateway does not have a chassis defined
Public bug reported:

# Problem Description

I have:
- 2 VMs connected to inner-net
- inner-router (with a default gateway in outer-net and a port in inner-net)
- outer-router (with a default gateway in the public network and a port in outer-net)

NOTE: I don't have any static routes defined for these routers.

A graphical visualization can be found here: https://i.ibb.co/gzjd604/Screenshot-from-2024-08-20-13-26-55.png

This scenario works perfectly fine with the OVS ML2 driver (VMs have Internet connectivity), but not with OVN. I noticed that the gateway port of inner-router is DOWN (you can see this in the above screenshot), which looks quite suspicious. I applied https://review.opendev.org/c/openstack/neutron/+/907504 but it didn't solve the problem.

# Further Investigation

I noticed that inner-router's gateway interface does not have a chassis assigned:

```
router 7a5baad4-657d-42fc-bf35-1b8e4115050e (neutron-028eb3f7-af0b-4080-87d6-e84b24675b6d) (aka inner-router)
    port lrp-a221d264-8fa3-4430-99f7-f453887b96aa
        mac: "fa:16:3e:af:b0:ae"
        networks: ["10.10.0.60/24"]
    port lrp-9ac7815d-75dc-4198-aa94-bfe5ad5431e2
        mac: "fa:16:3e:05:30:57"
        networks: ["10.0.0.1/24"]
    nat 8fca2dfd-2284-4e18-98be-137606f0f0b9
        external ip: "10.10.0.60"
        logical ip: "10.0.0.0/24"
        type: "snat"
```

I fixed it with `ovn-nbctl lrp-set-gateway-chassis lrp-a221d264-8fa3-4430-99f7-f453887b96aa efcb326f-f18c-4b65-9da9-260dd0e2e603`. Now everything looks good: Internet connectivity is working and the neutron gateway port (10.10.0.60) is ACTIVE instead of DOWN.

# How to reproduce the issue

(assuming that you have a test environment with a 'public' network already defined)

```
openstack network create outer-net --external --disable-port-security
openstack subnet create --network outer-net --subnet-range 10.10.0.0/24 outer-subnet
openstack router create outer-router --external-gateway public
openstack router add subnet outer-router outer-subnet
openstack network create inner-net --disable-port-security
openstack subnet create --network inner-net --subnet-range 10.0.0.0/24 inner-subnet
openstack router create --external-gateway outer-net inner-router
openstack router add subnet inner-router inner-subnet
openstack server create \
  --network inner-net \
  --image 'cirros' \
  --flavor 'tempest1' \
  vm-inner-1
```

Then, log in to vm-inner-1 and try to ping 8.8.8.8. With OVS it works; with OVN it doesn't.

** Affects: neutron
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/2077879
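For completeness, a hedged sketch of how one might diagnose and apply the same manual fix (the lrp UUID is the one from the report above; the chassis name and the priority value are placeholders):

```
# print the gateway chassis bound to the router port; empty output means none is assigned
ovn-nbctl lrp-get-gateway-chassis lrp-a221d264-8fa3-4430-99f7-f453887b96aa
# list the available chassis to choose from
ovn-sbctl --columns=name,hostname list Chassis
# bind the port to a chassis (the trailing 10 is an arbitrary example priority)
ovn-nbctl lrp-set-gateway-chassis lrp-a221d264-8fa3-4430-99f7-f453887b96aa <chassis-name> 10
```

This is only a workaround: neutron's OVN driver is expected to schedule a gateway chassis itself when the router gateway is set, which is what fails to happen here.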