[Yahoo-eng-team] [Bug 1952907] [NEW] Gratuitous ARPs are not sent during master transition

2021-12-01 Thread Damian Dąbrowski
Public bug reported:

* High level description:

When a router transitions to the MASTER state, keepalived should send GARPs, but this fails because the qg-* interface is down (it comes up about 1 second later, so it might be a race condition).
Keepalived should also send another round of GARPs after 60 seconds (garp_master_delay), but it doesn't (probably because the first ones fail, though I'm not 100% sure).

When I add a random port to this router to trigger a keepalived reload,
then all GARPs are sent properly (because the netns is already configured and
the qg-* interface is up the whole time).
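
For reference, a rough way to confirm whether the GARPs actually leave the namespace during the transition (the router ID and the qg-* name below are placeholders taken from my environment):

```
# watch for gratuitous ARPs on the gateway interface inside the router namespace
ip netns exec qrouter-<router-id> tcpdump -lnei qg-4a2f0239-5c arp

# check the administrative/operational state of the gateway interface
ip netns exec qrouter-<router-id> ip -o link show dev qg-4a2f0239-5c
```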


* Pre-conditions:

Operating System: Ubuntu 20.04
Keepalived version: 2.0.19
Affected neutron releases:
  - my AIO env: Xena (master/106fa3e6d3f0b1c32ef28fe9dd6b125b9317e9cf # HEAD as of 29.09.2021)
  - my prod env: Victoria
  - most likely all versions after this change: https://review.opendev.org/c/openstack/neutron/+/707406


* Step-by-step reproduction:

Simply perform a failover on an HA router.
The same result can also be achieved by removing all L3 agents from the router and then adding one back:

# openstack router create neutron-bug
# openstack router set --external-gateway public neutron-bug
# neutron l3-agent-list-hosting-router neutron-bug
# (for all l3 agents): neutron l3-agent-router-remove L3_AGENT_ID neutron-bug
# (for a single l3 agent): neutron l3-agent-router-add L3_AGENT_ID neutron-bug
(GARPs are not sent)
# openstack router add port neutron-bug test-port
(GARPs are sent properly)
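
To see which agent considers itself master and what GARP settings keepalived was started with, the files neutron generates for the router can be inspected (a sketch; /var/lib/neutron/ha_confs is the default ha_confs_path and may differ per deployment):

```
# current VRRP state as reported to the l3 agent
cat /var/lib/neutron/ha_confs/<router-id>/state

# GARP-related settings in the generated keepalived config
grep -E 'garp|vrrp_instance|interface' /var/lib/neutron/ha_confs/<router-id>/keepalived.conf
```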

* Expected output:

Gratuitous ARPs should be sent from the router's namespace during the MASTER
transition.


* Actual output:

Gratuitous ARPs are not sent.
Keepalived complains: "Error 100 (Network is down) sending gratuitous ARP on qg-4a2f0239-5c for 172.29.249.194".
The qg-* interface comes up about 1 second after keepalived tries to send the GARPs.
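
The ~1 second gap can be observed by correlating keepalived's log timestamps with link-state events inside the namespace (a rough sketch; log locations and syslog identifiers depend on the deployment):

```
# stream timestamped link-state changes inside the router namespace
ip netns exec qrouter-<router-id> ip -ts monitor link

# in another terminal, follow the keepalived VRRP logs during the failover
journalctl -f -t Keepalived_vrrp
```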


* Attachments:

Keepalived logs: https://paste.openstack.org/raw/811372/
Interfaces inside the router's netns + tcpdump from the master transition: https://paste.openstack.org/raw/811373/

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1952907





[Yahoo-eng-team] [Bug 2047182] [NEW] BFV VM may be unexpectedly moved to different AZ

2023-12-21 Thread Damian Dąbrowski
Public bug reported:

In cases where:
- each availability zone has a separate storage cluster (the [cinder]/cross_az_attach option helps to achieve that), and
- there is no default_schedule_zone,
a VM may be unexpectedly moved to a different AZ.

When a VM is created from a pre-existing volume, nova places the specific
availability zone in request_specs, which prevents the VM from being moved
to a different AZ during resize/migrate[1]. In this case, everything works
fine.

Unfortunately, problems start in the following cases:
a) the VM is created with the --boot-from-volume argument, which dynamically creates a volume for the VM
b) the VM has only an ephemeral volume

Let's focus on case a), because case b) may not be working "by design".
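
For illustration, case a) can be reproduced roughly like this (flavor/image/network names are placeholders; the environment needs at least two AZs, cross_az_attach disabled, and no default_schedule_zone):

```
# boot-from-volume server created without an explicit availability zone
openstack server create --flavor m1.small --image cirros \
  --boot-from-volume 10 --network private bfv-vm

# note which AZ the scheduler picked
openstack server show bfv-vm -c OS-EXT-AZ:availability_zone

# resize (or migrate) -- the scheduler is free to pick a host in another AZ
openstack server resize --flavor m1.medium bfv-vm
openstack server show bfv-vm -c OS-EXT-AZ:availability_zone
```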

The _get_volume_from_bdms() method considers only pre-existing volumes[2]. A volume that will be created later on by `--boot-from-volume` does not exist yet, so nova cannot fetch its availability zone.
As a result, request_specs contains '"availability_zone": null' and the VM can be moved to a different AZ during resize/migrate. Because storage is not shared between AZs, this breaks the VM.
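
The missing AZ can also be confirmed directly in the API database (a sketch; the instance UUID is a placeholder and the request_specs table lives in the nova_api database):

```
# spec is a JSON blob; for a VM created as above it contains "availability_zone": null
mysql nova_api -e \
  "SELECT spec FROM request_specs WHERE instance_uuid='<instance-uuid>'\G"
```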

It's not easy to fix because:
- the nova API is not aware of the designated AZ at the time the request_specs are placed in the DB
- looking at the schedule_and_build_instances method[3], we do not create the cinder volumes before downcalling to the compute agent, and we do not allow upcalls from the compute agent to the API DB in general, so it's hard to update request_specs after the volume is created.

Unfortunately, at this point I don't see any easy way to fix this issue.

[1] 
https://github.com/openstack/nova/blob/d28a55959e50b472e181809b919e11a896f989e3/nova/compute/api.py#L1268C19
[2] 
https://github.com/openstack/nova/blob/d28a55959e50b472e181809b919e11a896f989e3/nova/compute/api.py#L1247
[3] 
https://github.com/openstack/nova/blob/master/nova/conductor/manager.py#L1646

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2047182





[Yahoo-eng-team] [Bug 2077879] [NEW] [OVN] Router gateway does not have a chassis defined

2024-08-26 Thread Damian Dąbrowski
Public bug reported:

# Problem Description

I have:

- 2 VMs connected to inner-net
- inner-router (with its default gateway in outer-net and a port in inner-net)
- outer-router (with its default gateway in the public network and a port in outer-net)

NOTE: I don't have any static routes defined for these routers.

Graphical visualization can be found here:
https://i.ibb.co/gzjd604/Screenshot-from-2024-08-20-13-26-55.png

This scenario works perfectly fine with the OVS ML2 driver (VMs have Internet connectivity), but not with OVN.

I noticed that the gateway port for inner-router is DOWN (you can see this in the screenshot above), which looks quite suspicious.

I applied https://review.opendev.org/c/openstack/neutron/+/907504 but it didn't solve the problem.

# Further Investigation

I noticed that inner-router's gateway interface does not have a chassis
assigned:

```
router 7a5baad4-657d-42fc-bf35-1b8e4115050e (neutron-028eb3f7-af0b-4080-87d6-e84b24675b6d) (aka inner-router)
    port lrp-a221d264-8fa3-4430-99f7-f453887b96aa
        mac: "fa:16:3e:af:b0:ae"
        networks: ["10.10.0.60/24"]
    port lrp-9ac7815d-75dc-4198-aa94-bfe5ad5431e2
        mac: "fa:16:3e:05:30:57"
        networks: ["10.0.0.1/24"]
    nat 8fca2dfd-2284-4e18-98be-137606f0f0b9
        external ip: "10.10.0.60"
        logical ip: "10.0.0.0/24"
        type: "snat"
```

I fixed it with `ovn-nbctl lrp-set-gateway-chassis
lrp-a221d264-8fa3-4430-99f7-f453887b96aa
efcb326f-f18c-4b65-9da9-260dd0e2e603`.

Now everything looks good. Internet connectivity is working and the neutron
gateway port (10.10.0.60) is ACTIVE instead of DOWN.
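
To check the chassis assignment before/after the workaround, something like this can be used (the lrp- name is the one from the `ovn-nbctl show` output above; ovn-sbctl needs access to the southbound DB):

```
# which chassis (if any) is set as gateway chassis for the router port
ovn-nbctl lrp-get-gateway-chassis lrp-a221d264-8fa3-4430-99f7-f453887b96aa

# names/hostnames of the available chassis
ovn-sbctl --columns=name,hostname list Chassis

# neutron's view: the gateway port should flip from DOWN to ACTIVE
openstack port list --router inner-router --long
```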

# How to reproduce the issue

(assuming you have a test environment with a 'public' network already defined)

```
openstack network create outer-net --external --disable-port-security
openstack subnet create --network outer-net --subnet-range 10.10.0.0/24 outer-subnet
openstack router create outer-router --external-gateway public
openstack router add subnet outer-router outer-subnet

openstack network create inner-net --disable-port-security
openstack subnet create --network inner-net --subnet-range 10.0.0.0/24 inner-subnet
openstack router create --external-gateway outer-net inner-router
openstack router add subnet inner-router inner-subnet

openstack server create \
  --network inner-net \
  --image 'cirros' \
  --flavor 'tempest1' \
  vm-inner-1
```

Then, log in to vm-inner-1 and try to ping 8.8.8.8.
For OVS it works, for OVN it doesn't.
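
Until the root cause is fixed, the workaround from above can be generalized as follows (a sketch; the gateway port UUID comes from the router's external_gateway_info and the chassis name from the southbound DB):

```
# gateway port UUID of the affected router
openstack router show inner-router -f json -c external_gateway_info

# pick a chassis name
ovn-sbctl --columns=name,hostname list Chassis

# the OVN logical router port is named "lrp-<gateway-port-uuid>"
ovn-nbctl lrp-set-gateway-chassis lrp-<gateway-port-uuid> <chassis-name> 1
```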

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2077879
