[Yahoo-eng-team] [Bug 1952907] [NEW] Gratuitous ARPs are not sent during master transition

2021-12-01 Thread Damian Dąbrowski
Public bug reported:

* High level description:

When a router transitions to the MASTER state, keepalived should send GARPs, but this fails because the qg-* interface is down (it comes up about 1 second later, so it might be a race condition).
Keepalived should also send another round of GARPs after 60 seconds (garp_master_delay), but it doesn't (probably because the first ones fail, though I'm not 100% sure).

When I add a random port to this router to trigger a keepalived reload,
then all GARPs are sent properly (because the netns is already configured and
the qg-* interface is up the whole time).
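
For reference, a rough way to confirm whether the GARPs actually leave the namespace during the transition (the router ID and the qg-* name below are placeholders taken from my environment):

```
# watch for gratuitous ARPs on the gateway interface inside the router namespace
ip netns exec qrouter-<router-id> tcpdump -lnei qg-4a2f0239-5c arp

# check the administrative/operational state of the gateway interface
ip netns exec qrouter-<router-id> ip -o link show dev qg-4a2f0239-5c
```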


* Pre-conditions:

Operating System: Ubuntu 20.04
Keepalived version: 2.0.19
Affected neutron releases:
  - my AIO env: Xena (master/106fa3e6d3f0b1c32ef28fe9dd6b125b9317e9cf # HEAD as of 29.09.2021)
  - my prod env: Victoria
  - most likely all versions after this change: https://review.opendev.org/c/openstack/neutron/+/707406


* Step-by-step reproduction:

Simply perform a failover on an HA router.
The same result can also be achieved by removing all L3 agents from the router and then adding one back:

# openstack router create neutron-bug
# openstack router set --external-gateway public neutron-bug
# neutron l3-agent-list-hosting-router neutron-bug
# (for all l3 agents): neutron l3-agent-router-remove L3_AGENT_ID neutron-bug
# (for a single l3 agent): neutron l3-agent-router-add L3_AGENT_ID neutron-bug
(GARPs are not sent)
# openstack router add port neutron-bug test-port
(GARPs are sent properly)
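
To see which agent considers itself master and what GARP settings keepalived was started with, the files neutron generates for the router can be inspected (a sketch; /var/lib/neutron/ha_confs is the default ha_confs_path and may differ per deployment):

```
# current VRRP state as reported to the l3 agent
cat /var/lib/neutron/ha_confs/<router-id>/state

# GARP-related settings in the generated keepalived config
grep -E 'garp|vrrp_instance|interface' /var/lib/neutron/ha_confs/<router-id>/keepalived.conf
```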

* Expected output:

Gratuitous ARPs should be sent from the router's namespace during the MASTER
transition.


* Actual output:

Gratuitous ARPs are not sent.
Keepalived complains: "Error 100 (Network is down) sending gratuitous ARP on qg-4a2f0239-5c for 172.29.249.194".
The qg-* interface comes up about 1 second after keepalived tries to send the GARPs.
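
The ~1 second gap can be observed by correlating keepalived's log timestamps with link-state events inside the namespace (a rough sketch; log locations and syslog identifiers depend on the deployment):

```
# stream timestamped link-state changes inside the router namespace
ip netns exec qrouter-<router-id> ip -ts monitor link

# in another terminal, follow the keepalived VRRP logs during the failover
journalctl -f -t Keepalived_vrrp
```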


* Attachments:

Keepalived logs: https://paste.openstack.org/raw/811372/
Interfaces inside the router's netns + tcpdump from the master transition: https://paste.openstack.org/raw/811373/

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1952907





[Yahoo-eng-team] [Bug 2047182] [NEW] BFV VM may be unexpectedly moved to different AZ

2023-12-21 Thread Damian Dąbrowski
Public bug reported:

In cases where:
- each availability zone has a separate storage cluster (the [cinder]/cross_az_attach option helps to achieve that), and
- there is no default_schedule_zone,
a VM may be unexpectedly moved to a different AZ.

When a VM is created from a pre-existing volume, nova places the specific
availability zone in request_specs, which prevents the VM from being moved
to a different AZ during resize/migrate[1]. In this case, everything works
fine.

Unfortunately, problems start in the following cases:
a) the VM is created with the --boot-from-volume argument, which dynamically creates a volume for the VM
b) the VM has only an ephemeral volume

Let's focus on case a), because case b) may not be working "by design".
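
For illustration, case a) can be reproduced roughly like this (flavor/image/network names are placeholders; the environment needs at least two AZs, cross_az_attach disabled, and no default_schedule_zone):

```
# boot-from-volume server created without an explicit availability zone
openstack server create --flavor m1.small --image cirros \
  --boot-from-volume 10 --network private bfv-vm

# note which AZ the scheduler picked
openstack server show bfv-vm -c OS-EXT-AZ:availability_zone

# resize (or migrate) -- the scheduler is free to pick a host in another AZ
openstack server resize --flavor m1.medium bfv-vm
openstack server show bfv-vm -c OS-EXT-AZ:availability_zone
```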

The _get_volume_from_bdms() method considers only pre-existing volumes[2]. A volume that will be created later on by `--boot-from-volume` does not exist yet, so nova cannot fetch its availability zone.
As a result, request_specs contains '"availability_zone": null' and the VM can be moved to a different AZ during resize/migrate. Because storage is not shared between AZs, this breaks the VM.
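
The missing AZ can also be confirmed directly in the API database (a sketch; the instance UUID is a placeholder and the request_specs table lives in the nova_api database):

```
# spec is a JSON blob; for a VM created as above it contains "availability_zone": null
mysql nova_api -e \
  "SELECT spec FROM request_specs WHERE instance_uuid='<instance-uuid>'\G"
```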

It's not easy to fix because:
- the nova API is not aware of the designated AZ at the time the request_specs are placed in the DB
- looking at the schedule_and_build_instances method[3], we do not create the cinder volumes before downcalling to the compute agent, and we do not allow upcalls from the compute agent to the API DB in general, so it's hard to update request_specs after the volume is created.

Unfortunately, at this point I don't see any easy way to fix this issue.

[1] 
https://github.com/openstack/nova/blob/d28a55959e50b472e181809b919e11a896f989e3/nova/compute/api.py#L1268C19
[2] 
https://github.com/openstack/nova/blob/d28a55959e50b472e181809b919e11a896f989e3/nova/compute/api.py#L1247
[3] 
https://github.com/openstack/nova/blob/master/nova/conductor/manager.py#L1646

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2047182





[Yahoo-eng-team] [Bug 2077879] [NEW] [OVN] Router gateway does not have a chassis defined

2024-08-26 Thread Damian Dąbrowski
Public bug reported:

# Problem Description

I have:

- 2 VMs connected to inner-net
- inner-router (with its default gateway in outer-net and a port in inner-net)
- outer-router (with its default gateway in the public network and a port in outer-net)

NOTE: I don't have any static routes defined for these routers.

Graphical visualization can be found here:
https://i.ibb.co/gzjd604/Screenshot-from-2024-08-20-13-26-55.png

This scenario works perfectly fine with the OVS ML2 driver (VMs have Internet connectivity), but not with OVN.

I noticed that the gateway port for inner-router is DOWN (you can see this in the screenshot above), which looks quite suspicious.

I applied https://review.opendev.org/c/openstack/neutron/+/907504 but it didn't solve the problem.

# Further Investigation

I noticed that inner-router's gateway interface does not have a chassis
assigned:

```
router 7a5baad4-657d-42fc-bf35-1b8e4115050e (neutron-028eb3f7-af0b-4080-87d6-e84b24675b6d) (aka inner-router)
    port lrp-a221d264-8fa3-4430-99f7-f453887b96aa
        mac: "fa:16:3e:af:b0:ae"
        networks: ["10.10.0.60/24"]
    port lrp-9ac7815d-75dc-4198-aa94-bfe5ad5431e2
        mac: "fa:16:3e:05:30:57"
        networks: ["10.0.0.1/24"]
    nat 8fca2dfd-2284-4e18-98be-137606f0f0b9
        external ip: "10.10.0.60"
        logical ip: "10.0.0.0/24"
        type: "snat"
```

I fixed it with `ovn-nbctl lrp-set-gateway-chassis
lrp-a221d264-8fa3-4430-99f7-f453887b96aa
efcb326f-f18c-4b65-9da9-260dd0e2e603`.

Now everything looks good. Internet connectivity is working and the neutron
gateway port (10.10.0.60) is ACTIVE instead of DOWN.
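
To check the chassis assignment before/after the workaround, something like this can be used (the lrp- name is the one from the `ovn-nbctl show` output above; ovn-sbctl needs access to the southbound DB):

```
# which chassis (if any) is set as gateway chassis for the router port
ovn-nbctl lrp-get-gateway-chassis lrp-a221d264-8fa3-4430-99f7-f453887b96aa

# names/hostnames of the available chassis
ovn-sbctl --columns=name,hostname list Chassis

# neutron's view: the gateway port should flip from DOWN to ACTIVE
openstack port list --router inner-router --long
```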

# How to reproduce the issue

(assuming you have a test environment with a 'public' network already defined)

```
openstack network create outer-net --external --disable-port-security
openstack subnet create --network outer-net --subnet-range 10.10.0.0/24 outer-subnet
openstack router create outer-router --external-gateway public
openstack router add subnet outer-router outer-subnet

openstack network create inner-net --disable-port-security
openstack subnet create --network inner-net --subnet-range 10.0.0.0/24 inner-subnet
openstack router create --external-gateway outer-net inner-router
openstack router add subnet inner-router inner-subnet

openstack server create \
  --network inner-net \
  --image 'cirros' \
  --flavor 'tempest1' \
  vm-inner-1
```

Then, log in to vm-inner-1 and try to ping 8.8.8.8.
For OVS it works, for OVN it doesn't.
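
Until the root cause is fixed, the workaround from above can be generalized as follows (a sketch; the gateway port UUID comes from the router's external_gateway_info and the chassis name from the southbound DB):

```
# gateway port UUID of the affected router
openstack router show inner-router -f json -c external_gateway_info

# pick a chassis name
ovn-sbctl --columns=name,hostname list Chassis

# the OVN logical router port is named "lrp-<gateway-port-uuid>"
ovn-nbctl lrp-set-gateway-chassis lrp-<gateway-port-uuid> <chassis-name> 1
```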

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2077879
