[Yahoo-eng-team] [Bug 2083858] [NEW] nova allows AZ's to be renamed if instances are shelved

2024-10-07 Thread sean mooney
Public bug reported:

Downstream we had a bug report where live migration was failing on the
AZ filter after a customer renamed an AZ.

https://bugzilla.redhat.com/show_bug.cgi?id=2303395

Nova does not support renaming AZs in general, or moving hosts with
instances between AZs.

Five years ago, as part of https://bugs.launchpad.net/nova/+bug/1378904, we made
the API reject renames when instances were on hosts:
https://github.com/openstack/nova/commit/8e19ef4173906da0b7c761da4de0728a2fd71e24

We have since closed another edge case with
https://github.com/openstack/nova/commit/3c0eadae0b9ec48586087ea6c0c4e9176f0aa3bc

In both cases we missed the fact that if an instance is pinned to an AZ
and then shelved it won't be considered as "on a host", so the safety
checks we added for updating the AZ metadata or adding/removing hosts
from an AZ do not account for shelved instances.

If a shelved instance is pinned to a host or AZ and you update the host
membership, rename the AZ, or delete the AZ, it is then possible to
unshelve the instance, but the request spec will still refer to the now
deleted/renamed AZ.

It is only possible to unshelve today on master because we have removed
the AZ filter, and when using placement the host aggregate AZ will not
have changed its UUID even if the AZ name has changed.

We have two potential issues that we should fix.

When updating an AZ name we should check whether it is referenced in any
request spec of any non-deleted instance.

When adding or removing a host from a host aggregate (with AZ metadata) we
should check whether any request spec refers to the host and, if it does,
whether the AZ in the request spec is still compatible.
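
As a rough illustration of the first check, here is a minimal sketch with
hypothetical helper and field names (the real fix would live in the aggregate
update API and use nova's RequestSpec objects):

def az_is_referenced(az_name, request_specs):
    # True if any non-deleted instance pins az_name in its request spec.
    return any(spec.availability_zone == az_name for spec in request_specs)


def rename_az(aggregate, new_name, request_specs):
    old_name = aggregate.metadata.get('availability_zone')
    if old_name and old_name != new_name and az_is_referenced(old_name,
                                                              request_specs):
        # mirror the existing "hosts with instances" safety check, but also
        # cover shelved (offloaded) instances that are not on any host.
        raise ValueError('AZ %s is still referenced by request specs of '
                         'non-deleted instances' % old_name)
    aggregate.metadata['availability_zone'] = new_name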

This bug is primarily about the first case, as I was able to reproduce
that with Horizon.

The second case is speculation that I believe could happen, and we should
consider it when fixing: either prove it can happen and address it in this
bug or in a separate one, or note that it is blocked for some other reason
or is otherwise out of scope.

** Affects: nova
 Importance: Medium
 Status: Triaged


** Tags: availability-zones placement

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2083858

Title:
  nova allows AZ's to be renamed if instances are shelved

Status in OpenStack Compute (nova):
  Triaged

Bug description:
  Downstream we had a bug report where live migration was failing on the
  AZ filter after a customer renamed an AZ.

  https://bugzilla.redhat.com/show_bug.cgi?id=2303395

  Nova does not support renaming AZs in general, or moving hosts with
  instances between AZs.

  Five years ago, as part of https://bugs.launchpad.net/nova/+bug/1378904, we made
  the API reject renames when instances were on hosts:
  https://github.com/openstack/nova/commit/8e19ef4173906da0b7c761da4de0728a2fd71e24

  We have since closed another edge case with
  https://github.com/openstack/nova/commit/3c0eadae0b9ec48586087ea6c0c4e9176f0aa3bc

  In both cases we missed the fact that if an instance is pinned to an AZ
  and then shelved it won't be considered as "on a host", so the safety
  checks we added for updating the AZ metadata or adding/removing hosts
  from an AZ do not account for shelved instances.

  If a shelved instance is pinned to a host or AZ and you update the host
  membership, rename the AZ, or delete the AZ, it is then possible to
  unshelve the instance, but the request spec will still refer to the now
  deleted/renamed AZ.

  It is only possible to unshelve today on master because we have removed
  the AZ filter, and when using placement the host aggregate AZ will not
  have changed its UUID even if the AZ name has changed.

  We have two potential issues that we should fix.

  When updating an AZ name we should check whether it is referenced in any
  request spec of any non-deleted instance.

  When adding or removing a host from a host aggregate (with AZ metadata)
  we should check whether any request spec refers to the host and, if it
  does, whether the AZ in the request spec is still compatible.

  This bug is primarily about the first case, as I was able to reproduce
  that with Horizon.

  The second case is speculation that I believe could happen, and we
  should consider it when fixing: either prove it can happen and address
  it in this bug or in a separate one, or note that it is blocked for some
  other reason or is otherwise out of scope.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2083858/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2085124] [NEW] HTTP exception thrown: Flavor has hw:virtio_packed_ring extra spec explicitly set to True, conflicting with image which has hw_virtio_packed_ring explicitly set to

2024-10-21 Thread sean mooney
Public bug reported:

The flavor/image conflict check for the virtio packed ring format
does not correctly convert the values to booleans when comparing them;
as a result, the comparison is case sensitive when it should not be.
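
A minimal sketch of the missing normalisation, assuming oslo.utils is
available; where exactly the check lives in nova is not shown here:

from oslo_utils import strutils

def packed_ring_conflict(flavor_value, image_value):
    # normalise both sides to booleans so "True"/"true"/"1"/"yes" compare
    # equal instead of doing a case-sensitive string comparison.
    flavor_bool = strutils.bool_from_string(flavor_value, strict=True)
    image_bool = strutils.bool_from_string(image_value, strict=True)
    return flavor_bool != image_bool

# packed_ring_conflict("True", "true") -> False, i.e. no conflict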

** Affects: nova
 Importance: Low
 Status: Triaged


** Tags: api libvirt

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2085124

Title:
  HTTP exception thrown: Flavor has hw:virtio_packed_ring extra spec
  explicitly set to True, conflicting with image which has
  hw_virtio_packed_ring explicitly set to true.

Status in OpenStack Compute (nova):
  Triaged

Bug description:
  The flavor/image conflict check for the virtio packed ring format
  does not correctly convert the values to booleans when comparing them;
  as a result, the comparison is case sensitive when it should not be.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2085124/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2078669] Re: Specify --availability-zone=nova does not work since caracal

2024-10-17 Thread sean mooney
Since the introduction of AZs, Nova has documented that pinning to the
default 'nova' AZ should never be done:

https://docs.openstack.org/nova/latest/admin/availability-zones.html

"""
The use of the default availability zone name in requests can be very 
error-prone. Since the user can see the list of availability zones, they have 
no way to know whether the default availability zone name (currently nova) is 
provided because a host belongs to an aggregate whose AZ metadata key is set to 
nova, or because there is at least one host not belonging to any aggregate. 
Consequently, it is highly recommended for users to never ever ask for booting 
an instance by specifying an explicit AZ named nova and for operators to never 
set the AZ metadata for an aggregate to nova. This can result in some problems 
due to the fact that the instance AZ information is explicitly attached to nova 
which could break further move operations when either the host is moved to 
another aggregate or when the user would like to migrate the instance.
"""

While it is possible to do, it is generally considered unsupported and
incorrect by the Nova core team.

Horizon has historically been the leading culprit for people actually
using 'nova' in the request, as it places it in the dropdown when creating
a VM.

This is a Horizon bug, and doing this has never been a valid approach.

You do not need to specify an AZ when creating a VM in Nova, and you
should not in this case.

** Changed in: nova
   Status: In Progress => Opinion

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2078669

Title:
  Specify --availability-zone=nova does not work since caracal

Status in OpenStack Compute (nova):
  Opinion

Bug description:
  openstack server create with --availability-zone=nova can place an
  instance onto a compute node which belongs to a non-default AZ since the
  Caracal release. The issue was introduced by the removal of the nova
  AvailabilityZoneFilter
  https://review.opendev.org/c/openstack/nova/+/886779.

  When using placement to filter AZs we add member_of=XXX, where XXX is
  the aggregate UUID which corresponds to the AZ. In the case of the
  default AZ (nova) there is no aggregate at all. As a result we request
  computes without any aggregate/AZ filtering.
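
  For illustration only, a sketch of how the placement query differs
  (member_of is a real placement query parameter; the helper itself is made
  up):

  def allocation_candidates_params(resources, az_aggregate_uuid=None):
      params = {'resources': resources}        # e.g. 'VCPU:1,MEMORY_MB:512'
      if az_aggregate_uuid:
          # AZ backed by an aggregate: restrict candidates to its members.
          params['member_of'] = az_aggregate_uuid
      # default AZ "nova": no aggregate exists, so no member_of filter is
      # added and every compute node remains a candidate.
      return params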

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2078669/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2076089] Re: admin cannot force instance launch on disabled host

2024-10-17 Thread sean mooney
The disable feature on the compute service is intended to prevent any scheduling
of new workloads to a disabled host. This is intended to cover both new
workloads and all move operations to a disabled host.

The host is being rejected as intended, so I am setting this to Invalid, as the
expectations of the reporter do not match the intended semantics of the API.


** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2076089

Title:
  admin cannot force instance launch on disabled host

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Description
  ===
  I have a set of disabled nova-compute services, with the nova-compute service
  up and running, and I would like to force instance creation, as admin, on a
  disabled compute node for testing purposes.

  I added the option --availability-zone nova:$HOST to the openstack
  server create command; however, it fails with "no valid host found"
  even though it should have skipped placement filters.

  Steps to reproduce
  ==

  * openstack compute service list --service nova-compute
  
  +--------------------------------------+--------------+--------------+------+----------+-------+------------------------+
  | ID                                   | Binary       | Host         | Zone | Status   | State | Updated At             |
  +--------------------------------------+--------------+--------------+------+----------+-------+------------------------+
  | cdfe3225-a705-4c30-9f1b-a34be15d89a0 | nova-compute | test1-cg0001 | nova | disabled | down  | 2024-07-18T16:03:00.00 |
  | f44ad40d-b161-48b0-914a-738638dc10ea | nova-compute | test1-c0001  | nova | enabled  | up    | 2024-08-05T09:57:08.00 |
  | 3e15725b-6b9d-44e9-ae03-fe121d75017c | nova-compute | test1-c0003  | nova | disabled | up    | 2024-08-05T09:57:05.00 |
  +--------------------------------------+--------------+--------------+------+----------+-------+------------------------+

  * openstack server create --wait --flavor 016016 --boot-from-volume 20 
--image "Debian 12 (Switch Cloud)" --network my_private_network 
--availability-zone nova:test1-c0003 strider-force-launch
  Error creating server: strider-force-launch
  Error creating server

  Expected result
  ===
  The launch process should have skipped placement filters and the instance
  should have been launched on the requested hypervisor.

  Actual result
  =
  * Failure reason is "No valid host found":
  openstack server show strider-force-launch
  
  +-------------------------------------+---------------+
  | Field                               | Value         |
  +-------------------------------------+---------------+
  | OS-DCF:diskConfig                   | MANUAL        |
  | OS-EXT-AZ:availability_zone         | nova          |
  | OS-EXT-SRV-ATTR:host                | None          |
  | OS-EXT-SRV-ATTR:hypervisor_hostname | None          |
  | OS-EXT-SRV-ATTR:instance_name       | instance-1256 |
  | OS-EXT-STS:power_state              | NOSTATE       |
  | OS-EXT-STS:task_state               | None          |
  | OS-EXT-STS:vm_state                 | error         |
  | OS-SRV-USG:launched_at              | None          |
  | OS-SRV-USG:terminated_at            | None          |
  | accessIPv4                          |               |
  | accessIPv6                          |               |
  | addresses                           |               |

[Yahoo-eng-team] [Bug 1913016] Re: nova api os-resetState should not reset the state when VM is shelved_offloaded

2024-11-12 Thread sean mooney
** Changed in: nova
   Status: In Progress => Opinion

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1913016

Title:
  nova api os-resetState should not reset the state when VM is
  shelved_offloaded

Status in OpenStack Compute (nova):
  Opinion

Bug description:
  When the VM is in the SHELVED_OFFLOADED state the VM doesn't exist
  physically on any compute node, so resetting the state to active or
  error might cause DB inconsistency and also make unshelving
  difficult.

  ~~~
  (overcloud) [stack@undercloud ~]$ nova list 
  
  +--------------------------------------+-------+-------------------+------------+-------------+------------------------------+
  | ID                                   | Name  | Status            | Task State | Power State | Networks                     |
  +--------------------------------------+-------+-------------------+------------+-------------+------------------------------+
  | f86f9503-02c3-4c11-bd61-bfd9b9b8ad21 | test2 | SHELVED_OFFLOADED | -          | Shutdown    | sriov-net1-197=10.74.167.185 |
  +--------------------------------------+-------+-------------------+------------+-------------+------------------------------+
  (overcloud) [stack@undercloud ~]$ openstack server set --state active test2
  (overcloud) [stack@undercloud ~]$ openstack server list
  +--------------------------------------+-------+--------+------------------------------+---------+-----------+
  | ID                                   | Name  | Status | Networks                     | Image   | Flavor    |
  +--------------------------------------+-------+--------+------------------------------+---------+-----------+
  | f86f9503-02c3-4c11-bd61-bfd9b9b8ad21 | test2 | ACTIVE | sriov-net1-197=10.74.167.185 | rhel7.7 | m1-medium |
  +--------------------------------------+-------+--------+------------------------------+---------+-----------+
  (overcloud) [stack@undercloud ~]$ openstack server unshelve test2
  Cannot 'unshelve' instance f86f9503-02c3-4c11-bd61-bfd9b9b8ad21 while it is 
in vm_state active (HTTP 409) (Request-ID: 
req-c992c5f5-63c9-4472-be75-9594bc682b37)
  ~~~

  Not just unshelve; we cannot perform any VM operation, as the VM doesn't
  exist anywhere.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1913016/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2086867] Re: [RFE] Allow SCP operations to use IP addresses during Nova migrations

2024-11-28 Thread sean mooney
with the merging of https://review.opendev.org/c/openstack/nova/+/909122
this has been resolved on master.


** Also affects: nova/2024.1
   Importance: Undecided
   Status: New

** Also affects: nova/2024.2
   Importance: Undecided
   Status: New

** Also affects: nova/bobcat
   Importance: Undecided
   Status: New

** Also affects: nova/2025.1
   Importance: Undecided
   Status: New

** Changed in: nova/2025.1
   Status: New => Fix Released

** Changed in: nova/2024.2
   Status: New => In Progress

** Changed in: nova/2025.1
   Importance: Undecided => Low

** Changed in: nova/2024.2
   Importance: Undecided => Low

** Changed in: nova/2025.1
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

** Changed in: nova/2024.2
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

** Changed in: nova/2024.1
   Importance: Undecided => Low

** Changed in: nova/2024.1
   Status: New => Triaged

** Changed in: nova/bobcat
   Importance: Undecided => Low

** Changed in: nova/bobcat
   Status: New => Triaged

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2086867

Title:
  [RFE] Allow SCP operations to use IP addresses during Nova migrations

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) 2024.1 series:
  Triaged
Status in OpenStack Compute (nova) 2024.2 series:
  In Progress
Status in OpenStack Compute (nova) 2025.1 series:
  Fix Released
Status in OpenStack Compute (nova) bobcat series:
  Triaged

Bug description:
  When DNS resolution is unavailable in the environment, Nova compute
  operations that rely on SCP transfers between compute nodes fail
  because of failed hostname resolution.

  The proposed solution is to add a new configuration option
  [libvirt]migrations_use_ip_to_scp that allows destination compute nodes to use
  the source compute's IP address instead of its hostname for SCP operations.
  When enabled, Nova will look up the source compute's IP address from the
  database and use it for file transfers.
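
  Purely as an illustration (the option name is taken from this report and the
  merged change may differ), the proposal amounts to a new boolean option in
  the [libvirt] group, e.g. registered with oslo.config along these lines:

  from oslo_config import cfg

  migration_opts = [
      cfg.BoolOpt('migrations_use_ip_to_scp',
                  default=False,
                  help='Use the source compute node IP address, looked up '
                       'from the database, instead of its hostname for SCP '
                       'transfers during migrations. Useful when DNS '
                       'resolution is unavailable.'),
  ]

  def register_opts(conf):
      conf.register_opts(migration_opts, group='libvirt')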

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2086867/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1506127] [NEW] enable vhost-user support with neutron ovs agent

2015-10-14 Thread sean mooney
Public bug reported:

In the Kilo cycle vhost-user support was added to Nova and was supported out of
tree via the networking-ovs-dpdk ML2 driver and L2 agent on StackForge.

In Liberty, agent modifications were upstreamed to enable the standard
neutron openvswitch agent to manage the netdev datapath.

In Mitaka it is desirable to remove all dependence on the networking-ovs-dpdk
repo and enable the standard OVS ML2 driver to support vhost-user on enabled
vswitches.

To enable vhost-user support, the following changes are proposed to the
neutron openvswitch agent and ML2 mechanism driver.

AGENT CHANGES:

To determine whether a vswitch supports vhost-user interfaces, two pieces of
information are required: the bridge datapath_type and the list of supported
interface types from the OVSDB. The datapath_type field is required to ensure
the node is configured to use the DPDK-enabled netdev datapath.

The supported interface types field in the OVSDB contains a list of all
supported interface types for all supported datapath_types. If the
ovs-vswitchd process has been compiled with support for DPDK interfaces but
not started with DPDK enabled, DPDK interfaces will be omitted from this
list.

The OVS neutron agent will be extended to query the supported interface
types parameter in the OVSDB and append it to the configuration section of the
agent state report. The OVS neutron agent will also be extended to append
the configured datapath_type to the configuration section of the agent
state report. The OVS lib will be extended to retrieve the supported
interface types from the OVSDB.

ML2 DRIVER CHANGES:

The OVS ML2 mechanism driver will be extended to consult the agent configuration
when selecting the VIF type and VIF binding details to install.

If the datapath is netdev and the supported interface types contain
vhost-user, it will be enabled. In all other cases it will fall back to
the current behavior. This mechanism will allow easy extension of the
OVS neutron agent to support other OVS interface types in the future if
they are enabled in Nova.
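
As a rough sketch of the two OVSDB lookups described above (assuming the
ovs-vsctl CLI is available; the actual change would extend neutron's OVS lib
rather than shelling out):

import subprocess

def bridge_datapath_type(bridge='br-int'):
    out = subprocess.check_output(
        ['ovs-vsctl', 'get', 'Bridge', bridge, 'datapath_type'])
    return out.decode().strip()

def supported_iface_types():
    # the Open_vSwitch table advertises the interface types the running
    # vswitchd supports; dpdk vhost-user types only appear when DPDK is
    # actually enabled.
    out = subprocess.check_output(
        ['ovs-vsctl', 'get', 'Open_vSwitch', '.', 'iface_types'])
    return out.decode().strip()

# vhost-user can be offered when the datapath is "netdev" and a vhost-user
# interface type is present in the supported interface types.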

** Affects: neutron
 Importance: Undecided
 Assignee: sean mooney (sean-k-mooney)
 Status: New


** Tags: rfe

** Changed in: neutron
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1506127

Title:
  enable vhost-user support with neutron ovs agent

Status in neutron:
  New

Bug description:
  In the Kilo cycle vhost-user support was added to Nova and was supported out
  of tree via the networking-ovs-dpdk ML2 driver and L2 agent on StackForge.

  In Liberty, agent modifications were upstreamed to enable the standard
  neutron openvswitch agent to manage the netdev datapath.

  In Mitaka it is desirable to remove all dependence on the networking-ovs-dpdk
  repo and enable the standard OVS ML2 driver to support vhost-user on enabled
  vswitches.

  To enable vhost-user support, the following changes are proposed to the
  neutron openvswitch agent and ML2 mechanism driver.

  AGENT CHANGES:

  To determine whether a vswitch supports vhost-user interfaces, two pieces of
  information are required: the bridge datapath_type and the list of supported
  interface types from the OVSDB. The datapath_type field is required to ensure
  the node is configured to use the DPDK-enabled netdev datapath.

  The supported interface types field in the OVSDB contains a list of all
  supported interface types for all supported datapath_types. If the
  ovs-vswitchd process has been compiled with support for DPDK interfaces but
  not started with DPDK enabled, DPDK interfaces will be omitted from this
  list.

  The OVS neutron agent will be extended to query the supported interface
  types parameter in the OVSDB and append it to the configuration section of
  the agent state report. The OVS neutron agent will also be extended to
  append the configured datapath_type to the configuration section of
  the agent state report. The OVS lib will be extended to retrieve the
  supported interface types from the OVSDB.

  ML2 DRIVER CHANGES:

  The OVS ML2 mechanism driver will be extended to consult the agent
  configuration when selecting the VIF type and VIF binding details to install.

  If the datapath is netdev and the supported interface types contain
  vhost-user, it will be enabled. In all other cases it will fall back to
  the current behavior. This mechanism will allow easy extension of the
  OVS neutron agent to support other OVS interface types in the future
  if they are enabled in Nova.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1506127/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1509184] [NEW] Enable openflow based dvr routing for east/west traffic

2015-10-22 Thread sean mooney
 vm ->  destination mac update, ttl decremented -> dest vm ( 
single openflow action)
- icmp from dest vm ->  destination mac update, ttl decremented -> source vm ( 
single openflow action)

other considerations:

- north/south
as ovs cannot look up the destination mac dynamically via arp it is not
possible to optimise the north/south path as described above.

- Open vSwitch support
this mechanism is compatible with both kernel and dpdk ovs.
this mechanism requires nicira extensions for arp rewrite.
arp rewrite can be skipped for greater support if required, as it will fall
back to the tap device and kernel.
icmp traffic for the router interface will be handled by the tap device, as ovs
currently does not support setting the icmp type code via set_field or load
openflow actions.

- performance
performance of l3 routing is expected to approach l2 performance for
east/west traffic.
performance is not expected to change for north/south.

** Affects: neutron
 Importance: Undecided
 Assignee: sean mooney (sean-k-mooney)
     Status: New


** Tags: rfe

** Changed in: neutron
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1509184

Title:
  Enable openflow based dvr routing for east/west traffic

Status in neutron:
  New

Bug description:
  In the Juno cycle DVR support was added to neutron to decentralise routing to
  the compute nodes.
  This RFE bug proposes the introduction of a new DVR mode (dvr_local_openflow)
  to optimise the datapath of east/west traffic.

  --- High level description ---
  The current implementation of DVR with OVS utilizes Linux network namespaces
  to instantiate L3 routers, the details of which are described here:
  http://docs.openstack.org/networking-guide/scenario_dvr_ovs.html

  Fundamentally a neutron router comprises 3 elements:
  - a router instance (network namespace)
  - a router interface (tap device)
  - a set of routing rules (kernel ip routes)

  In the special case of routing east/west traffic, both the source and
  destination interfaces are known to neutron.
  Because of that fact neutron contains all the information required to
  logically route traffic from its origin to its destination, enabling the path
  to be established primitively. This proposal suggests moving the
  instantiation of the DVR local router from the kernel IP stack to Open
  vSwitch (OVS) for east/west traffic.

  Open vSwitch provides a logical programmable interface (OpenFlow) to
  configure traffic forwarding and modification actions on arbitrary packet
  streams. When managed by the neutron openvswitch L2 agent, OVS operates as a
  simple MAC-learning switch with limited utilisation of its programmable
  dataplane. To utilise OVS to create an L3 router, the following mappings from
  the 3 fundamental elements can be made:
  - a router instance (network namespace + an ovs bridge)
  - a router interface (tap device + patch port pair)
  - a set of routing rules (kernel ip routes + openflow rules)

  --- Background context ---
  TL;DR
  basic explanation of openflow/ovs bridges and patch ports;
  skip to the implementation section if familiar.

  ovs implementation background:
  In Open vSwitch, at the control layer an ovs bridge is a unique logical
  domain of interfaces and flow rules.
  Similarly, at the control layer a patch port pair is a logical entity that
  interconnects two bridges (or logical domains).

  From a dataplane perspective each ovs bridge is first created as a separate
  instance of a dataplane.
  If these separate bridges/dataplanes are interconnected by patch ports, ovs
  will collapse the independent dataplanes into a single ovs dataplane
  instance. As a direct result of this implementation, a logical topology of 1
  bridge with two interfaces is realised at the dataplane level identically to
  2 bridges, each with 1 interface, interconnected by patch ports. This
  translates to zero dataplane overhead for the creation of multiple bridges,
  allowing arbitrary numbers of router instances to be created.

  Openflow capability background:
  The OpenFlow protocol provides many capabilities which can be generally
  summarised as packet match criteria and actions to apply when the criteria
  are satisfied. In the case of L3 routing the match criteria of relevance are
  the Ethernet type and the destination IP address. Similarly, the OpenFlow
  actions required are mod_dest, set_field, move, dec_ttl, output and drop.
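
  As a purely illustrative sketch of the shape of such an east/west routing
  flow, expressed against a hypothetical bridge object with a neutron-style
  add_flow(**kwargs); the MACs, IP and port number are made up:

  def install_east_west_flow(bridge):
      bridge.add_flow(
          priority=100,
          dl_type=0x0800,                            # IPv4
          nw_dst='192.0.2.4',                        # destination VM fixed IP
          actions=('mod_dl_src:fa:16:3e:00:00:01,'   # router interface MAC
                   'mod_dl_dst:fa:16:3e:00:00:02,'   # destination VM MAC
                   'dec_ttl,output:5'))              # ofport of the dest VM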

  logical packet flow for a ping between two vms on the same host:
  in the l2 case, if a vm tries to ping another vm in the same subnet there are
  4 stages.
  - first it will send a broadcast arp packet to learn the mac address from the
  destination ip of th

[Yahoo-eng-team] [Bug 1815989] Re: OVS drops RARP packets by QEMU upon live-migration causes up to 40s ping pause in Rocky

2021-05-13 Thread sean mooney
** Also affects: nova/victoria
   Importance: Undecided
   Status: New

** Also affects: nova/train
   Importance: Undecided
   Status: New

** Also affects: nova/ussuri
   Importance: Undecided
   Status: New

** Also affects: nova/wallaby
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1815989

Title:
  OVS drops RARP packets by QEMU upon live-migration causes up to 40s
  ping pause in Rocky

Status in neutron:
  In Progress
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) train series:
  New
Status in OpenStack Compute (nova) ussuri series:
  New
Status in OpenStack Compute (nova) victoria series:
  New
Status in OpenStack Compute (nova) wallaby series:
  New
Status in os-vif:
  Invalid

Bug description:
  This issue is well known, and there were previous attempts to fix it,
  like this one

  https://bugs.launchpad.net/neutron/+bug/1414559

  
  This issue still exists in Rocky and gets worse. In Rocky, nova compute, nova 
libvirt and neutron ovs agent all run inside containers.

  So far the only simple fix I have is to increase the number of RARP
  packets QEMU sends after live-migration from 5 to 10. To be complete,
  the nova change (not merged) proposed in the above-mentioned activity
  does not work.

  I am creating this ticket hoping to get up-to-date (for Rocky and
  onwards) expert advice on how to fix this in nova-neutron.

  
  For the record, below are the time stamps in my test between neutron ovs 
agent "activating" the VM port and rarp packets seen by tcpdump on the compute. 
10 RARP packets are sent by (recompiled) QEMU, 7 are seen by tcpdump, the 2nd 
last packet barely made through.

  openvswitch-agent.log:

  2019-02-14 19:00:13.568 73453 INFO
  neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
  [req-26129036-b514-4fa0-a39f-a6b21de17bb9 - - - - -] Port
  57d0c265-d971-404d-922d-963c8263e6eb updated. Details: {'profile': {},
  'network_qos_policy_id': None, 'qos_policy_id': None,
  'allowed_address_pairs': [], 'admin_state_up': True, 'network_id':
  '1bf4b8e0-9299-485b-80b0-52e18e7b9b42', 'segmentation_id': 648,
  'fixed_ips': [

  {'subnet_id': 'b7c09e83-f16f-4d4e-a31a-e33a922c0bac', 'ip_address': 
'10.0.1.4'}
  ], 'device_owner': u'compute:nova', 'physical_network': u'physnet0', 
'mac_address': 'fa:16:3e:de:af:47', 'device': 
u'57d0c265-d971-404d-922d-963c8263e6eb', 'port_security_enabled': True, 
'port_id': '57d0c265-d971-404d-922d-963c8263e6eb', 'network_type': u'vlan', 
'security_groups': [u'5f2175d7-c2c1-49fd-9d05-3a8de3846b9c']}
  2019-02-14 19:00:13.568 73453 INFO 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
[req-26129036-b514-4fa0-a39f-a6b21de17bb9 - - - - -] Assigning 4 as local vlan 
for net-id=1bf4b8e0-9299-485b-80b0-52e18e7b9b42

   
  tcpdump for rarp packets:

  [root@overcloud-ovscompute-overcloud-0 nova]# tcpdump -i any rarp -nev
  tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 
262144 bytes

  19:00:10.788220 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:11.138216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:11.588216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:12.138217 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:12.788216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:13.538216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:14.388320 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1815989/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1747496] Re: MTUs are not set for VIFs if using kernel ovs + hybrid plug = false

2021-05-14 Thread sean mooney
** Changed in: nova
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1747496

Title:
  MTUs are not set for VIFs if using kernel ovs + hybrid plug = false

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Description
  ===
  over the last few cycles support for MTUs other than the default of 1500 has
  generally been improved in both nova and neutron.
  At the same time it was decided to remove the responsibility for VIF plugging
  from the virt drivers and centralise it in os-vif.

  over the last few cycles os-vif has been enhanced to support setting the mtu
  on all codepaths and this work was completed in Pike; however, there are
  still codepaths in the nova libvirt driver where os-vif is not used to plug
  the VIF and instead it is done by libvirt.

  when the VIF_TYPE is ovs and hybrid_plug=False, libvirt plugs the VM's
  VIFs itself and os-vif is only responsible for creating the bridge it
  will be plugged into. in this case, as the mtu is not set in the
  libvirt xml and since os-vif is not responsible for plugging the vif,
  nothing sets the mtu on the tap device that is added to ovs. This
  scenario arises whenever libvirt is the nova virt driver and the no-
  op or openvswitch security group drivers are used.

  the end result is that in the vm the guest correctly receives the non-
  default (e.g. jumbo frame) mtu from the neutron dhcp server and
  configures the mtu in its kernel, but the mtu of the tap device added
  to the ovs bridge is left at the default of 1500, preventing
  jumbo frames from being used by the guest.

  Steps to reproduce
  ==
  using a host with a non-default mtu

  deploy devstack normally using libvirt + kvm/qemu
  and enable the openvswitch or no-op neutron security group driver

  
  [[post-config|/etc/neutron/plugins/ml2/ml2_conf.ini]]
  [securitygroup]
  firewall_driver = openvswitch

  or

  [[post-config|/etc/neutron/plugins/ml2/ml2_conf.ini]]
  [securitygroup]
  firewall_driver = noop

  spawn a single vm via nova.

  and retrieve the name of the interface from the ovsdb or
  via virsh dumpxml.

  then run ifconfig and check the mtu (see the sketch below).

  note if the openvswitch driver is used you will need to allow icmp/ssh
  in the security groups to be able to validate network connectivity.
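
  A small sketch of that MTU check (device name and expected MTU are
  placeholders), reading the value from sysfs instead of parsing ifconfig
  output:

  def read_mtu(dev):
      with open('/sys/class/net/%s/mtu' % dev) as f:
          return int(f.read().strip())

  if __name__ == '__main__':
      dev, expected = 'tap1234abcd-ef', 9000   # placeholders
      print('%s mtu=%d (expected %d)' % (dev, read_mtu(dev), expected))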

  Expected result
  ===
  the tap should have the same mtu as is set on the neutron network,
  and a ping of the max mtu, e.g. ping -s 9000 ... for a network mtu of 9000,
  should work.

  Actual result
  =
  the tap mtu will be 1500;
  it is not possible to ping the vm with a packet larger than 

  Environment
  ===
  1. it was seen on pike but this affects all versions of openstack.
     before the introduction of os-vif we did not support neutron network mtus
     and after we started to use os-vif we enabled neutron mtu support only for
     the os-vif codepath, so this never worked.

  2. Which hypervisor did you use?
 libvirt with kvm. this is not libvirt version
 specific as we do not generate the libvirt xml to set the mtu
 https://libvirt.org/formatdomain.html#mtu

  2. Which storage type did you use?
 N/a but i used ceph

  3. Which networking type did you use?
  neutron with kernel ovs and the noop or openvswitch security group driver.
     note this will not happen with the iptables driver as that sets
     hybrid_plug=True, so os-vif is used to plug the VIF and it sets the mtu
     correctly.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1747496/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1815989] Re: OVS drops RARP packets by QEMU upon live-migration causes up to 40s ping pause in Rocky

2021-05-17 Thread sean mooney
** Changed in: nova/train
   Status: Fix Released => New

** Also affects: neutron/ussuri
   Importance: Undecided
 Assignee: Rodolfo Alonso (rodolfo-alonso-hernandez)
   Status: In Progress

** Also affects: neutron/wallaby
   Importance: Undecided
   Status: New

** Also affects: neutron/train
   Importance: Undecided
   Status: New

** Also affects: neutron/victoria
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1815989

Title:
  OVS drops RARP packets by QEMU upon live-migration causes up to 40s
  ping pause in Rocky

Status in neutron:
  In Progress
Status in neutron train series:
  New
Status in neutron ussuri series:
  In Progress
Status in neutron victoria series:
  New
Status in neutron wallaby series:
  New
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) train series:
  New
Status in OpenStack Compute (nova) ussuri series:
  New
Status in OpenStack Compute (nova) victoria series:
  New
Status in OpenStack Compute (nova) wallaby series:
  New
Status in os-vif:
  Invalid

Bug description:
  This issue is well known, and there were previous attempts to fix it,
  like this one

  https://bugs.launchpad.net/neutron/+bug/1414559

  
  This issue still exists in Rocky and gets worse. In Rocky, nova compute, nova 
libvirt and neutron ovs agent all run inside containers.

  So far the only simple fix I have is to increase the number of RARP
  packets QEMU sends after live-migration from 5 to 10. To be complete,
  the nova change (not merged) proposed in the above-mentioned activity
  does not work.

  I am creating this ticket hoping to get up-to-date (for Rocky and
  onwards) expert advice on how to fix this in nova-neutron.

  
  For the record, below are the time stamps in my test between neutron ovs 
agent "activating" the VM port and rarp packets seen by tcpdump on the compute. 
10 RARP packets are sent by (recompiled) QEMU, 7 are seen by tcpdump, the 2nd 
last packet barely made through.

  openvswitch-agent.log:

  2019-02-14 19:00:13.568 73453 INFO
  neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
  [req-26129036-b514-4fa0-a39f-a6b21de17bb9 - - - - -] Port
  57d0c265-d971-404d-922d-963c8263e6eb updated. Details: {'profile': {},
  'network_qos_policy_id': None, 'qos_policy_id': None,
  'allowed_address_pairs': [], 'admin_state_up': True, 'network_id':
  '1bf4b8e0-9299-485b-80b0-52e18e7b9b42', 'segmentation_id': 648,
  'fixed_ips': [

  {'subnet_id': 'b7c09e83-f16f-4d4e-a31a-e33a922c0bac', 'ip_address': 
'10.0.1.4'}
  ], 'device_owner': u'compute:nova', 'physical_network': u'physnet0', 
'mac_address': 'fa:16:3e:de:af:47', 'device': 
u'57d0c265-d971-404d-922d-963c8263e6eb', 'port_security_enabled': True, 
'port_id': '57d0c265-d971-404d-922d-963c8263e6eb', 'network_type': u'vlan', 
'security_groups': [u'5f2175d7-c2c1-49fd-9d05-3a8de3846b9c']}
  2019-02-14 19:00:13.568 73453 INFO 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
[req-26129036-b514-4fa0-a39f-a6b21de17bb9 - - - - -] Assigning 4 as local vlan 
for net-id=1bf4b8e0-9299-485b-80b0-52e18e7b9b42

   
  tcpdump for rarp packets:

  [root@overcloud-ovscompute-overcloud-0 nova]# tcpdump -i any rarp -nev
  tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 
262144 bytes

  19:00:10.788220 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:11.138216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:11.588216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:12.138217 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:12.788216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:13.538216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:14.388320 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1815989/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.n

[Yahoo-eng-team] [Bug 1930706] [NEW] nova allows suboptimal emulator thread pinning for realtime guests

2021-06-03 Thread sean mooney
Public bug reported:

today whenever you use a realtime guest you are required to enable cpu
pinning and other features such as specifying a realtime core mask via
hw:cpu_realtime_mask or hw_cpu_realtime_mask.

in the victoria release this requirement was relaxed somewhat with the
introduction of mixed cpu policy guests that are assigned both pinned and
floating cores.

https://github.com/openstack/nova/commit/9fc63c764429c10f9041e6b53659e0cbd595bf6b


It is now possible to allocate all cores in an instance to realtime and
omit the ``hw:cpu_realtime_mask`` extra spec. This requires specifying the
``hw:emulator_threads_policy`` extra spec.

https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/releasenotes/notes/bug-1884231-16acf297d88b122e.yaml

however while that works well it is also possible to set hw:cpu_realtime_mask
but not specify hw:emulator_threads_policy, which leads to suboptimal
xml generation in the libvirt driver.

this is reported downstream as
https://bugzilla.redhat.com/show_bug.cgi?id=1700390 for older releases
that predate the changes referenced above.

though on re-evaluation of this a possible improvement can be made as
detailed in https://bugzilla.redhat.com/show_bug.cgi?id=1700390#c11


today if we have a 2 core vm where guest cpu 0 is non-realtime and guest cpu 1
is realtime, e.g. hw:cpu_policy=dedicated hw:cpu_realtime=True
hw:cpu_realtime_mask=^0, we would generate the xml as follows:
  
  
  
  

this is because the default behavior, when no emulator_threads_policy is
specified, is for the emulator thread to float over all the vm cores.

but a slight modification to the xml could be made to have a more optimal
default in this case: using the cpu_realtime_mask we can instead restrict the
emulator thread to float over the non-realtime cores with realtime priority.

  
  
  
  
  

this will ensure that if qemu needs to process a request, for a device attach
for example, the emulator thread has higher priority than the guest vcpus that
deal with guest housekeeping tasks but will not interrupt the realtime cores.

this would give many of the benefits of emulator_threads_policy=share or
emulator_threads_policy=isolate without increasing resource usage or
requiring any config, flavor or image changes. this should also be a
backportable solution to this problem.

this is especially important given realtime hosts are often deployed with
the kernel isolcpus parameter, which means that the kernel will not load
balance the emulator thread across the range and will instead leave it
on the core it initially spawned on. today you could get lucky and it
could be spawned on core 0, in which case the new behavior would be the same,
or it could get spawned on core 1. when the emulator thread is spawned
on core 1, since it has less priority than the vcpu thread, it will only
run if the guest vcpu idles, resulting in the inability for qemu to process
device attach and other qemu monitor commands from libvirt or the user.
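
To make the example concrete, a standalone sketch (not nova's actual parser,
and simplified to handle only "^N" exclusion entries) of deriving the
non-realtime vCPU set from hw:cpu_realtime_mask, which is the set the emulator
thread would be confined to under this proposal:

def non_realtime_vcpus(realtime_mask, vcpu_count):
    # vcpus excluded from the realtime set ("^N") are the housekeeping
    # vcpus the emulator thread could safely share.
    excluded = {int(part.strip()[1:])
                for part in realtime_mask.split(',')
                if part.strip().startswith('^')}
    return sorted(v for v in range(vcpu_count) if v in excluded)

# hw:cpu_realtime_mask=^0 on a 2 vcpu guest -> [0]
print(non_realtime_vcpus('^0', 2))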

** Affects: nova
 Importance: Wishlist
 Status: Triaged


** Tags: libvirt numa

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1930706

Title:
  nova allows suboptimal emulator thread pinning for realtime guests

Status in OpenStack Compute (nova):
  Triaged

Bug description:
  today whenever you use a realtime guest you are required to enable
  cpu pinning and other features such as specifying a realtime core mask
  via hw:cpu_realtime_mask or hw_cpu_realtime_mask.

  in the victoria release this requirement was relaxed somewhat with the
  introduction of mixed cpu policy guests that are assigned both pinned and
  floating cores.

  
https://github.com/openstack/nova/commit/9fc63c764429c10f9041e6b53659e0cbd595bf6b

  
  It is now possible to allocate all cores in an instance to realtime and
  omit the ``hw:cpu_realtime_mask`` extra spec. This requires specifying the
  ``hw:emulator_threads_policy`` extra spec.

  
https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/releasenotes/notes/bug-1884231-16acf297d88b122e.yaml

  however while that works well it is also possible to set hw:cpu_realtime_mask
  but not specify hw:emulator_threads_policy, which leads to suboptimal
  xml generation in the libvirt driver.

  this is reported downstream as
  https://bugzilla.redhat.com/show_bug.cgi?id=1700390 for older releases
  that predate the changes referenced above.

  though on re-evaluation of this a possible improvement can be made as
  detailed in https://bugzilla.redhat.com/show_bug.cgi?id=1700390#c11

  
  today if we have a 2 core vm where guest cpu 0 is non-realtime and guest
  cpu 1 is realtime, e.g. hw:cpu_policy=dedicated hw:cpu_realtime=True
  hw:cpu_realtime_mask=^0, we would generate the xml as follows





  this is because the default behavior when no emulator_threads_policy is
  specified is for the
  emulator thread to float o

[Yahoo-eng-team] [Bug 1929446] Re: OVS polling loop created by ovsdbapp and os-vif starving n-cpu threads

2021-06-15 Thread sean mooney
Setting this to Invalid for nova, as the error is in the OVS python bindings.
Marked as Triaged for os-vif to track the enhancements proposed in comment 3
above.

** Also affects: ovsdbapp
   Importance: Undecided
   Status: New

** Changed in: os-vif
   Status: New => Triaged

** Changed in: os-vif
   Importance: Undecided => Medium

** Changed in: nova
   Status: Triaged => Invalid

** Changed in: os-vif
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1929446

Title:
  OVS polling loop created by ovsdbapp and os-vif starving n-cpu threads

Status in OpenStack Compute (nova):
  Invalid
Status in os-vif:
  Triaged
Status in ovsdbapp:
  New

Bug description:
  I've been seeing lots of failures caused by timeouts in
  test_volume_backed_live_migration during the live-migration and
  multinode grenade jobs, for example:

  
https://zuul.opendev.org/t/openstack/build/bb6fd21b5d8c471a89f4f6598aa84e5d/logs

  During check_can_live_migrate_source I'm seeing the following gap in
  the logs that I can't explain:

  12225 May 24 10:23:02.637600 ubuntu-focal-inap-mtl01-0024794054 
nova-compute[107012]: DEBUG nova.virt.libvirt.driver [None 
req-b5288b85-d642-426f-a525-c64724fe4091 tempest-LiveMigrationTest-312230369 
tempest-LiveMigrationTest-312230369-project-admin] [instance: 
91a0e0ca-e6a8-43ab-8e68-a10a77ad615b] Check if temp file 
/opt/stack/data/nova/instances/tmp5lcmhuri exists to indicate shared storage is 
being used for migration. Exists? False {{(pid=107012) 
_check_shared_storage_test_file 
/opt/stack/nova/nova/virt/libvirt/driver.py:9367}}
  [..]
  12282 May 24 10:24:22.385187 ubuntu-focal-inap-mtl01-0024794054 
nova-compute[107012]: DEBUG nova.virt.libvirt.driver [None 
req-b5288b85-d642-426f-a525-c64724fe4091 tempest-LiveMigrationTest-312230369 
tempest-LiveMigrationTest-312230369-project-admin] skipping disk /dev/sdb (vda) 
as it is a volume {{(pid=107012) _get_instance_disk_info_from_config 
/opt/stack/nova/nova/virt/libvirt/driver.py:10458}}

  ^ this leads to both the HTTP request to live migrate (that's still a
  synchronous call at this point [1]) *and* the RPC call from the dest
  to the source both timing out.

  [1] https://docs.openstack.org/nova/latest/reference/live-
  migration.html

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1929446/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1933097] [NEW] libvirt machine types are case sensitive but we do not validate them in nova

2021-06-21 Thread sean mooney
Public bug reported:

as seen in http://paste.openstack.org/show/806818/ if we use machine type "Q35"
instead of "q35" we fail to boot the vm. This is because the machine type names
in libvirt are case sensitive.
however due to the way libvirt validates the xml it returns a "No PCI buses
available" error instead of an "incorrect machine type" error or similar that
would be more intuitive.

2021-06-20 02:37:39.795 7 ERROR nova.virt.libvirt.guest 
[req-04cb6169-bee4-407d-aef1-2e22abfccf97 329bf2535969456cb83fbc8e338ecb4c 
5f3ea501afce4858b43186166d4d7afb - default default] Error defining a guest with 
XML: 
  e2f47fae-7684-4f23-9f3e-39a6b133f929
  instance-0006
  4194304
  4
  
http://openstack.org/xmlns/libvirt/nova/1.1";>
  
  test
  2021-06-20 02:37:39
  
4096
10
0
0
4
  
  
sean
sean
  
  
  

  

  

  
  

  OpenStack Foundation
  OpenStack Nova
  23.0.2
  e2f47fae-7684-4f23-9f3e-39a6b133f929
  e2f47fae-7684-4f23-9f3e-39a6b133f929
  Virtual Machine

  
  
hvm


  
  


  
  
4096
  
  



  
  

  
  

  
  

  
  

  
  
  



  
  
  
  
  
  

  


  



  



  /dev/urandom



  

  

: libvirt.libvirtError: XML error: No PCI buses available


since the libvirt machine types are case sensitive we cannot assume we can just
lowercase the user's input, but we should still be able to normalise the
machine types in the following way.


on startup we call virsh capabilities to retrieve info from libvirt regarding
the capabilities of the host. from that api we can retrieve the set of
supported machine types.
we can then construct a dictionary of lower-case machine type names to
correct-case machine type names.

when booting a vm we should lowercase the user input and look up the
correct case from this dictionary. this will allow nova to continue to
treat the input as case insensitive but still pass the correct value to
libvirt.
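
A minimal sketch of that normalisation, assuming the set of supported machine
types has already been read from the host capabilities:

def build_machine_type_map(supported_types):
    return {mt.lower(): mt for mt in supported_types}

def normalise_machine_type(user_value, supported_types):
    mapping = build_machine_type_map(supported_types)
    try:
        return mapping[user_value.lower()]
    except KeyError:
        raise ValueError('unsupported machine type: %s' % user_value)

# normalise_machine_type('Q35', ['pc', 'q35', 'pc-q35-5.2']) -> 'q35'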

** Affects: nova
 Importance: Low
 Status: Triaged


** Tags: libvirt

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1933097

Title:
  libvirt machine types are case sensitive but we do not validate them
  in nova

Status in OpenStack Compute (nova):
  Triaged

Bug description:
  as seen in http://paste.openstack.org/show/806818/ if we use machine type
  "Q35" instead of "q35" we fail to boot the vm. This is because the machine
  type names in libvirt are case sensitive.
  however due to the way libvirt validates the xml it returns a "No PCI buses
  available" error instead of an "incorrect machine type" error or similar that
  would be more intuitive.

  2021-06-20 02:37:39.795 7 ERROR nova.virt.libvirt.guest 
[req-04cb6169-bee4-407d-aef1-2e22abfccf97 329bf2535969456cb83fbc8e338ecb4c 
5f3ea501afce4858b43186166d4d7afb - default default] Error defining a guest with 
XML: 
e2f47fae-7684-4f23-9f3e-39a6b133f929
instance-0006
4194304
4

  http://openstack.org/xmlns/libvirt/nova/1.1";>

test
2021-06-20 02:37:39

  4096
  10
  0
  0
  4


  sean
  sean



  

  

  


  
OpenStack Foundation
OpenStack Nova
23.0.2
e2f47fae-7684-4f23-9f3e-39a6b133f929
e2f47fae-7684-4f23-9f3e-39a6b133f929
Virtual Machine
  


  hvm
  
  


  
  


  4096


  
  
  


  


  


  


  



  
  
  






  

  
  

  
  
  

  
  
  
/dev/urandom
  
  
  

  

  
  : libvirt.libvirtError: XML error: No PCI buses available

  
  since the libvirt machine types are case sensitive we cannot assume we can
  just lowercase the user's input, but we should still be able to normalise the
  machine types in the following way.

  on startup we call virsh capabilities to retrieve info from libvirt regarding
  the capabilities of the host. from that api we can retrieve the set of
  supported machine types.
  we can then construct a dictionary of lower-case machine type names to
  correct-case machine type names.

  when booting a vm we should lowercase the user input and look up the
  correct case from this dictionary. this will allow nova to continue to
  treat the input as c

[Yahoo-eng-team] [Bug 1933517] Re: [RFE][OVN] Create an intermediate OVS bridge between VM and integration bridge to improve the live-migration process

2021-07-01 Thread sean mooney
** Also affects: os-vif
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1933517

Title:
  [RFE][OVN] Create an intermediate OVS bridge between VM and
  integration bridge to improve the live-migration process

Status in neutron:
  New
Status in os-vif:
  New

Bug description:
  When live migrating network sensitive VMs, the communication is
  broken.

  This is similar to [1] but in OVN the vif-plugged events are directly
  controlled by the Neutron server, not by the OVS/DHCP agents.

  The problem lies in when the destination chassis creates the needed OF
  rules for the destination VM port. Same as in OVS, the VM port is
  created when the instance is unpaused. At this moment the VM continues
  sending packets through the interface but OVN hasn't finished the
  configuration.

  Related BZs:
  - OSP16.1: https://bugzilla.redhat.com/show_bug.cgi?id=1903653
  - OSP16.1: https://bugzilla.redhat.com/show_bug.cgi?id=1872937
  - OSP16.1: https://bugzilla.redhat.com/show_bug.cgi?id=1966512

  [1]https://bugs.launchpad.net/neutron/+bug/1901707

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1933517/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1860312] Re: compute service failed to delete

2021-07-28 Thread sean mooney
Actually, the operator would be deleting the compute service after removing the
compute nodes.
You should remove the compute service first, but we should fix this regardless.

You should be able to recreate this bug by just creating a compute service
and then deleting it.
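
One possible shape of a fix, purely as a sketch with simplified names: the
service delete path should tolerate a compute service that has no compute node
records instead of letting ComputeHostNotFound escape as an HTTP 500.

from nova import exception
from nova import objects

def compute_nodes_for_service(context, service):
    try:
        return objects.ComputeNodeList.get_all_by_host(context, service.host)
    except exception.ComputeHostNotFound:
        # e.g. the service record was recreated on another node and the old
        # record never had (or no longer has) compute node rows.
        return []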

** Changed in: nova
   Status: Expired => Triaged

** Changed in: nova
   Importance: Undecided => Medium

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1860312

Title:
  compute service failed to delete

Status in OpenStack Compute (nova):
  Triaged

Bug description:
  Description
  ===
  I deployed openstack with openstack-helm on kubernetes. When one of the
  nova-compute services (driver=ironic, replica of the deployment is 1) breaks
  down, it may be scheduled to another node by kubernetes. When I try to delete
  the old compute service (status down), it fails.

  Steps to reproduce
  ==
  Firstly, openstack was deployed in a kubernetes cluster, and the replica
  count of the nova-compute-ironic deployment is 1.
  * I deleted the pod nova-compute-ironic-x
  * then wait for the new pod to start
  * then exec openstack compute service list, there will be two compute service 
for ironic, the status of the old one would be down.
  * then I try to delete the old compute service

  Expected result
  ===
  the old compute service could be deleted successfully

  Actual result
  =
  failed to delete, and returned an http 500

  Environment
  ===
  1. Exact version of OpenStack you are running. See the following
 18.2.2, rocky

  2. Which hypervisor did you use?
 Libvirt + KVM

  2. Which storage type did you use?
 ceph

  3. Which networking type did you use?
 Neutron with OpenVSwitch

  Logs & Configs
  ==
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi 
[req-922cc601-9aa1-4c3d-ad9c-71f73a341c28 40e7b8c3d59943e08a52acd24fe30652 
d13f1690c08d41ac854d720ea510a710 - default default] Unexpected exception in API 
method: ComputeHostNotFound: Compute host mgt-slave03 could not be found.
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi Traceback (most 
recent call last):
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/nova/api/openstack/wsgi.py",
 line 801, in wrapped
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi return f(*args, 
**kwargs)
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/nova/api/openstack/compute/services.py",
 line 252, in delete
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi context, 
service.host)
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/oslo_versionedobjects/base.py",
 line 184, in wrapper
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi result = fn(cls, 
context, *args, **kwargs)
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/nova/objects/compute_node.py",
 line 443, in get_all_by_host
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi 
use_slave=use_slave)
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py",
 line 213, in wrapper
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi return f(*args, 
**kwargs)
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/nova/objects/compute_node.py",
 line 438, in _db_compute_node_get_all_by_host
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi return 
db.compute_node_get_all_by_host(context, host)
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/nova/db/api.py", line 
291, in compute_node_get_all_by_host
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi return 
IMPL.compute_node_get_all_by_host(context, host)
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py",
 line 258, in wrapped
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi return f(context, 
*args, **kwargs)
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi   File 
"/var/lib/openstack/local/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py",
 line 659, in compute_node_get_all_by_host
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi raise 
exception.ComputeHostNotFound(host=host)
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi ComputeHostNotFound: 
Compute host mgt-slave03 could not be found.
  2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi 
  2020-01-20 06:44:53.480 1 

[Yahoo-eng-team] [Bug 1934742] Re: nova may leak net interface in guest if port under attaching/deleting

2021-08-04 Thread sean mooney
** Also affects: neutron
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1934742

Title:
  nova may leak net interface in guest if port under attaching/deleting

Status in neutron:
  New
Status in OpenStack Compute (nova):
  In Progress

Bug description:
  Description
  ===

  It seems that nova may leak a network interface in the guest
  if a port deletion is run in the middle of a port attachment.

  in the compute manager, attach_interface runs the following
  tasks atomically:
  -update port in neutron(Binding)
  -...
  -driver.attach_interface()
  -update net_info_cache
  -...

  When a bound port is deleted, nova receives a
  "network-vif-deleted" event and processes it by running
  _process_instance_vif_deleted_event(), which calls
  driver.detach_interface()

  if this event processing happens just after the port binding
  and before driver.attach_interface() of an
  ongoing interface attachment of the same port,
  nova will attach the deleted, orphan interface to the guest

  Probably, this event processing must be synchronized with the
  compute manager methods attach_interface/detach_interface.
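
  A minimal sketch of that idea (not the actual nova change), serializing the
  two code paths on a per-port lock with oslo.concurrency so the delete
  handler cannot run between the port binding and the libvirt attach:

    from oslo_concurrency import lockutils

    def attach_interface(port_id):
        with lockutils.lock('port-%s' % port_id):
            # bind the port in neutron, driver.attach_interface(),
            # update the instance network info cache
            print('attaching %s' % port_id)

    def process_vif_deleted_event(port_id):
        with lockutils.lock('port-%s' % port_id):
            # driver.detach_interface() only runs once any in-flight
            # attach of the same port has finished
            print('detaching %s' % port_id)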

  
  Steps to reproduce
  ==

  on master devstack:

  $openstack server create --flavor m1.small --image cirros-0.5.2-x86_64-disk \
  --nic net-id=private myvm
  $openstack port create --network private myport

  # For ease of reproduction add a pause just before
  driver.attach_interface():

  nova/compute/manager.py:
  def attach_interface(...):
      try:
          time.sleep(8)
          self.driver.attach_interface(context, ...)

  $sudo service devstack@n-cpu restart

  $openstack server add port myvm myport &
  $sleep 4 ; openstack port delete  myport
  [1]+  Exit 1  openstack server add port myvm myport
  Port id 3d47bceb-34ef-4002-8e33-30957127a87f could not be found. (HTTP 404) 
(Request-ID: req-6c056ad3-1e61-4102-9e5e-48cdd4dffc43)

  $ nova interface-list alex
  
++--+--+---+---+-+
  | Port State | Port ID  | Net ID  
 | IP addresses  | MAC Addr 
 | Tag |
  
++--+--+---+---+-+
  | ACTIVE | 0fe9365b-5747-4532-be50-e6362b10b645 | 
d8f03257-d1e2-4488-bc42-0e189481a6c7 | 
10.0.0.49,fde5:2b4:b028:0:f816:3eff:feb8:f14c | fa:16:3e:b8:f1:4c | -   |
  
++--+--+---+---+-+

  $ virsh domiflist instance-0001
   InterfaceType Source   ModelMAC
  
   tap0fe9365b-57   bridge   br-int   virtio   fa:16:3e:b8:f1:4c
   tapdcbbae72-0b   bridge   br-int   virtio   fa:16:3e:95:91:25

  
  Expected result
  ===
  interface should not be attached to guest

  Actual result
  =
  zombie interface is attached to guest

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1934742/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1940555] Re: Compute Component: Error: (pymysql.err.ProgrammingError) (1146, "Table 'nova_api.cell_mappings' doesn't exist")

2021-08-23 Thread sean mooney
** Also affects: nova
   Importance: Undecided
   Status: New

** Changed in: nova
   Status: New => Triaged

** Changed in: nova
   Importance: Undecided => Critical

** Changed in: nova
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1940555

Title:
  Compute Component: Error: (pymysql.err.ProgrammingError) (1146, "Table
  'nova_api.cell_mappings' doesn't exist")

Status in OpenStack Compute (nova):
  Triaged
Status in tripleo:
  Triaged

Bug description:
  https://logserver.rdoproject.org/openstack-component-
  compute/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-
  centos-8-standalone-compute-
  
master/7dac4e0/logs/undercloud/var/log/extra/podman/containers/nova_db_sync/stdout.log.txt.gz

  Is [api_database]/connection set in nova.conf?
  Is the cell0 database connection URL correct?
  Error: (pymysql.err.ProgrammingError) (1146, "Table 'nova_api.cell_mappings' 
doesn't exist")
  [SQL: SELECT cell_mappings.created_at AS cell_mappings_created_at, 
cell_mappings.updated_at AS cell_mappings_updated_at, cell_mappings.id AS 
cell_mappings_id, cell_mappings.uuid AS cell_mappings_uuid, cell_mappings.name 
AS cell_mappings_name, cell_mappings.transport_url AS 
cell_mappings_transport_url, cell_mappings.database_connection AS 
cell_mappings_database_connection, cell_mappings.disabled AS 
cell_mappings_disabled 
  FROM cell_mappings 
  WHERE cell_mappings.uuid = %(uuid_1)s 
   LIMIT %(param_1)s]
  [parameters: {'uuid_1': '----', 'param_1': 1}]
  (Background on this error at: http://sqlalche.me/e/14/f405)

  
  
https://logserver.rdoproject.org/openstack-component-compute/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-standalone-compute-master/7dac4e0/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

  + echo 'Running command: '\''/usr/bin/bootstrap_host_exec nova_conductor su 
nova -s /bin/bash -c '\''/usr/bin/nova-manage db sync '\'''\'''
  + exec /usr/bin/bootstrap_host_exec nova_conductor su nova -s /bin/bash -c 
''\''/usr/bin/nova-manage' db sync \'
  2021-08-19 08:17:33.982762 | fa163e06-c6d2-5dfd-0459-197e |  
FATAL | Create containers managed by Podman for 
/var/lib/tripleo-config/container-startup-config/step_3 | standalone | 
error={"changed": false, "msg": "Failed containers: nova_api_db_sync, 
nova_api_map_cell0, nova_api_ensure_default_cell, nova_db_sync"}
  2021-08-19 08:17:33.983320 | fa163e06-c6d2-5dfd-0459-197e | 
TIMING | tripleo_container_manage : Create containers managed by Podman for 
/var/lib/tripleo-config/container-startup-config/step_3 | standalone | 
0:19:23.159835 | 41.20s

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1940555/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1944111] Re: Missing __init__.py in nova/db/api

2021-09-20 Thread sean mooney
setting to critical since this blocks packaging of xena

** Also affects: nova/xena
   Importance: Critical
   Status: In Progress

** Also affects: nova/yoga
   Importance: Undecided
   Status: New

** Changed in: nova/yoga
   Status: New => In Progress

** Changed in: nova/yoga
   Importance: Undecided => Critical

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1944111

Title:
  Missing __init__.py in nova/db/api

Status in OpenStack Compute (nova):
  In Progress
Status in OpenStack Compute (nova) xena series:
  In Progress
Status in OpenStack Compute (nova) yoga series:
  In Progress

Bug description:
  Looks like nova/db/api is missing an __init__.py, which breaks *at
  least* my Debian packaging.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1944111/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1944083] Re: Nova assumptions about /32 routes to NS' break name resolution under DHCP

2021-09-21 Thread sean mooney
i'm not sure that nova is in control of this.
this seems like an issue more likely with dhcp?

i don't think nova actually sets /32 routes for the gateways itself.

** Also affects: neutron
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1944083

Title:
  Nova assumptions about /32 routes to NS' break name resolution under
  DHCP

Status in neutron:
  New
Status in OpenStack Compute (nova):
  New

Bug description:
  We run designate out of a private VLAN which is accessible via one of the two 
external networks in our wallaby cloud. In order to permit instances name 
resolution via those endpoints, we add a route to the subnet in that private 
VLAN via the 2nd router added to each network, the external network of which is 
our OutsidePrivate net (the External network which resides inside the DC, vs 
our OutsidePublic which is a VLAN to the actual WAN).
  Unfortunately, despite setting up this 2nd router and explicit route, we see 
nova instances coming up with an explicit /32 route to each DNS server 
specified _via the .1 gateway_ in the network which is the router to 
OutsidePrivate, and despite an explicit route to the /24 (i know CIDR works in 
smallest subnet preference) which should be understood to encapsulate the 3 IPs 
of the NS' themselves and prevent the /32 routes from being created.
  Even setting explicit /32 routes to each NS via the 2nd gateway @ .2 doesn't 
work - the original /32's via the .1 are still present, and the only fix we've 
found is to force nodes to static addressing and routing via cloud-init. ICMP 
redirect from the primary gateway to the secondary is hit-or-miss, and not how 
this should work anyway.
  I've not found anything in the docs about how these default routes via the 
primary gateway are set up, and have therefore found no way to disable them so 
filing this a bug since it's a major impediment to anyone resolving names via 
any gateway but the one set as the default gateway for the network.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1944083/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1945646] Re: Nova fails to live migrate instance with upper-case port MAC

2021-10-07 Thread sean mooney
adding neutron as i think neutron should also be normalising the mac
address that users provide and always storing it in lower case. a mac is
technically a number, not a string; we just use hex encoding for human
readability, so the casing does not matter, but it would be nice to at
least consider moving this normalisation to the neutron api/db to avoid
this problem.
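
a minimal sketch of that normalisation (not the actual nova/neutron code):
key every MAC-based lookup on a lower-cased value so '00:50:56:AF:E1:73'
and '00:50:56:af:e1:73' resolve to the same entry.

    def normalize_mac(mac):
        return mac.strip().lower()

    # dict built from the migrate_data vifs, keyed on the normalized MAC
    migrate_vif_by_mac = {normalize_mac('00:50:56:AF:E1:73'): {'profile': {}}}

    # MAC read back from the libvirt domain XML is lower case
    xml_mac = '00:50:56:af:e1:73'
    migrate_vif = migrate_vif_by_mac[normalize_mac(xml_mac)]  # no KeyError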

** Also affects: neutron
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1945646

Title:
  Nova fails to live migrate instance with upper-case port MAC

Status in neutron:
  New
Status in OpenStack Compute (nova):
  In Progress

Bug description:
  Description
  ===

  When a neutron port has its MAC address defined in upper case and libvirt stores
the MAC in the XML in lower case, migration fails with a KeyError:
  ```
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: 2021-09-30 
10:31:38.028 3054313 ERROR nova.virt.libvirt.driver 
[req-911a4b70-5448-48a1-afa4-1bbd0b38737b - - - - -] [instance: 75f7
  9d85-6505-486c-bc34-e78fd6350a77] Live Migration failure: '00:50:56:af:e1:73'
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: Traceback (most 
recent call last):
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]:   File 
"/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/eventlet/hubs/hub.py",
 line 461, in fire_timers
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: timer()
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]:   File 
"/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/eventlet/hubs/timer.py",
 line 59, in __call__
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: cb(*args, **kw)
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]:   File 
"/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/eventlet/event.py", 
line 175, in _do_send
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: 
waiter.switch(result)
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]:   File 
"/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/eventlet/greenthread.py",
 line 221, in main
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: result = 
function(*args, **kwargs)
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]:   File 
"/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/nova/utils.py", line 
661, in context_wrapper
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: return 
func(*args, **kwargs)
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]:   File 
"/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/nova/virt/libvirt/driver.py",
 line 9196, in _live_migration_operation
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: LOG.error("Live 
Migration failure: %s", e, instance=instance)
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]:   File 
"/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/oslo_utils/excutils.py",
 line 220, in __exit__
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: 
self.force_reraise()
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]:   File 
"/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/oslo_utils/excutils.py",
 line 196, in force_reraise
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: 
six.reraise(self.type_, self.value, self.tb)
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]:   File 
"/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/six.py", line 703, in 
reraise
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: raise value
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]:   File 
"/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/nova/virt/libvirt/driver.py",
 line 9152, in _live_migration_operation
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: new_xml_str = 
libvirt_migrate.get_updated_guest_xml(
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]:   File 
"/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/nova/virt/libvirt/migration.py",
 line 65, in get_updated_guest_xml
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: xml_doc = 
_update_vif_xml(xml_doc, migrate_data, get_vif_config)
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]:   File 
"/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/nova/virt/libvirt/migration.py",
 line 355, in _update_vif_xml
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: migrate_vif = 
migrate_vif_by_mac[mac_addr]
  Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: KeyError: 
'00:50:56:af:e1:73'
  ```

  Environment
  ===

  Ubuntu 20.04
  Libvirt 6.0.0-0ubuntu8.14
  Nova 22.2.3.dev2 (sha 4ce01d6c49f81b6b2438549b01a89ea1b5956320)
  Neutron with OpenVSwitch

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1945646/+subscriptions


-- 
Mailing lis

[Yahoo-eng-team] [Bug 1943969] Re: Unable to use shared security groups for VM creation

2021-11-15 Thread sean mooney
This is an RFE, not a bug.
This should be addressed via a specless blueprint as it is a new capability.

** Changed in: nova
   Status: In Progress => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1943969

Title:
  Unable to use shared security groups for VM creation

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Description
  ===
  Nova does not support shared security groups for new virtual machines. This
happens because Nova filters security groups by tenant ID here
https://github.com/openstack/nova/blob/master/nova/network/neutron.py#L813
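
  A minimal sketch, assuming python-neutronclient and an existing
  keystoneauth session `sess` (both assumptions): resolving the requested
  group directly by id lets neutron's RBAC decide visibility, whereas
  listing groups filtered by tenant_id hides groups shared into the
  project.

    from neutronclient.v2_0 import client as neutron_client

    neutron = neutron_client.Client(session=sess)

    def resolve_security_group(sg_id):
        # returns the group if the caller can see it, including groups
        # shared via RBAC; raises only if it is truly invisible
        return neutron.show_security_group(sg_id)['security_group']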

  Steps to reproduce
  ==

  * create two projects A and B
  * in project A create security group in Neutron
  * share the security group to project B via RBAC 
(https://docs.openstack.org/neutron/latest/admin/config-rbac.html#sharing-a-security-group-with-specific-projects)
  * try to create VM with this security group in project B

  Expected result
  ===

  The VM should be created if security group shared to this project.

  Actual result
  =

  The error in logs:

  Traceback (most recent call last):
File 
"/nova-base-source/nova-base-archive-stable-rocky-m3/nova/compute/manager.py", 
line 2510, in _build_resources
  yield resources
File 
"/nova-base-source/nova-base-archive-stable-rocky-m3/nova/compute/manager.py", 
line 2271, in _build_and_run_instance
  block_device_info=block_device_info)
File 
"/nova-base-source/nova-base-archive-stable-rocky-m3/nova/virt/vmwareapi/driver.py",
 line 505, in spawn
  admin_password, network_info, block_device_info)
File 
"/nova-base-source/nova-base-archive-stable-rocky-m3/nova/virt/vmwareapi/vmops.py",
 line 1175, in spawn
  vm_folder)
File 
"/nova-base-source/nova-base-archive-stable-rocky-m3/nova/virt/vmwareapi/vmops.py",
 line 342, in build_virtual_machine
  vm_name=vm_name)
File 
"/nova-base-source/nova-base-archive-stable-rocky-m3/nova/virt/vmwareapi/vmops.py",
 line 311, in _get_vm_config_spec
  network_info)
File 
"/nova-base-source/nova-base-archive-stable-rocky-m3/nova/virt/vmwareapi/vif.py",
 line 187, in get_vif_info
  for vif in network_info:
File 
"/nova-base-source/nova-base-archive-stable-rocky-m3/nova/network/model.py", 
line 585, in __iter__
  return self._sync_wrapper(fn, *args, **kwargs)
File 
"/nova-base-source/nova-base-archive-stable-rocky-m3/nova/network/model.py", 
line 576, in _sync_wrapper
  self.wait()
File 
"/nova-base-source/nova-base-archive-stable-rocky-m3/nova/network/model.py", 
line 608, in wait
  self[:] = self._gt.wait()
File 
"/var/lib/kolla/venv/lib/python2.7/site-packages/eventlet/greenthread.py", line 
175, in wait
  return self._exit_event.wait()
File "/var/lib/kolla/venv/lib/python2.7/site-packages/eventlet/event.py", 
line 125, in wait
  current.throw(*self._exc)
File 
"/var/lib/kolla/venv/lib/python2.7/site-packages/eventlet/greenthread.py", line 
214, in main
  result = function(*args, **kwargs)
File "/nova-base-source/nova-base-archive-stable-rocky-m3/nova/utils.py", 
line 828, in context_wrapper
  return func(*args, **kwargs)
File 
"/nova-base-source/nova-base-archive-stable-rocky-m3/nova/compute/manager.py", 
line 1656, in _allocate_network_async
  six.reraise(*exc_info)
File 
"/nova-base-source/nova-base-archive-stable-rocky-m3/nova/compute/manager.py", 
line 1639, in _allocate_network_async
  bind_host_id=bind_host_id)
File 
"/nova-base-source/nova-base-archive-stable-rocky-m3/nova/network/neutronv2/api.py",
 line 1043, in allocate_for_instance
  instance, neutron, security_groups)
File 
"/nova-base-source/nova-base-archive-stable-rocky-m3/nova/network/neutronv2/api.py",
 line 830, in _process_security_groups
  security_group_id=security_group)
  SecurityGroupNotFound: Security group 0c649378-1cf8-48e0-9eb4-b72772c35a62 
not found.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1943969/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1954427] Re: nova-ceph-multistore job fails permanently with: Cannot uninstall 'logutils'

2021-12-10 Thread sean mooney
** Also affects: devstack-plugin-ceph
   Importance: Undecided
   Status: New

** Changed in: devstack
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1954427

Title:
  nova-ceph-multistore job fails permanently with: Cannot uninstall
  'logutils'

Status in devstack:
  Invalid
Status in devstack-plugin-ceph:
  New
Status in OpenStack Compute (nova):
  New

Bug description:
  The last 4 run[2] failed with the same issue[1]:

  2021-12-10 10:42:59.793429 | controller |   Attempting uninstall:
  logutils

  2021-12-10 10:42:59.793490 | controller | Found existing
  installation: logutils 0.3.3

  2021-12-10 10:42:59.793500 | controller | ERROR: Cannot uninstall
  'logutils'. It is a distutils installed project and thus we cannot
  accurately determine which files belong to it which would lead to only
  a partial uninstall.

  2021-12-10 10:43:00.083297 | controller | + inc/python:pip_install:1
  :   exit_trap

  [1] 
https://zuul.opendev.org/t/openstack/build/722c6caf8e454849b897a43bcf617dd2/log/job-output.txt#9419
  [2] 
https://zuul.opendev.org/t/openstack/builds?job_name=nova-ceph-multistore&project=openstack/nova

To manage notifications about this bug go to:
https://bugs.launchpad.net/devstack/+bug/1954427/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1959682] Re: String concatenation TypeError in resize flavor helper

2022-02-01 Thread sean mooney
setting this to invalid since the bug is in tempest.
it is currently blocking the nova-next job, so it is a gate-blocker for nova
until this is fixed in tempest.

as there are already 2 patches up to fix this we expect it to be
resolved soon, so we will just close the nova part for now.

** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1959682

Title:
  String concatenation TypeError in resize flavor helper

Status in OpenStack Compute (nova):
  Invalid
Status in tempest:
  In Progress

Bug description:
  In cae966812, for certain resize tests, we started adding a numeric ID
  to the new flavor name to avoid collisions. This was incorrectly done
  as a string + int concatenation, which is raising a `TypeError: can
  only concatenate str (not "int") to str`.
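
  A minimal, generic illustration of the failure mode and the fix (names
  are illustrative, not the tempest code):

    base_name = 'tempest-flavor'
    suffix = 42                      # numeric id appended to avoid collisions

    # base_name + suffix  ->  TypeError: can only concatenate str (not "int") to str
    new_name = '%s-%d' % (base_name, suffix)   # or: f'{base_name}-{suffix}'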

  Example of this happening in nova-next job:
  
https://zuul.opendev.org/t/openstack/build/7f750faf22ec48219ddd072cfe6e02e1/logs

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1959682/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1960247] Re: server suspend action allows authorization by user_id while server resume action does not

2022-02-07 Thread sean mooney
ack i kind of agree with gmann here
gmann is correct that this does not align with the direction we are moving in
with our new policy/rbac work and that our intent was to eventually remove it
outside of keypairs.

the spec linked above clearly states what our intentions were and the
endpoints on which it could be used. as such I'm going to update this to
invalid but we can continue this conversation on the mailing list, irc
or in the nova team meeting.

** Changed in: nova
   Status: In Progress => Opinion

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1960247

Title:
  server suspend action allows authorization by user_id while server
  resume action does not

Status in OpenStack Compute (nova):
  Opinion

Bug description:
  Description
  ===
  Since the following change was merged, nova allows authorization by user_id 
for server suspend action.

  https://review.opendev.org/c/openstack/nova/+/353344

  However the same is not yet implemented in resume action and this
  results in inconsistent policy rule for corresponding two operations.

  Steps to reproduce
  ==
  * Define policy rules like the following example
    "os_compute_api:os-suspend-server:suspend": "rule:admin_api or 
user_id:%(user_id)s"
    "os_compute_api:os-suspend-server:resume": "rule:admin_api or 
user_id:%(user_id)s"
  * Create a server by a non-admin user
  * Suspend the server by the user
  * Resume the server by the user

  Expected result
  ===
  Both suspend and resume are accepted

  Actual result
  =
  Only suspend is accepted and resume fails with

  ERROR (Forbidden): Policy doesn't allow os_compute_api:os-suspend-
  server:suspend to be performed. (HTTP 403) (Request-ID: req-...)

  Environment
  ===
  This issue was initially reported as one found in stable/xena deployment.
   
http://lists.openstack.org/pipermail/openstack-discuss/2022-February/027078.html

  Logs & Configs
  ==
  N/A

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1960247/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1964149] [NEW] nova dns lookups can block the nova api process leading to 503 errors.

2022-03-08 Thread sean mooney
Public bug reported:

we currently have 4 possibly related downstream bugs whereby DNS lookups can
result in 503 errors because we do not monkey patch green DNS, and that can
result in blocking behavior.

specifically we have seen calls to socket.getaddrinfo in py-amqp block the API
when using ipv6.

https://bugzilla.redhat.com/show_bug.cgi?id=2037690
https://bugzilla.redhat.com/show_bug.cgi?id=2050867
https://bugzilla.redhat.com/show_bug.cgi?id=2051631
https://bugzilla.redhat.com/show_bug.cgi?id=2056504


copying a summary of the rca from one of the bugs:

What happens:

- A request comes in which requires rpc, so a new connection to rabbitmq
is to be established

- The hostname(s) from the transport_url setting are ultimately passed
to py-amqp, which attempts to resolve the hostname to an ip address so
it can set up the underlying socket and connect

- py-amqp explicitly tries to resolve with AF_INET first and then only
if that fails, then it tries with AF_INET6[1]

- The customer environment is primarily IPv6.  Attempting to resolve the
hostname via AF_INET fails nss_hosts (the /etc/hosts file only have IPv6
addrs), and falls through to nss_dns

- Something about the customer DNS infrastructure is slow, so it takes a
long time (~10 seconds) for this IPv4-lookup to fail.

- py-amqp finally tries with AF_INET6 and the hostname is resolved
immediately via nss_hosts because the entry is in the /etc/hosts


Critically, because nova explicitly disables greendns[2] with eventlet, the 
*entire* nova-api worker is blocked during the duration of the slow name 
resolution, because socket.getaddrinfo is a blocking call into glibc.

[1] 
https://github.com/celery/py-amqp/blob/1f599c7213b097df07d0afd7868072ff9febf4da/amqp/transport.py#L155-L208
[2] https://github.com/openstack/nova/blob/master/nova/monkey_patch.py#L25-L36


nova currently disables the greendns monkeypatch because of a very old bug seen
with centos 6, python 2.6 and the havana release of nova
https://bugs.launchpad.net/nova/+bug/1164822

ipv6 support was added to greendns in eventlet v0.17, the same release that
added python 3 support, back in 2015
https://github.com/eventlet/eventlet/issues/8#issuecomment-75490457

so we should not need to work around the lack of ipv6 support anymore.
https://review.opendev.org/c/openstack/nova/+/830966
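
a minimal sketch of what removing that workaround amounts to: if
EVENTLET_NO_GREENDNS is *not* set before eventlet is imported,
monkey_patch() installs the green resolver, so a slow lookup only blocks
the greenthread doing it rather than the whole api worker.

    import eventlet
    eventlet.monkey_patch()

    import socket
    try:
        # hostname is illustrative; with greendns this yields to other
        # greenthreads instead of blocking the process in glibc
        socket.getaddrinfo('rabbitmq.example.com', 5672)
    except socket.gaierror:
        pass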

** Affects: nova
 Importance: Medium
 Assignee: sean mooney (sean-k-mooney)
 Status: Triaged


** Tags: api yoga-rc-potential

** Changed in: nova
   Importance: Undecided => Medium

** Changed in: nova
   Status: New => Triaged

** Changed in: nova
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

** Tags added: api yoga-rc-potential

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1964149

Title:
  nova dns lookups can block the nova api process leading to 503 errors.

Status in OpenStack Compute (nova):
  Triaged

Bug description:
  we currently have 4 possibly related downstream bugs whereby DNS lookups can
  result in 503 errors because we do not monkey patch green DNS, and that can
  result in blocking behavior.

  specifically we have seen calls to socket.getaddrinfo in py-amqp block the API
  when using ipv6.

  https://bugzilla.redhat.com/show_bug.cgi?id=2037690
  https://bugzilla.redhat.com/show_bug.cgi?id=2050867
  https://bugzilla.redhat.com/show_bug.cgi?id=2051631
  https://bugzilla.redhat.com/show_bug.cgi?id=2056504

  
  copying a summary of the rca from one of the bugs:

  What happens:

  - A request comes in which requires rpc, so a new connection to
  rabbitmq is to be established

  - The hostname(s) from the transport_url setting are ultimately passed
  to py-amqp, which attempts to resolve the hostname to an ip address so
  it can set up the underlying socket and connect

  - py-amqp explicitly tries to resolve with AF_INET first and then only
  if that fails, then it tries with AF_INET6[1]

  - The customer environment is primarily IPv6.  Attempting to resolve
  the hostname via AF_INET fails nss_hosts (the /etc/hosts file only
  have IPv6 addrs), and falls through to nss_dns

  - Something about the customer DNS infrastructure is slow, so it takes
  a long time (~10 seconds) for this IPv4-lookup to fail.

  - py-amqp finally tries with AF_INET6 and the hostname is resolved
  immediately via nss_hosts because the entry is in the /etc/hosts

  
  Critically, because nova explicitly disables greendns[2] with eventlet, the 
*entire* nova-api worker is blocked during the duration of the slow name 
resolution, because socket.getaddrinfo is a blocking call into glibc.

  [1] 
https://github.com/celery/py-amqp/blob/1f599c7213b097df07d0afd7868072ff9febf4da/amqp/transport.py#L155-L208
  [2] https://github.com/openstack/nova/blob/master/nova/monkey_patch.py#L25-L36

  
  nova currently disables greendns monkeypatch because of a very old bug on 
centos 6 on pytho

[Yahoo-eng-team] [Bug 1801919] Re: brctl is obsolete use ip

2019-02-25 Thread sean mooney
it's not released yet but i'll be releasing os-vif today.
it looks like the rules for setting fix released are wrong.
it should only be set when we release that commit in a tagged release on pypi/
tarballs.openstack.org.

** Changed in: os-vif
   Status: Fix Released => Fix Committed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1801919

Title:
  brctl is obsolete  use ip

Status in devstack:
  In Progress
Status in neutron:
  Fix Released
Status in OpenStack Compute (nova):
  Confirmed
Status in os-vif:
  Fix Committed

Bug description:
  bridge-utils (brctl) is obsolete, no modern software should depend on it.
  Used in: neutron/agent/linux/bridge_lib.py

  http://man7.org/linux/man-pages/man8/brctl.8.html

  Please use `ip` for basic bridge operations,
  than we can drop one obsolete dependency..

To manage notifications about this bug go to:
https://bugs.launchpad.net/devstack/+bug/1801919/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1822575] [NEW] lower-constraints are not used in gate job

2019-04-01 Thread sean mooney
Public bug reported:

the lower constraints tox env attempts to run nova's unit tests with
the minimum supported software versions declared in nova's lower-constraints.txt

due to the way the install command is specified in the default tox env

install_command = pip install
-c{env:UPPER_CONSTRAINTS_FILE:https://git.openstack.org/cgit/openstack/requirements/plain
/upper-constraints.txt} {opts} {packages}

the upper-constraints.txt was also passed to pip.

pip's constraint solver takes the first definition of a constraint and
discards all redefinitions.

because upper-constraints.txt was included before lower-constraints.txt, the
lower constraints were ignored.

there are two patches proposed to fix this:
https://review.openstack.org/#/c/622972 and 
https://review.openstack.org/#/c/645392

we should merge one of them.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1822575

Title:
  lower-constraints are not used in gate job

Status in OpenStack Compute (nova):
  New

Bug description:
  the lower constraints tox env attempts to run nova's unit tests with
  the minimum supported software versions declared in nova's lower-constraints.txt

  due to the way the install command is specified in the default tox env

  install_command = pip install
  
-c{env:UPPER_CONSTRAINTS_FILE:https://git.openstack.org/cgit/openstack/requirements/plain
  /upper-constraints.txt} {opts} {packages}

  the upper-constraints.txt was also passed to pip.

  pip's constraint solver takes the first definition of a constraint and
  discards all redefinitions.

  because upper-constraints.txt was included before lower-constraints.txt, the
  lower constraints were ignored.

  there are two patches proposed to fix this:
  https://review.openstack.org/#/c/622972 and 
https://review.openstack.org/#/c/645392

  we should merge one of them.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1822575/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1821938] Re: No nova hypervisor can be enabled on workers with QAT devices

2019-04-02 Thread sean mooney
** Also affects: nova
   Importance: Undecided
   Status: New

** Changed in: nova
   Importance: Undecided => High

** Changed in: nova
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

** Changed in: nova
   Status: New => In Progress

** Tags added: stein-rc-potential

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1821938

Title:
  No nova hypervisor can be enabled on workers with QAT devices

Status in OpenStack Compute (nova):
  In Progress
Status in StarlingX:
  Triaged

Bug description:
  Brief Description
  -
  Unable to enable a host as a nova hypervisor because a pci device cannot be found
if the host has QAT devices (C62x or DH895XCC) configured.

  Severity
  
  Major

  
  Steps to Reproduce
  --
  - Install and configure a system where worker nodes have QAT devices 
configured. e.g.,
  [wrsroot@controller-0 ~(keystone_admin)]$ system host-device-list compute-0
  
+--+--+--+---+---+---+-++---+-+
  | name | address | class id | vendor id | device id | class name | vendor 
name | device name | numa_node | enabled |
  
+--+--+--+---+---+---+-++---+-+
  | pci__09_00_0 | :09:00.0 | 0b4000 | 8086 | 0435 | Co-processor | 
Intel Corporation | DH895XCC Series QAT | 0 | True |
  | pci__0c_00_0 | :0c:00.0 | 03 | 102b | 0522 | VGA compatible 
controller | Matrox Electronics Systems Ltd. | MGA G200e [Pilot] ServerEngines 
(SEP1) | 0 | True |
  
+--+--+--+---+---+---+-++---+-+

  compute-0:~$ lspci | grep QAT
  09:00.0 Co-processor: Intel Corporation DH895XCC Series QAT
  09:01.0 Co-processor: Intel Corporation DH895XCC Series QAT Virtual Function
  09:01.1 Co-processor: Intel Corporation DH895XCC Series QAT Virtual Function
  ...

  - check nova hypervisor-list

  Expected Behavior
  --
  - Nova hypervisors exist on system

  Actual Behavior
  
  [wrsroot@controller-0 ~(keystone_admin)]$ nova hypervisor-list
  ++-+---++
  | ID | Hypervisor hostname | State | Status |
  ++-+---++
  ++-+---++

  
  Reproducibility
  ---
  Reproducible

  System Configuration
  
  Any system type with QAT devices configured on worker node

  Branch/Pull Time/Commit
  ---
  master as of 2019-03-18

  Last Pass
  --
  on f/stein branch in early feb

  Timestamp/Logs
  --
  # nova-compute pods are spewing errors so they can't register themselves 
properly as hypervisors:
  2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager 
[req-4f652d4c-da7e-4516-9baa-915265c3fdda - - - - -] Error updating resources 
for node compute-0.: PciDeviceNotFoundById: PCI device :09:02.3 not found
  2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager Traceback (most 
recent call last):
  2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File 
"/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 
7956, in _update_available_resource_for_node
  2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager startup=startup)
  2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File 
"/var/lib/openstack/lib/python2.7/site-packages/nova/compute/resource_tracker.py",
 line 727, in update_available_resource
  2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager resources = 
self.driver.get_available_resource(nodename)
  2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File 
"/var/lib/openstack/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", 
line 7098, in get_available_resource
  2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager 
self._get_pci_passthrough_devices()
  2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File 
"/var/lib/openstack/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", 
line 6102, in _get_pci_passthrough_devices
  2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager 
pci_info.append(self._get_pcidev_info(name))
  2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File 
"/var/lib/openstack/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", 
line 6062, in _get_pcidev_info
  2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager 

[Yahoo-eng-team] [Bug 1829161] [NEW] Could not install packages due to an EnvironmentError: HTTPSConnectionPool(host='git.openstack.org', port=443)

2019-05-15 Thread sean mooney
Public bug reported:

The tempest jobs have started to periodically fail with

Could not install packages due to an EnvironmentError:
HTTPSConnectionPool(host='git.openstack.org', port=443)

starting on may 6th
http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22Could%20not%20install%20packages%20due%20to%20an%20EnvironmentError:%20HTTPSConnectionPool(host%3D'git.openstack.org',%20port%3D443)%5C%22

based on the logstash results this has been hit ~330 times in the last 7 days.
this appears to trigger more frequently on the grenade jobs but also affects
others.

this looks like an infra issue, likely related to the redirects not working in
all cases.
this is a tracking bug until the issue is resolved.

** Affects: nova
 Importance: Critical
 Status: Triaged


** Tags: gate-failure

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1829161

Title:
  Could not install packages due to an EnvironmentError:
  HTTPSConnectionPool(host='git.openstack.org', port=443)

Status in OpenStack Compute (nova):
  Triaged

Bug description:
  The tempest jobs have started to periodically fail with

  Could not install packages due to an EnvironmentError:
  HTTPSConnectionPool(host='git.openstack.org', port=443)

  starting on may 6th
  
http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22Could%20not%20install%20packages%20due%20to%20an%20EnvironmentError:%20HTTPSConnectionPool(host%3D'git.openstack.org',%20port%3D443)%5C%22

  based on the logstash results this has been hit ~330 times in the last 7 days.
  this appears to trigger more frequently on the grenade jobs but also affects
others.

  this looks like an infra issue, likely related to the redirects not working in
all cases.
  this is a tracking bug until the issue is resolved.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1829161/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1821089] Re: assign PCI slot for VM's NIC persistently

2019-05-27 Thread sean mooney
Stable device naming within the guest is OS dependent and strictly out of scope
for nova to fix.
nova does not choose the address at which devices are attached, and the nova api
does not guarantee stable nic ordering. the vm pci address is determined by libvirt.

the device role tagging feature was developed for this use case specifically, so
that vms can determine the mapping between the devices that are exposed to the
guest and the openstack resources they correspond to in a hypervisor- and
os-independent way.
https://specs.openstack.org/openstack/nova-specs/specs/mitaka/approved/virt-device-role-tagging.html
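
a minimal sketch, assuming the standard metadata service layout for tagged
devices: inside the guest you map the tag given at attach time to the MAC of
the device, independent of whatever pci slot the hypervisor picked.

    import json
    import urllib.request

    URL = 'http://169.254.169.254/openstack/latest/meta_data.json'
    meta = json.load(urllib.request.urlopen(URL))

    for dev in meta.get('devices', []):
        if dev.get('type') == 'nic':
            print(dev.get('tags'), dev.get('mac'))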

** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1821089

Title:
  assign PCI slot for VM's NIC persistently

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Nova doesn't care about PCI slot number where virtual NIC is attached.
  As a result guests (recent Ubuntu for example) in which NIC name depends on 
PCI slot number rename interfaces in circumstances described below:

  1. Launch VM using Ubuntu cloud image with 1 interface.

  Name of the interface will be like "ens3"

  $ lspci
  00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
  00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
  00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
  00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton 
II] (rev 01)
  00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
  00:02.0 VGA compatible controller: Cirrus Logic GD 5446
  00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
  00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device
  00:05.0 SCSI storage controller: Red Hat, Inc Virtio block device
  00:06.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon

  2. Attach more interfaces (nova interface-attach).

  Attached interfaces will get names like "ens7"

  $ lspci
  00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
  00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
  00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
  00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton 
II] (rev 01)
  00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
  00:02.0 VGA compatible controller: Cirrus Logic GD 5446
  00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
  00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device
  00:05.0 SCSI storage controller: Red Hat, Inc Virtio block device
  00:06.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon
  00:07.0 Ethernet controller: Red Hat, Inc Virtio network device

  3. Do "nova reboot --hard" for this VM (this action regenerates XML in
  Libvirt).

  Interfaces "ens7" will be renamed to "ens4" since Libvirt XML for this
  VM will be recreated.

  lspci
  00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
  00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
  00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
  00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton 
II] (rev 01)
  00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
  00:02.0 VGA compatible controller: Cirrus Logic GD 5446
  00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
  00:04.0 Ethernet controller: Red Hat, Inc Virtio network device
  00:05.0 SCSI storage controller: Red Hat, Inc Virtio block device
  00:06.0 SCSI storage controller: Red Hat, Inc Virtio block device
  00:07.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon

  
  4. Compare names of interfaces after step 2 and step 3.

  The same happens after interfaces are detached:
  For example, if the VM has ens3, ens4 and ens5 and ens4 is detached, then ens5
will be renamed on hard reboot.

  Ideally I would expect from Nova to assign PCI slot number to attached
  devices and keep this assignment in XML in
  /var/lib/nova/instances//libvirt.xml

  OpenStack version: Newton (newer versions also affected)
  hypervisor: Libvirt+KVM
  networking type: Neutron with OpenVSwitch

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1821089/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1831723] [NEW] The flavor hide_hypervisor_id value can be overridden by the image img_hide_hypervisor_id

2019-06-05 Thread sean mooney
Public bug reported:

During the implementation of enabling hypervisor hiding for windows guests
it became apparent that a latent bug exists that allows non-privileged users
to override the policy set by the admin in the flavor by uploading a custom
image.

by convention, back in the havana/icehouse days, we used to allow the flavor to
take precedence over the image if there was a conflict and log a warning. sometime
around liberty/mitaka we decided that was a bad user experience for end users, as
they did not receive what they asked for, and we started to convert all conflicts
into a hard error. The only case where we intentionally allow the image to take
precedence over the flavor is hw:mem_page_size, where it is allowed if and only if
the admin has set hw:mem_page_size to large or any explicitly. in other words,
unless the admin has opted in to allowing the image to take precedence, by not
setting a value in the flavor or setting it to a value that allows the image to
refine the choice, we do not support the image overriding the flavor.


the current code does exactly that by the use of a logical or

    flavor_hide_kvm = strutils.bool_from_string(
        flavor.get('extra_specs', {}).get('hide_hypervisor_id'))
    if (virt_type in ("qemu", "kvm") and
            (image_meta.properties.get('img_hide_hypervisor_id') or
             flavor_hide_kvm)):

and the new code

    hide_hypervisor_id = (strutils.bool_from_string(
        flavor.extra_specs.get('hide_hypervisor_id')) or
        image_meta.properties.get('img_hide_hypervisor_id'))

exhibits the same behavior.

in both cases, if img_hide_hypervisor_id=true and hide_hypervisor_id=false,
hypervisor hiding will be enabled.

in this specific case the side-effects of this are safe, but it may not be in all
cases of this pattern.
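
a minimal sketch of a conflict-rejecting alternative (not the nova code):
when the flavor sets the value explicitly, treat it as authoritative and
reject a conflicting image property instead of OR-ing the two together.

    from oslo_utils import strutils

    def hide_hypervisor_id(flavor_extra_specs, image_props):
        flavor_val = flavor_extra_specs.get('hide_hypervisor_id')
        image_val = image_props.get('img_hide_hypervisor_id')
        if flavor_val is not None:
            flavor_bool = strutils.bool_from_string(flavor_val)
            if image_val is not None and bool(image_val) != flavor_bool:
                # illustrative error type; nova would raise its own exception
                raise ValueError('image and flavor disagree on hiding the '
                                 'hypervisor id')
            return flavor_bool
        return bool(image_val)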

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: libvirt

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1831723

Title:
  The flavor hide_hypervisor_id value can be overridden by the image
  img_hide_hypervisor_id

Status in OpenStack Compute (nova):
  New

Bug description:
  During the implementation of enabling hypervisor hiding for windows guests
  it became apparent that a latent bug exists that allows non-privileged users
  to override the policy set by the admin in the flavor by uploading a custom
image.

  by convention, back in the havana/icehouse days, we used to allow the flavor to
take precedence over the image if there was a conflict and log a warning. sometime
around liberty/mitaka we decided that was a bad user experience for end users, as
they did not receive what they asked for, and we started to convert all conflicts
into a hard error. The only case where we intentionally allow the image to take
precedence over the flavor is hw:mem_page_size, where it is allowed if and only if
the admin has set hw:mem_page_size to large or any explicitly. in other words,
unless the admin has opted in to allowing the image to take precedence, by not
setting a value in the flavor or setting it to a value that allows the image to
refine the choice, we do not support the image overriding the flavor.


  the current code does exactly that by the use of a logical or

    flavor_hide_kvm = strutils.bool_from_string(
        flavor.get('extra_specs', {}).get('hide_hypervisor_id'))
    if (virt_type in ("qemu", "kvm") and
            (image_meta.properties.get('img_hide_hypervisor_id') or
             flavor_hide_kvm)):

  and the new code

    hide_hypervisor_id = (strutils.bool_from_string(
        flavor.extra_specs.get('hide_hypervisor_id')) or
        image_meta.properties.get('img_hide_hypervisor_id'))

  exhibits the same behavior.

  in both cases, if img_hide_hypervisor_id=true and hide_hypervisor_id=false,
  hypervisor hiding will be enabled.

  in this specific case the side-effects of this are safe but it may not be in 
all
  cases of this pattern.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1831723/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1831886] [NEW] default pcie hotplug behavior changes when using q35

2019-06-06 Thread sean mooney
Public bug reported:

The q35 machine type supports native pcie hotplug instead of legacy acpi
based pci hotplug. This has several advantages and one major disadvantage:
with the new pcie approach you need to pre-allocate the pcie slots so that
they will be available for use with hotplug if needed.

to support this a new num_pcie_ports config option was added to the libvirt 
section of
the nova.conf

https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.num_pcie_ports

the default value of this config is 0, which means we use libvirt's default.
libvirt's default is to allocate 1 free pcie port; as a result, by default
you cannot attach more than 1 device without hard rebooting the vm.

previously, when using the pc machine type with the i440fx chipset, it was
possible to attach multiple interfaces or volumes. as a result the end user
behavior has changed, as observed by the failure in
tempest.api.compute.servers.test_attach_interfaces.AttachInterfacesTestJSON.test_create_list_show_delete_interfaces_by_network_port
with the default setting and q35 enabled, as reported in this downstream bug
https://bugzilla.redhat.com/show_bug.cgi?id=1716356

to fix this the suggestion is to set the max value and default value of the
num_pcie_ports config option to 32

based on some minimal local testing, the memory usage of this change is ~0.4MB
per port, or ~12.5 MB per vm in additional qemu overhead. this is based on testing
done with libvirt directly with memory preallocation enabled:
for a 2G guest with the pc machine type and i440fx chipset, a total memory of 2036
MB was observed;
q35 with 4 ports (the default value that will be calculated by libvirt for the
default devices) increased this to 2056 MB, and q35 with 32 ports to 2066 MB.

as such this is a minimal overhead increase which can still be controlled
by setting the config to a lower value explicitly.

** Affects: nova
 Importance: Low
 Assignee: Kashyap Chamarthy (kashyapc)
 Status: In Progress


** Tags: libvirt

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1831886

Title:
  default pcie hotplug behavior changes when using q35

Status in OpenStack Compute (nova):
  In Progress

Bug description:
  The q35 machine type supports native pcie hotplug instead of legacy acpi
  based pci hotplug. This has several advantages and one major disadvantage:
  with the new pcie approach you need to pre-allocate the pcie slots so that
  they will be available for use with hotplug if needed.

  to support this a new num_pcie_ports config option was added to the libvirt 
section of
  the nova.conf

  
https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.num_pcie_ports

  the default value of this config is 0, which means we use libvirt's default.
  libvirt's default is to allocate 1 free pcie port; as a result, by default
  you cannot attach more than 1 device without hard rebooting the vm.

  previously, when using the pc machine type with the i440fx chipset, it was
possible to attach multiple interfaces or volumes. as a result the end user
behavior has changed, as observed by the failure in
tempest.api.compute.servers.test_attach_interfaces.AttachInterfacesTestJSON.test_create_list_show_delete_interfaces_by_network_port
with the default setting and q35 enabled, as reported in this downstream bug
https://bugzilla.redhat.com/show_bug.cgi?id=1716356

  to fix this the suggestion is to set the max value and default value of the
  num_pcie_ports config option to 32

  based on some minimal local testing, the memory usage of this change is ~0.4MB
per port, or ~12.5 MB per vm in additional qemu overhead. this is based on testing
done with libvirt directly with memory preallocation enabled:
  for a 2G guest with the pc machine type and i440fx chipset, a total memory of
2036 MB was observed;
  q35 with 4 ports (the default value that will be calculated by libvirt for the
default devices) increased this to 2056 MB, and q35 with 32 ports to 2066 MB.

  as such this is a minimal overhead increase which can still be
  controlled by setting the config to a lower value explicitly.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1831886/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1832169] Re: device_type of PCI alias config could be mismatched

2019-06-13 Thread sean mooney
the device_type is optional but if set it will be checked

https://github.com/openstack/nova/blob/51e3787bf89f19af8a9d37288a63731563c92fca/nova/pci/request.py#L136-L138

type-PCI is not intended for use with devices that are capable of
sriov and exists primarily for use with pci devices that are not nics.


type-PCI is reserved for devices that will be passed through via a pci alias in
the flavor and that should not be requestable via neutron based sriov ports.

it is generally used for gpus, crypto cards like intel QAT devices or
nics that are not managed by neutron and do not support sriov.

type-PF is used for devices that will be requested using neutron
vnic_type=direct-physical,
and
type-VF is used for devices that will be requested using neutron
vnic_type=direct.

type-PF and type-VF may also be used for non nic devices but in that case the
physical_network tag must
not be set in the pci whitelist.

when we process a neutron port we translate from the port vnic_type to the
correct device_type here:
https://github.com/openstack/nova/blob/212607dc6feaf311ba92295fd07363b3ee9ae010/nova/network/neutronv2/api.py#L2046-L2060

when enumerating the devices in the libvirt virt driver here
https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/driver.py#L6076-L6113

we interpret the device capabilities to determine which type to
use when reporting the device.

depending on the nic, firmware and driver options, the presence of the
virtual_function capability in the
pci capabilities reported by libvirt can change.

that is to say, on older generation intel niantic nics such as the intel 82599
series,
the presence of the virtual_function capability was conditional on whether data center
bridging
was enabled in the firmware.

in data center bridging mode sriov was disabled to allow VMDQ to be used, so
even with the same vendor and product id
the device type can change.


when a device supports sriov and is listed as a PF there are also additional
checks that the scheduler and pci resource tracker must perform to determine that
a PF is available for assignment to a vm. The most important being that the pci resource
tracker must first confirm that the PF either has no VFs or that all of its VFs are free.

For type-PCI we do not have to do that check as we know the device does not
support sriov and therefore will not have VFs that could be in use.
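
to make the distinction concrete, a sketch of how the different device_type values
tend to show up in nova.conf (the pci address, vendor/product ids and physnet name
below are illustrative only, not taken from this bug):

  [pci]
  # a non-nic passthrough device (e.g. a QAT card) requested via a flavor alias
  alias = { "vendor_id":"8086", "product_id":"0443", "device_type":"type-PCI", "name":"qat" }
  # a neutron managed sriov nic; physical_network must be set so the PF/VFs can
  # back vnic_type=direct-physical/direct ports
  passthrough_whitelist = { "address": "0000:41:00.0", "physical_network": "physnet1" }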



** Changed in: nova
   Importance: Undecided => Wishlist

** Changed in: nova
   Status: New => Opinion

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1832169

Title:
  device_type of PCI alias config could be mismatched

Status in OpenStack Compute (nova):
  Opinion

Bug description:
  Currently, to use PCI passthrough functionality admin should specify
  the alias of PCI devices and the format is like below

  alias = { "vendor_id":"8086", "product_id":"1528", "device_type
  ":"type-PCI", "name":"nic" }

  What I find confusing about this configuration is that there is just
  one "device_type" for the device. I assume that device_type is not
  needed for the device to be identified, since libvirt already determines the
  device_type for a given device.

  IOW, I suspect it never happens like below.
  alias = { "vendor_id":"8086", "product_id":"1528", "device_type":"type-PCI", 
"name":"nic" }
  alias = { "vendor_id":"8086", "product_id":"1528", "device_type":"type-PF", 
"name":"nic" }

  I strongly believe the PCI device with the 8086:1528 ID already has a unique
  device_type set, though I'm not 100% sure.

  So my point is that it would be better to delete the device_type attribute from the
  config so that the admin does not have to care about the device type. I think it is a
  big barrier to using the PCI passthrough functionality for anyone who is not
  familiar with the concept.

  Thanks.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1832169/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1824048] Re: SRIOV pci_numa_policy doesn't work when creating an instance with 'cpu_policy' and 'num_nodes'

2019-06-13 Thread sean mooney
*** This bug is a duplicate of bug 1805891 ***
https://bugs.launchpad.net/bugs/1805891

** This bug has been marked a duplicate of bug 1805891
   pci numa polices are not followed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1824048

Title:
  SRIOV pci_numa_policy doesn't work when creating an instance with
  'cpu_policy' and 'num_nodes'

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===
  When we create a sriov instance whose flavor has the 'cpu_policy' and 'numa_nodes'
properties, it also has the 'pci_numa_policy=preferred' property, which indicates we
are able to allocate the pci devices and vcpus on different numa nodes.
  However, in some cases it does not work, because the fake pci request, which is
produced from the nova flavor and the [pci] alias in nova.conf, has not written the related
information (such as pci_numa_policy, alias_name and some spec info) into the real
pci_requests (which contain the port_id). So, in the nova/pci/stats.py function 'def
_filter_pools_for_numa_cells', all pci devices get filtered out.

  Environment
  ===
  OpenStack Queens
  compute node information: Two numa nodes(node-0 node-1), SRIOV-PCI devices 
associated with NUMA node-1, but cpus of node-1 have run out.

  
  Steps to reproduce
  ==
  nova.conf
  [pci]
  alias = {"name": "QuickAssist","product_id": "10ed","vendor_id": 
"8086","device_type": "type-VF","numa_policy": "preferred"}

  nova flavor
  
  +----------------------------+--------------------------------------------------------------------+
  | Property                   | Value                                                              |
  +----------------------------+--------------------------------------------------------------------+
  | OS-FLV-DISABLED:disabled   | False                                                              |
  | OS-FLV-EXT-DATA:ephemeral  | 0                                                                  |
  | disk                       | 20                                                                 |
  | extra_specs                | {"hw:pci_numa_policy": "preferred", "hw:cpu_policy": "dedicated",  |
  |                            |  "hw:numa_nodes": "1", "hw:cpu_cores": "4",                        |
  |                            |  "pci_passthrough:alias": "QuickAssist:1"}                         |
  | id                         | 430e1afd-a72b-41c6-b9b2-ea9b6aa9f037                               |
  | name                       | multiqueue                                                         |
  | os-flavor-access:is_public | True                                                               |
  | ram                        | 2048                                                               |
  | rxtx_factor                | 1.0                                                                |
  | swap                       |                                                                    |
  | vcpus                      | 4                                                                  |
  +----------------------------+--------------------------------------------------------------------+

  neutron port: one or some 'direct' ports;


  Expected result
  ===
  The instance coul

[Yahoo-eng-team] [Bug 1805891] Re: pci numa polices are not followed

2019-06-13 Thread sean mooney
as this feature never worked on rocky and queens i am marking it as won't
fix, as it would be effectively a feature backport, based on matt's comment
here https://review.opendev.org/#/c/641653/1//COMMIT_MSG@13

** Also affects: nova/rocky
   Importance: Undecided
   Status: New

** Also affects: nova/queens
   Importance: Undecided
   Status: New

** Also affects: nova/stein
   Importance: Undecided
   Status: New

** Changed in: nova/stein
   Status: New => Fix Released

** Changed in: nova/rocky
   Status: New => Won't Fix

** Changed in: nova/queens
   Status: New => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1805891

Title:
  pci numa polices are not followed

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  Won't Fix
Status in OpenStack Compute (nova) rocky series:
  Won't Fix
Status in OpenStack Compute (nova) stein series:
  Fix Released

Bug description:
  Description
  ===
  
https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/share-pci-between-numa-nodes.html
  introduced the concept of numa affinity policies for pci passthrough devices.

  upon testing it was observed that the preferred policy is broken.

  for context, there is a separate bug to track the lack of support for
neutron sriov interfaces,
  https://bugs.launchpad.net/nova/+bug/1795920, so the scope of this bug is
limited to
  pci numa policies for passthrough devices requested via a flavor alias.

  
  background
  --

  by default in nova pci devices are numa affinitized using the legacy policy,
  but you can override this behavior via the alias. when set to preferred, nova
  should fall back to no numa affinity between the guest and the pci device
  if a device on a local numa node is not available.

  the policies are described below.

  legacy

  This is the default value and it describes the current nova
  behavior. Usually we have information about association of PCI devices
  with NUMA nodes. However, some PCI devices do not provide such
  information. The legacy value will mean that nova will boot instances
  with PCI device if either:

  The PCI device is associated with at least one NUMA nodes on which 
the instance will be booted
  There is no information about PCI-NUMA affinity available


  preferred

  This value will mean that nova-scheduler will choose a compute
  host with minimal consideration for the NUMA affinity of PCI devices.
  nova-compute will attempt a best effort selection of PCI devices based
  on NUMA affinity, however, if this is not possible then nova-compute
  will fall back to scheduling on a NUMA node that is not associated
  with the PCI device.

  Note that even though the NUMATopologyFilter will not consider
  NUMA affinity, the weigher proposed in the Reserve NUMA Nodes with PCI
  Devices Attached spec [2] can be used to maximize the chance that a
  chosen host will have NUMA-affinitized PCI devices.


  Steps to reproduce
  ==

  the test case was relatively simple:

  - deploy a single node devstack install on a host with 2 numa nodes.
  - enable the pci and numa topology filters
  - whitelist a pci device attached to numa_node 0
    e.g. passthrough_whitelist = { "address": ":01:00.1" }
  - adjust the vcpu_pin_set to only list the cpus on numa_node 1
    e.g. vcpu_pin_set=8-15
  - create an alias in the pci section of the nova.conf
    alias = { "vendor_id":"8086", "product_id":"10c9", "device_type":"type-PF",
"name":"nic-pf", "numa_policy": "preferred"}
  - restart the nova services
    sudo systemctl restart devstack@n-*

  - update a flavor with the alias and a numa topology of 1
    openstack flavor set --property pci_passthrough:alias='nic-pf:1' 42
    openstack flavor set --property hw:numa_nodes=1 42

  
  
  +----------------------------+------------------------------------------------------+
  | Field                      | Value                                                |
  +----------------------------+------------------------------------------------------+
  | OS-FLV-DISABLED:disabled   | False                                                |
  | OS-FLV-EXT-DATA:ephemeral  | 0                                                    |
  | access_project_ids         | None                                                 |
  | disk                       | 0                                                    |
  | id                         | 42                                                   |
  | name                       | m1.nano                                              |
  | os-flavor-access:is_public | True                                                 |
  | properties                 | hw:numa_nodes='1', pci_passthrough:alias='nic-pf:1'  |
  | ram                        | 64

[Yahoo-eng-team] [Bug 1802973] Re: Failed to create VM with no IP assigned to SR-IOV port

2019-06-13 Thread sean mooney
Nova does not currently have support for neutron ports without an ip.

when support was added for the neutron port ip_allocation policies, only support
for immediate and deferred was implemented.

i believe work is planned to add support for addressless ports in train but
i am closing this as invalid as it has never been supported.
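
for what it is worth, the supported path today is a port whose ip_allocation is
immediate (the default) or deferred; a minimal sketch of the immediate case, reusing
the names from the report, would be:

  openstack port create --network sriov-net --vnic-type direct sriov-port
  openstack server create --image cloud.img --flavor your_flavor --key-name ssh-key --port sriov-port vm_name

i.e. the same steps as in the report but without --no-fixed-ip.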

** Tags added: neutr

** Tags removed: neutr
** Tags added: libvirt neutron

** Changed in: nova
   Importance: Undecided => Wishlist

** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1802973

Title:
  Failed to create VM with no IP assigned to SR-IOV port

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Description
  ===
  Failed to create an instance because of a port attached with no IP address 
assigned. The port is created over a flat network mapped to a SR-IOV interface.

  Steps to reproduce
  ==
  A chronological list of steps which will bring off the
  issue you noticed:

  1. Create network

  openstack network create --provider-physical-network physnet1
  --provider-network-type flat sriov-net

  
  2. Create port 

  openstack port create --network  --vnic-type macvtap (direct)
  --no-fixed-ip sriov-port

  
  3. Create 

  
  openstack server create --image cloud.img --flavor your_flavor --key-name 
ssh-key --port sriov-port vm_name


  Expected result
  ===
  The instance should start with a layer 2 interface configured over a sr-iov 
virtual function.

  Actual result
  =
  Port 6dca94cb-1ed5-4131-bc69-4736db5f9f18 requires a FixedIP in order to be 
used. (HTTP 400)

  Environment
  ===
  1. Openstack Rocky managed via Juju and deployed using MAAS.


  2. Which hypervisor did you use?
 Libvirt + KVM


  2. Which storage type did you use?
 Ceph

  
  3. Which networking type did you use?
 Neutron

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1802973/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1768917] Re: PCI-Passthrough documentation is incorrect while trying to pass through a NIC

2019-06-13 Thread sean mooney
you are conflating two different things.

aliases are not used for neutron based sriov networking,

and nics that are passed through via a flavor alias are not managed by neutron
or the sriov nic agent.

the documentation in
https://docs.openstack.org/nova/pike/admin/pci-passthrough.html

describes how to do generic pci passthrough of a host pci device, not neutron
sriov direct-physical passthrough.

to give a PF to a vm that is managed by neutron you create a port with
vnic_type=direct-physical.

in that scenario, when whitelisting the nic, you also need to add the
physical_network to the whitelist, as sketched below.

the flavor and alias based approach described in
https://docs.openstack.org/nova/pike/admin/pci-passthrough.html

is intended for passing through devices like gpus or accelerator cards like
intel qat devices.

the docs use the vendor and product ids of an intel niantic nic simply because that
is what we tested
this functionality with when it was implemented, but we could have used a QAT
device in the example, which would
not work with neutron sriov.

 | [pci]
| alias = '{
|   "name": "QuickAssist",
|   "product_id": "0443",
|   "vendor_id": "8086",
|   "device_type": "type-PCI",
|   "numa_policy": "legacy"
|   }'



** Changed in: nova
   Importance: Undecided => Low

** Changed in: nova
   Status: New => Invalid

** Tags added: docs

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1768917

Title:
  PCI-Passthrough documentation is incorrect while trying to pass
  through a NIC

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  As per the documentation shown below

  https://docs.openstack.org/nova/pike/admin/pci-passthrough.html

  In order to achieve PCI passthrough of a network device, it states
  that we should create a 'flavor' based on the alias and then associate
  a flavor to the server create function.

  Steps to follow:

  Create an Alias:
  [pci]
  alias = { "vendor_id":"8086", "product_id":"154d", "device_type":"type-PF", 
"name":"a1" }

  Create a Flavor:
  [pci]
  alias = { "vendor_id":"8086", "product_id":"154d", "device_type":"type-PF", 
"name":"a1" }

  Add a whitelist:
  [pci]
  passthrough_whitelist = { "address": ":41:00.0" }

  Create a Server with the Flavor:

  # openstack server create --flavor m1.large --image
  cirros-0.3.5-x86_64-uec --wait test-pci

  
  With the above command, the VM creation errors out and we see a 
PortBindingFailure.

  The reason for the PortBindingFailure is the 'vif_type' is always set
  to 'BINDING_FAILED".

  The reason being, the flavor does not mention the
  'vnic_type'='direct-physical'; without this information the sriov mechanism
  driver is not able to bind the port.

  Not sure if there is any way to specify the info in the flavor.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1768917/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1768919] Re: PCI-Passthrough fails when we have Flavor configured and provide a port with vnic_type=direct-physical

2019-06-13 Thread sean mooney
*** This bug is a duplicate of bug 1768917 ***
https://bugs.launchpad.net/bugs/1768917

i have closed this as a duplicate as i explained in the other bug that you
misunderstood how to use this feature.

based on the information you provided on the other bug i am assuming you
have only one nic available on the host and you are requesting it twice: once
via the alias and again via the neutron port.

that is incorrect.

you need 1 device for each request.

to use neutron PF passthrough (vnic_type=direct-physical) you should not
also specify a flavor alias unless you are using that to request a
different device.

nova will convert a port with vnic_type=direct-physical into a pci
request internally.

** This bug has been marked a duplicate of bug 1768917
   PCI-Passthrough documentation is incorrect while trying to pass through a NIC

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1768919

Title:
  PCI-Passthrough fails when we have Flavor configured and provide a
  port with vnic_type=direct-physical

Status in OpenStack Compute (nova):
  New

Bug description:
  PCI-Passthrough of a NIC device to the VM fails, when we have both the
  Flavor configured with Alias and also provide a network port with
  'vnic_type=direct-physical'.

  
  The comment shown in the source code shown below,

  
https://github.com/openstack/nova/blob/644ac5ec37903b0a08891cc403c8b3b63fc2a91c/nova/compute/api.py#L812
  # PCI requests come from two sources: instance flavor and
  # requested_networks. The first call in below returns an
  # InstancePCIRequests object which is a list of InstancePCIRequest
  # objects. The second call in below creates an InstancePCIRequest
  # object for each SR-IOV port, and append it to the list in the
  # InstancePCIRequests object

  In this case there would be two PCI-requests for the same device and
  _test_pci fails when the compute tries to check for the Claims.

  088d81f6653242318245b137b1ef91c7] _test_pci 
/opt/stack/venv/nova-20180424T164716Z/lib/python2.7/site-packages/nova/compute/claims.py:201
  2018-04-30 22:17:06.058 13396 DEBUG nova.compute.claims 
[req-c7689c16-227a-462e-aad5-4c462036051c df7bd0a08ee64da981574d7a7d76970a 
088d81f6653242318245b137b1ef91c7] pci requests: 
[InstancePCIRequest(alias_name='intel10fb',count=1,is_new=False,request_id=None,spec=[{dev_type='type-PF',product_id='10fb',vendor_id='8086'}]),
 
InstancePCIRequest(alias_name=None,count=1,is_new=False,request_id=13befe5f-478f-4f4c-aa72-78cce84d942d,spec=[{dev_type='type-PF',physical_network='physnet2'}])]
 _test_pci 
/opt/stack/venv/nova-20180424T164716Z/lib/python2.7/site-packages/nova/compute/claims.py:202
  2018-04-30 22:17:06.059 13396 DEBUG nova.compute.claims 
[req-c7689c16-227a-462e-aad5-4c462036051c df7bd0a08ee64da981574d7a7d76970a 
088d81f6653242318245b137b1ef91c7] PCI request stats failed  _test_pci 
/opt/stack/venv/nova-20180424T164716Z/lib/python2.7/site-packages/nova/compute/claims.py:206
  2018-04-30 22:17:06.059 13396 DEBUG oslo_concurrency.lockutils 
[req-c7689c16-227a-462e-aad5-4c462036051c df7bd0a08ee64da981574d7a7d76970a 
088d81f6653242318245b137b1ef91c7] Lock "compute_resources" released by 
"nova.compute.resource_tracker.instance_claim" :: held 0.059s inner 
/opt/stack/venv/nova-20180424T164716Z/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:282
  2018-04-30 22:17:06.060 13396 DEBUG nova.compute.manager 
[req-c7689c16-227a-462e-aad5-4c462036051c df7bd0a08ee64da981574d7a7d76970a 
088d81f6653242318245b137b1ef91c7] [instance: 
39ad3a47-66dc-4114-9653-fee5ee0c87dc] Insufficient compute resources: Claim pci 
failed.. 

  Not sure why the Claim pci failed for the same device entry twice.

  Probably if the device id is the same on both Flavor and network, then
  it should only compose one entry since they both are identical.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1768919/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1817683] Re: Key pair not imported when passing cloud-init script on initiation

2019-06-13 Thread sean mooney
i believe you are correct that this behavior is caused by the fact you
are creating a user via cloud-init.

if you think this is really a nova bug feel free to set the status back
to New for the bug to be retriaged.
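
for reference, the likely mechanism (my assumption, not something i have verified
against this exact image) is that supplying a users: list in cloud-config replaces
the default user that the nova key pair would normally be applied to. a sketch of a
cloud-config that keeps the default user, and therefore the injected key, while still
adding the extra user would be:

  #cloud-config
  users:
    - default   # keep the distro default user so the key pair is still installed
    - name: jelle
      lock_passwd: false
      sudo: ['ALL=(ALL) NOPASSWD:ALL']
      groups: sudo
      shell: /bin/bash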

** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1817683

Title:
  Key pair not imported when passing cloud-init script on initiation

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Description
  ===

  The public SSH key is not imported when an instance is created with a
  key pair (key pair tab) + cloud-init script (configuration tab)

  - Reproduced in dashboard (Horizon)
  - Reproduced with python (nova.server.create())

  
  Steps to reproduce
  ==
  - Create an instance in the GUI
  - with a key pair

  Key pair is inserted
  
[   22.212331] cloud-init[993]: Cloud-init v. 18.4-0ubuntu1~18.04.1 running 
'modules:config' at Tue, 26 Feb 
2019 09:44:27 +. Up 21.13 seconds.
[[0;32m  OK  [0m] Started Apply the settings specified in cloud-config.
   Starting Execute cloud user/final scripts...
ci-info: +Authorized keys from /home/ubuntu/.ssh/authorized_keys for 
user ubuntu++
ci-info: 
+-+-+-+-+
ci-info: | Keytype |Fingerprint (md5)| 
Options | Comment |
ci-info: 
+-+-+-+-+
ci-info: | ssh-rsa | 36:b4:ea:45:0a:77:c4:87:c9:71:d5:78:6e:a5:ee:ba |- 
   |-|
ci-info: 
+-+-+-+-+

=> login to VM with key pair
-> Login successful 

  - Create a second instance
- with a key pair
- pass a cloud-init script in the user configuration 

#cloud-config
chpasswd: 
  expire: false
  list: |
  root:toor
  jelle:jelle
users: 
  - name: jelle
lock-passwd: false
sudo: ['ALL=(ALL) NOPASSWD:ALL']
groups: sudo
shell: /bin/bash

  ==> Public key from the key-pair is not imported

  [   21.472835] cloud-init[937]: Cloud-init v. 18.4-0ubuntu1~18.04.1 running 
'modules:config' at Tue, 26 Feb 2019 09:36:21 +. Up 20.47 seconds.
  [[0;32m  OK  [0m] Started Apply the settings specified in cloud-config.
   Starting Execute cloud user/final scripts...
  ci-info: no authorized ssh keys fingerprints found for user jelle.
  <14>Feb 26 09:36:23 ec2: 
  <14>Feb 26 09:36:23 ec2: 
#
  <14>Feb 26 09:36:23 ec2: -BEGIN SSH HOST KEY FINGERPRINTS-
  <14>Feb 26 09:36:23 ec2: 1024 
SHA256:mfFrY4zKFLuJPRF6Pw6z8suzBzA7jx21sife3MwEee4 root@test (DSA)
  <14>Feb 26 09:36:23 ec2: 256 
SHA256:JzA4J0A6oN5c1vTiGpTPBgqisb1IlxXBumlnk/Jg1Po root@test (ECDSA)
  <14>Feb 26 09:36:23 ec2: 256 
SHA256:j/mU93YAfgHxdrXJD0QT6SMFFoOzRvtES/YZ+9ZBNaM root@test (ED25519)
  <14>Feb 26 09:36:23 ec2: 2048 
SHA256:Hy1gMvK/7hSoyIacAgx+C/jEHkbCi5yS9YbiYfcTVGo root@test (RSA)
  <14>Feb 26 09:36:23 ec2: -END SSH HOST KEY FINGERPRINTS-
  <14>Feb 26 09:36:23 ec2: 
#
  -BEGIN SSH HOST KEY KEYS-
  ecdsa-sha2-nistp256 
E2VjZHNhLXNoYTItbmlzdHAyNTYIbmlzdHAyNTYAAABBBGBMYWNnP97Znq6Al0LHqzUu8tOa3/T4fuh+PLAIW26b2361MarI/1HxxseRmCUgb45Gw5zXu7CfLhAlHaThirk=
 root@test
  ssh-ed25519 
C3NzaC1lZDI1NTE5IJ54epYzeKPsUs8UXyac+nTPQGpNY2CQWwBQL4aEPZD6 root@test
  ssh-rsa 
B3NzaC1yc2EDAQABAAABAQCwtmWLjZrRB4BVxcWAZt8/uWkkQhMCkrdNQTS40ZGTGto46MyBmyA+4RJxnZ8MV9I/8lpBt1EY5ERdf/5gDwN51wzq57LVuTz46mhYU3i85YECaE98VXG9I52OC0/UzgvlEbwEbVPlMh+ZVkNSkZu4Mcuvi0hvzU7+Z5p8CvWEMhIvtWAKbf/ujK0WzeYRwsqQfGm5hUH6TJSjFRCC/T1DosnM+hgDlNkiYGjlUE9LvSPRTX1rMfakUbWzK/EJWuGuYO21P/oORNDeJxWPZS/Y8cW+VCQbXCuXqXFst347Tvnl/kmZULjRJjB05eAV6Ejto2tRbCku49POA26/GzMj
 root@test
  -END SSH HOST KEY KEYS-
  [   22.295189] cloud-init[995]: Cloud-init v. 18.4-0ubuntu1~18.04.1 running 
'modules:final' at Tue, 26 Feb 2019 09:36:23 +. Up 22.06 seconds.
  [   22.299328] cloud-init[995]: ci-info: no authorized ssh keys fingerprints 
found for user jelle.
  [   22.301658] cloud-init[995]: Cloud-init v. 18.4-0ubuntu1~18.04.1 finished 
at Tue, 26 Feb 2019 09:36:23 +. Datasource DataSourceOpenStackLocal 
[net,ver=2].  Up 22.27 seconds

=> Login with keypair
-> login fails



Environment
===


ubuntu@juju-5dc387-0-lxd-6:~$ nova-manage --version
15.1.5

ubuntu@juju-5dc387-0-lxd-6:~$ dpkg -l | grep nova
ii  nova-api-os-compute  2:15.1.5-0ubuntu1~cloud0   
all  OpenStack Compute - OpenStack Compute API frontend
ii  nova-common  

[Yahoo-eng-team] [Bug 1815762] Re: you can end up in a state where qvo* interfaces aren't owned by ovs which results in a dangling connection

2019-06-13 Thread sean mooney
this might be something that could be added to the existing
neutron-ovs-cleanup script
that is generated by this entry point
https://github.com/openstack/neutron/blob/master/setup.cfg#L49
and implemented here
https://github.com/openstack/neutron/blob/master/neutron/cmd/ovs_cleanup.py

but this should not live in nova.
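
for context, that cleanup tool is normally run once at node startup, along the
lines of the following (config file paths are the usual defaults and may differ
per deployment):

  neutron-ovs-cleanup --config-file /etc/neutron/neutron.conf \
    --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini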

** Also affects: neutron
   Importance: Undecided
   Status: New

** Changed in: nova
   Status: New => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1815762

Title:
  you can end up in a state where qvo* interfaces aren't owned by ovs
  which results in a dangling connection

Status in neutron:
  New
Status in OpenStack Compute (nova):
  Won't Fix

Bug description:
  While upgrading to rocky, we ended up with a broken openvswitch
  infrastructure and moved back to the old openvswitch.

  We ended up with new machines working, old machines didn't and it took
  a while to realize that we had qvo* interfaces that not only wasn't
  plugged but also wasn't owned by ovs-system - basically the virtual
  equivalent of forgetting to plug in the cable ;)

  This was quickly addressed by running this bash-ism on all nodes:
  for x in `ip a |grep qvo |grep @qvb |grep -v ovs-system | awk '{ print $2 }'`
; do y=${x%%"@"*} && ip link delete $y ; done ; docker restart nova_compute

  However, nova could pretty easily sanity check this =)

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1815762/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1813446] Re: impl_rabbit timedout

2019-06-13 Thread sean mooney
marking as invalid as there appear to be several different issues, from your
database to the kernel locking up, which seem unrelated to nova.

** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1813446

Title:
  impl_rabbit timedout

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  I have problems running instance after creation it says hosts not
  found

  it was because the mysql had gone

  so I applied this fix I found on the internet

  mysql --max_allowed_packet=25G

  !/usr/bin/env python2.7
  import time
  import mysql.connector

  now I have this error , traced it:

  Error: Unable to create the server.

  Unexpected API Error. Please report this at
  http://bugs.launchpad.net/nova/ and attach the Nova API log if
  possible. 
  (HTTP 500) (Request-ID: req-165fa813-f601-4aed-b584-bf847ae764b7)

  also it shows in nova-api.log

  2019-01-26 22:24:54.143 113216 ERROR nova.api.openstack.wsgi
  2019-01-26 22:24:54.450 113216 INFO nova.api.openstack.wsgi 
[req-165fa813-f601-4aed-b584-bf847ae764b7 c48c372dabe14b24aeec0408d345f30d 
d159ec3920b94490a9a85ed183482acc - default default] HTTP exception thrown: 
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and 
attach the Nova API log if possible.
  [root@amer swift(keystone_admin)]# 

  I have a VM of 8G 8 cores and 1 TB HDD

  pinging is successful from my host to the outside world:

  [root@amer swift(keystone_admin)]# ping google.com
  PING google.com (172.217.19.46) 56(84) bytes of data.
  64 bytes from ham02s11-in-f46.1e100.net (172.217.19.46): icmp_seq=1 ttl=53 
time=83.6 ms
  64 bytes from ham02s11-in-f46.1e100.net (172.217.19.46): icmp_seq=2 ttl=53 
time=87.8 ms

  I can ping myself also:
  [root@amer swift(keystone_admin)]# ping amer.example.com
  PING amer.example.com (192.168.43.110) 56(84) bytes of data.
  64 bytes from amer.example.com (192.168.43.110): icmp_seq=1 ttl=64 time=0.044 
ms
  64 bytes from amer.example.com (192.168.43.110): icmp_seq=2 ttl=64 time=0.053 
ms
  64 bytes from amer.example.com (192.168.43.110): icmp_seq=3 ttl=64 time=0.046 
ms
  64 bytes from amer.example.com (192.168.43.110): icmp_seq=4 ttl=64 time=0.082 
ms

  openstack compute service list gives:

  [root@amer swift(keystone_admin)]# openstack compute service list
  
++--+--+--+-+---++
  | ID | Binary   | Host | Zone | Status  | State | 
Updated At |
  
++--+--+--+-+---++
  |  4 | nova-conductor   | amer.example.com | internal | enabled | up| 
2019-01-27T03:43:17.00 |
  |  5 | nova-scheduler   | amer.example.com | internal | enabled | up| 
2019-01-27T03:43:24.00 |
  |  7 | nova-consoleauth | amer.example.com | internal | enabled | up| 
2019-01-27T03:43:15.00 |
  |  8 | nova-compute | amer.example.com | nova | enabled | up| 
2019-01-27T03:43:18.00 |
  
++--+--+--+-+---++
  [root@amer swift(keystone_admin)]#

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1813446/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1802218] Re: An instance created by openstack rocky can't be remotely connected by xshell and putty, But SSH tools for Linux systems do. [server's host key did not match the sign

2019-06-13 Thread sean mooney
marking as invalid as this is likely an issue with the ssh client you are using
or the ssh server in the guest.
the fact it works on linux but not windows/android suggests to me it might be
related to the authentication methods
or encryption algorithms and key types supported in the client/server, and is
likely not related to openstack/nova.

** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1802218

Title:
  An instance created by openstack rocky can't be remotely connected by
  xshell and putty, But SSH tools for Linux systems do. [server's host
  key did not match the signature supplied]

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  The images used include:
  1. Use xshell, putty, JuiceSSH cannot connect 
[http://cloud.centos.org/centos/7/images/CentOS-7-x86_64-GenericCloud-1805.qcow2]
  2. Cannot connect with xshell, putty and JuiceSSH [centos7.4 made by myself, 
no problem with Ocata version built before]
  3. With xshell, putty and JuiceSSH, you can connect 
[http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img]
  After the failure, attempts were made to recreate the SSH keys in the system, 
but to no avail.

  The order is as follows:
  [root@host-192-168-1-10 ~]# rm -f /etc/ssh/ssh_host_*
  [root@host-192-168-1-10 ~]# systemctl restart sshd.service

  
  It environment
  [root@all-in-one-202 ~]# rpm -qa | grep rocky
  centos-release-openstack-rocky-1-1.el7.centos.noarch
  [root@all-in-one-202 ~]# 
  [root@all-in-one-202 ~]# rpm -qa | grep nova
  openstack-nova-conductor-18.0.2-1.el7.noarch
  openstack-nova-console-18.0.2-1.el7.noarch
  openstack-nova-api-18.0.2-1.el7.noarch
  python2-novaclient-11.0.0-1.el7.noarch
  openstack-nova-common-18.0.2-1.el7.noarch
  openstack-nova-placement-api-18.0.2-1.el7.noarch
  openstack-nova-compute-18.0.2-1.el7.noarch
  openstack-nova-novncproxy-18.0.2-1.el7.noarch
  openstack-nova-scheduler-18.0.2-1.el7.noarch
  python-nova-18.0.2-1.el7.noarch
  [root@all-in-one-202 ~]#
  [root@all-in-one-202 ~]# rpm -qa | egrep -i "libvirt|kvm"
  libvirt-daemon-driver-network-3.9.0-14.el7_5.8.x86_64
  libvirt-daemon-driver-storage-scsi-3.9.0-14.el7_5.8.x86_64
  libvirt-libs-3.9.0-14.el7_5.8.x86_64
  libvirt-daemon-driver-storage-disk-3.9.0-14.el7_5.8.x86_64
  libvirt-daemon-driver-nwfilter-3.9.0-14.el7_5.8.x86_64
  libvirt-daemon-driver-storage-logical-3.9.0-14.el7_5.8.x86_64
  libvirt-daemon-driver-nodedev-3.9.0-14.el7_5.8.x86_64
  qemu-kvm-common-ev-2.10.0-21.el7_5.7.1.x86_64
  libvirt-daemon-driver-storage-core-3.9.0-14.el7_5.8.x86_64
  libvirt-daemon-driver-storage-rbd-3.9.0-14.el7_5.8.x86_64
  libvirt-daemon-driver-storage-mpath-3.9.0-14.el7_5.8.x86_64
  libvirt-daemon-driver-secret-3.9.0-14.el7_5.8.x86_64
  qemu-kvm-ev-2.10.0-21.el7_5.7.1.x86_64
  libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.8.x86_64
  libvirt-daemon-driver-interface-3.9.0-14.el7_5.8.x86_64
  libvirt-daemon-kvm-3.9.0-14.el7_5.8.x86_64
  libvirt-daemon-driver-qemu-3.9.0-14.el7_5.8.x86_64
  libvirt-daemon-driver-storage-3.9.0-14.el7_5.8.x86_64
  libvirt-client-3.9.0-14.el7_5.8.x86_64
  libvirt-daemon-3.9.0-14.el7_5.8.x86_64
  libvirt-daemon-driver-storage-iscsi-3.9.0-14.el7_5.8.x86_64
  libvirt-python-3.9.0-1.el7.x86_64
  [root@all-in-one-202 ~]#
  [root@all-in-one-202 ~]# rpm -qa | grep neutron
  python2-neutron-lib-1.18.0-1.el7.noarch
  python2-neutronclient-6.9.1-1.el7.noarch
  openstack-neutron-common-13.0.1-2.el7.noarch
  openstack-neutron-ml2-13.0.1-2.el7.noarch
  openstack-neutron-13.0.1-2.el7.noarch
  python-neutron-13.0.1-2.el7.noarch
  openstack-neutron-linuxbridge-13.0.1-2.el7.noarch

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1802218/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1815989] Re: OVS drops RARP packets by QEMU upon live-migration causes up to 40s ping pause in Rocky

2019-06-15 Thread sean mooney
** Changed in: os-vif
   Status: New => Invalid

** Changed in: nova
   Status: New => In Progress

** Changed in: nova
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

** Changed in: nova
   Importance: Undecided => Medium

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1815989

Title:
  OVS drops RARP packets by QEMU upon live-migration causes up to 40s
  ping pause in Rocky

Status in neutron:
  In Progress
Status in OpenStack Compute (nova):
  In Progress
Status in os-vif:
  Invalid

Bug description:
  This issue is well known, and there were previous attempts to fix it,
  like this one

  https://bugs.launchpad.net/neutron/+bug/1414559

  
  This issue still exists in Rocky and gets worse. In Rocky, nova compute, nova 
libvirt and neutron ovs agent all run inside containers.

  So far the only simply fix I have is to increase the number of RARP
  packets QEMU sends after live-migration from 5 to 10. To be complete,
  the nova change (not merged) proposed in the above mentioned activity
  does not work.

  I am creating this ticket hoping to get an up-to-date (for Rockey and
  onwards) expert advise on how to fix in nova-neutron.

  
  For the record, below are the time stamps in my test between neutron ovs 
agent "activating" the VM port and rarp packets seen by tcpdump on the compute. 
10 RARP packets are sent by (recompiled) QEMU, 7 are seen by tcpdump, the 2nd 
last packet barely made through.

  openvswitch-agent.log:

  2019-02-14 19:00:13.568 73453 INFO
  neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
  [req-26129036-b514-4fa0-a39f-a6b21de17bb9 - - - - -] Port
  57d0c265-d971-404d-922d-963c8263e6eb updated. Details: {'profile': {},
  'network_qos_policy_id': None, 'qos_policy_id': None,
  'allowed_address_pairs': [], 'admin_state_up': True, 'network_id':
  '1bf4b8e0-9299-485b-80b0-52e18e7b9b42', 'segmentation_id': 648,
  'fixed_ips': [

  {'subnet_id': 'b7c09e83-f16f-4d4e-a31a-e33a922c0bac', 'ip_address': 
'10.0.1.4'}
  ], 'device_owner': u'compute:nova', 'physical_network': u'physnet0', 
'mac_address': 'fa:16:3e:de:af:47', 'device': 
u'57d0c265-d971-404d-922d-963c8263e6eb', 'port_security_enabled': True, 
'port_id': '57d0c265-d971-404d-922d-963c8263e6eb', 'network_type': u'vlan', 
'security_groups': [u'5f2175d7-c2c1-49fd-9d05-3a8de3846b9c']}
  2019-02-14 19:00:13.568 73453 INFO 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
[req-26129036-b514-4fa0-a39f-a6b21de17bb9 - - - - -] Assigning 4 as local vlan 
for net-id=1bf4b8e0-9299-485b-80b0-52e18e7b9b42

   
  tcpdump for rarp packets:

  [root@overcloud-ovscompute-overcloud-0 nova]# tcpdump -i any rarp -nev
  tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 
262144 bytes

  19:00:10.788220 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:11.138216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:11.588216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:12.138217 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:12.788216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:13.538216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46
  19:00:14.388320 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 
62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 
tell fa:16:3e:de:af:47, length 46

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1815989/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1774252] Re: Resize confirm fails if nova-compute is restarted after resize

2019-06-19 Thread sean mooney
*** This bug is a duplicate of bug 1774249 ***
https://bugs.launchpad.net/bugs/1774249

** This bug has been marked a duplicate of bug 1774249
   update_available_resource will raise DiskNotFound after resize but before 
confirm

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1774252

Title:
  Resize confirm fails if nova-compute is restarted after resize

Status in OpenStack Compute (nova):
  New

Bug description:
  Originally reported in RH bugzilla:
  https://bugzilla.redhat.com/show_bug.cgi?id=1584315

  Reproduced on OSP12 (Pike).

  After resizing an instance but before confirm,
  update_available_resource will fail on the source compute due to bug
  1774249. If nova compute is restarted at this point before the resize
  is confirmed, the update_available_resource period task will never
  have succeeded, and therefore ResourceTracker's compute_nodes dict
  will not be populated at all.

  When confirm calls _delete_allocation_after_move() it will fail with
  ComputeHostNotFound because there is no entry for the current node in
  ResourceTracker. The error looks like:

  2018-05-30 13:42:19.239 1 ERROR nova.compute.manager 
[req-4f7d5d63-fc05-46ed-b505-41050d889752 09abbd4893bb45eea8fb1d5e40635339 
d4483d13a6ef41b2ae575ddbd0c59141 - default default] [instance: 
1374133a-2c08-4a8f-94f6-729d4e58d7e0] Setting instance vm_state to ERROR: 
ComputeHostNotFound: Compute host compute-1.localdomain could not be found.
  2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 
1374133a-2c08-4a8f-94f6-729d4e58d7e0] Traceback (most recent call last):
  2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 
1374133a-2c08-4a8f-94f6-729d4e58d7e0]   File 
"/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7445, in 
_error_out_instance_on_exception
  2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 
1374133a-2c08-4a8f-94f6-729d4e58d7e0] yield
  2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 
1374133a-2c08-4a8f-94f6-729d4e58d7e0]   File 
"/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3757, in 
_confirm_resize
  2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 
1374133a-2c08-4a8f-94f6-729d4e58d7e0] migration.source_node)
  2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 
1374133a-2c08-4a8f-94f6-729d4e58d7e0]   File 
"/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3790, in 
_delete_allocation_after_move
  2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 
1374133a-2c08-4a8f-94f6-729d4e58d7e0] cn_uuid = rt.get_node_uuid(nodename)
  2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 
1374133a-2c08-4a8f-94f6-729d4e58d7e0]   File 
"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 155, 
in get_node_uuid
  2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 
1374133a-2c08-4a8f-94f6-729d4e58d7e0] raise 
exception.ComputeHostNotFound(host=nodename)
  2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 
1374133a-2c08-4a8f-94f6-729d4e58d7e0] ComputeHostNotFound: Compute host 
compute-1.localdomain could not be found.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1774252/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1835822] [NEW] vms lose access to config drive with CONF.force_config_drive=True after hard reboot

2019-07-08 Thread sean mooney
Public bug reported:

The fix to bug https://bugs.launchpad.net/nova/+bug/1827492
https://review.opendev.org/#/c/659703/8

changed the behavior of nova.virt.configdrive.required_by
to depend on instance.launched_at

https://review.opendev.org/#/c/659703/8/nova/virt/configdrive.py@196

but did not reorder
https://github.com/openstack/nova/blob/86524773b8cd3a52c98409c7ca183b4e1873e2b8/nova/compute/manager.py#L1757-L1758

as a result when nova.compute.manager._update_instance_after_spawn
is called instance.launched_at is always set before we call 
nova.virt.configdrive.update_instance

as a result instance.config_drive will always be set to false if not set
on the api.

this results in a vm that is spawned on a host with force_config_drive=True
initially spawning
with a config drive but losing it after a hard reboot.

for any deployment that uses the config drive for vendor data or device
role tagging because they do not deploy the metadata service, this is a
regression as they cannot fall back to the metadata service.

this also might cause issues for deployments that support the deprecated file
injection api that is part of the v2.1 api, as the files are only stored in the
config drive and are not part of the metadata endpoint.
note: i have not checked if we autoset instance.config_drive when you use file
injection or not, so it may be unaffected; the breakage of the other
supported use cases is enough to justify this bug.


the fix is simple: just swap the order of
https://github.com/openstack/nova/blob/86524773b8cd3a52c98409c7ca183b4e1873e2b8/nova/compute/manager.py#L1757-L1758

and then instances will have their instance.config_drive value set correctly
when they first boot
and it will be sticky for the lifetime of the instance.
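
a rough sketch of the intended reordering in _update_instance_after_spawn (not the
exact upstream diff, just the idea; the launched_at assignment is paraphrased):

  # evaluate the config drive requirement before launched_at is set, so
  # nova.virt.configdrive.required_by still sees the instance as never launched
  configdrive.update_instance(instance)
  instance.launched_at = timeutils.utcnow()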

** Affects: nova
 Importance: Medium
 Assignee: sean mooney (sean-k-mooney)
 Status: Confirmed


** Tags: libvirt

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1835822

Title:
  vms lose access to config drive with CONF.force_config_drive=True
  after hard reboot

Status in OpenStack Compute (nova):
  Confirmed

Bug description:
  The fix to bug https://bugs.launchpad.net/nova/+bug/1827492
  https://review.opendev.org/#/c/659703/8

  changed the behavior of nova.virt.configdrive.required_by
  to depend on instance.launched_at

  https://review.opendev.org/#/c/659703/8/nova/virt/configdrive.py@196

  but did not reorder
  
https://github.com/openstack/nova/blob/86524773b8cd3a52c98409c7ca183b4e1873e2b8/nova/compute/manager.py#L1757-L1758

  as a result when nova.compute.manager._update_instance_after_spawn
  is called instance.launched_at is always set before we call 
nova.virt.configdrive.update_instance

  as a result instance.config_drive will always be set to false if not
  set on the api.

  this results in a vm that is spawned on a host with force_config_drive=True
initially spawning
  with a config drive but losing it after a hard reboot.

  for any deployment that uses the config drive for vendor data or device
  role tagging because they do not deploy the metadata service, this is a
  regression as they cannot fall back to the metadata service.

  this also might cause issues for deployments that support the deprecated file
injection api that is part of the v2.1 api, as the files are only stored in the
config drive and are not part of the metadata endpoint.
  note: i have not checked if we autoset instance.config_drive when you use
file injection or not, so it may be unaffected; the breakage of the other
supported use cases is enough to justify this bug.

  
  the fix is simple: just swap the order of
https://github.com/openstack/nova/blob/86524773b8cd3a52c98409c7ca183b4e1873e2b8/nova/compute/manager.py#L1757-L1758

  and then instances will have their instance.config_drive value set correctly
when they first boot
  and it will be sticky for the lifetime of the instance.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1835822/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1836105] Re: Instance does not start - Error during following call to agent: ovs-vsctl

2019-07-12 Thread sean mooney
can you check that the ovs-dpdk is actually working on the host.

if you do "ps aux | grep ovs" do you see the ovs-vswitchd or ovsdb-server
processes running?

if so please run "ovs-vsctl show" and "ovs-ofctl dump-flows br-int"
to confirm your ovs is actually functional via the command line.

i'm not familiar with the ubuntu charms for ovs but it's possible that they
configured ovs to listen on tcp only.

if that is the case then you either need to configure it to work with the clis
too
or configure os-vif and neutron to use tcp.

i don't think os-vif supports tcp in queens however.

this looks like a charms issue with how the charm deployed ovs-dpdk, not
a nova bug, so we probably should retarget this bug.

** Changed in: nova
   Importance: Undecided => Low

** Changed in: nova
   Status: New => Incomplete

** Also affects: charm-neutron-openvswitch
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1836105

Title:
  Instance does not start - Error during following call to agent: ovs-
  vsctl

Status in OpenStack neutron-openvswitch charm:
  New
Status in OpenStack Compute (nova):
  Incomplete

Bug description:
  This is Openstack Queens on Bionic. The main difference from templates
  is no neutron-gateway (provider network only) and use of DPDK.

  There are other issues under investigation about dpdk and checksumming but 
they don't seem related to this at first look.
  https://bugs.launchpad.net/ubuntu/+source/dpdk/+bug/1833713

  - Instances cannot be started once they are shutdown
  - It's happening to every instance after the problem first appeared
  - It's happening on different hosts
  - Any try to start will timeout with errors in nova log (bellow)
  - New instances can be created and they boot ok
  - Nothing new appears in openvswitch logs with normal debugging level
  - Nothing new appears on libvirt logs for the instance (last status is from 
last boot)

  2019-07-10 13:40:42.013 19975 ERROR oslo_messaging.rpc.server
  InternalError: Failure running os_vif plugin plug method: Failed to
  plug VIF
  
VIFVHostUser(active=True,address=fa:16:3e:8e:8f:9b,has_traffic_filtering=False,id=ab6225f4-1cd8-43c7-8777-52c99ae80f67,mode='server',network=Network
  (d8249c3d-03d9-44ac-8eae-fa967993c73d),path='/run/libvirt-vhost-
  
user/vhuab6225f4-1c',plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=True,vif_name='vhuab6225f4-1c').
  Got error: Error during following call to agent: ['ovs-vsctl', '--
  timeout=120', '--', '--if-exists', 'del-port', u'vhuab6225f4-1c',
  '--', 'add-port', u'br-int', u'vhuab6225f4-1c', '--', 'set',
  'Interface', u'vhuab6225f4-1c', u'external-ids:iface-
  id=ab6225f4-1cd8-43c7-8777-52c99ae80f67', 'external-ids:iface-
  status=active', u'external-ids:attached-mac=fa:16:3e:8e:8f:9b', u
  'external-ids:vm-uuid=5e46868f-8a52-4d70-b08a-9a320dc9821b',
  'type=dpdkvhostuserclient', u'options:vhost-server-path=/run/libvirt-
  vhost-user/vhuab6225f4-1c']

  2019-07-10 13:43:05.511 19975 ERROR os_vif AgentError: Error during
  following call to agent: ['ovs-vsctl', '--timeout=120', '--', '--if-
  exists', 'del-port', u'vhuab6225f4-1c', '--', 'add-port', u'br-int',
  u'vhuab6225f4-1c', '--', 'set', 'Interface', u'vhuab6225f4-1c', u
  'external-ids:iface-id=ab6225f4-1cd8-43c7-8777-52c99ae80f67',
  'external-ids:iface-status=active', u'external-ids:attached-
  mac=fa:16:3e:8e:8f:9b', u'external-ids:vm-uuid=5e46868f-8a52-4d70
  -b08a-9a320dc9821b', 'type=dpdkvhostuserclient', u'options:vhost-
  server-path=/run/libvirt-vhost-user/vhuab6225f4-1c']

  Complete logs will follow.

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1836105/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1837252] Re: IFLA_BR_AGEING_TIME of 0 causes flooding across bridges

2019-07-23 Thread sean mooney
triaging as high as flooding could lead to network disruption to guests
on multiple hosts.

i have root caused this: it is a result of combining the code into a single
shared codepath between the ovs and linux bridge plugins.

for ovs hybrid plug we set the ageing to 0 to prevent packet loss during
live migration

https://github.com/openstack/os-vif/commit/fa4ff64b86e6e1b6399f7250eadbee9775c22d32#diff-f55bc78ffb4c1bbf81b88bf68673

however this is not valid for linux bridge in general

https://github.com/openstack/os-vif/commit/1f6fed6a69e9fd386e421f3cacae97c11cdd7c75#diff-010d1833da7ca175fffc8c41a38497c2

the change which replaced the use of brctl in the linux bridge driver reused the
common code i introduced in

https://github.com/openstack/os-vif/commit/5027ce833c6fccaa80b5ddc8544d262c0bf99dbd#diff-cec1a2ac6413663c344b607129c39fab

and as a result it picked up the ovs ageing code, which was not
intentional.

i'll fix this shortly and backport it.
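
until that fix is released, a possible operator-side workaround (my suggestion, not
part of the official fix) is to restore a non-zero mac ageing time on the affected
linux bridges, e.g.:

  # restore the default 300 second ageing time on an affected bridge
  # (bridge name taken from the report; repeat for each brq* bridge)
  brctl setageing brqd0084ac0-f7 300
  # or with iproute2 (the value is in centiseconds)
  ip link set dev brqd0084ac0-f7 type bridge ageing_time 30000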

** Changed in: os-vif
   Importance: Undecided => High

** Changed in: os-vif
   Status: New => Confirmed

** Changed in: os-vif
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

** Changed in: nova
   Status: New => Invalid

** Changed in: neutron
   Status: Incomplete => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1837252

Title:
  IFLA_BR_AGEING_TIME of 0 causes flooding across bridges

Status in neutron:
  Invalid
Status in OpenStack Compute (nova):
  Invalid
Status in os-vif:
  Confirmed

Bug description:
  Release: OpenStack Stein
  Driver: LinuxBridge

  Using Stein w/ the LinuxBridge mech driver/agent, we have found that
  traffic is being flooded across bridges. Using tcpdump inside an
  instance, you can see unicast traffic for other instances.

  We have confirmed the macs table shows the aging timer set to 0 for
  permanent entries, and the bridge is NOT learning new MACs:

  root@lab-compute01:~# brctl showmacs brqd0084ac0-f7
  port no   mac addris local?   ageing timer
5   24:be:05:a3:1f:e1   yes0.00
5   24:be:05:a3:1f:e1   yes0.00
1   fe:16:3e:02:62:18   yes0.00
1   fe:16:3e:02:62:18   yes0.00
7   fe:16:3e:07:65:47   yes0.00
7   fe:16:3e:07:65:47   yes0.00
4   fe:16:3e:1d:d6:33   yes0.00
4   fe:16:3e:1d:d6:33   yes0.00
9   fe:16:3e:2b:2f:f0   yes0.00
9   fe:16:3e:2b:2f:f0   yes0.00
8   fe:16:3e:3c:42:64   yes0.00
8   fe:16:3e:3c:42:64   yes0.00
   10   fe:16:3e:5c:a6:6c   yes0.00
   10   fe:16:3e:5c:a6:6c   yes0.00
2   fe:16:3e:86:9c:dd   yes0.00
2   fe:16:3e:86:9c:dd   yes0.00
6   fe:16:3e:91:9b:45   yes0.00
6   fe:16:3e:91:9b:45   yes0.00
   11   fe:16:3e:b3:30:00   yes0.00
   11   fe:16:3e:b3:30:00   yes0.00
3   fe:16:3e:dc:c3:3e   yes0.00
3   fe:16:3e:dc:c3:3e   yes0.00

  root@lab-compute01:~# bridge fdb show | grep brqd0084ac0-f7
  01:00:5e:00:00:01 dev brqd0084ac0-f7 self permanent
  fe:16:3e:02:62:18 dev tap74af38f9-2e master brqd0084ac0-f7 permanent
  fe:16:3e:02:62:18 dev tap74af38f9-2e vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:86:9c:dd dev tapb00b3c18-b3 master brqd0084ac0-f7 permanent
  fe:16:3e:86:9c:dd dev tapb00b3c18-b3 vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:dc:c3:3e dev tap7284d235-2b master brqd0084ac0-f7 permanent
  fe:16:3e:dc:c3:3e dev tap7284d235-2b vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:1d:d6:33 dev tapbeb9441a-99 vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:1d:d6:33 dev tapbeb9441a-99 master brqd0084ac0-f7 permanent
  24:be:05:a3:1f:e1 dev eno1.102 vlan 1 master brqd0084ac0-f7 permanent
  24:be:05:a3:1f:e1 dev eno1.102 master brqd0084ac0-f7 permanent
  fe:16:3e:91:9b:45 dev tapc8ad2cec-90 master brqd0084ac0-f7 permanent
  fe:16:3e:91:9b:45 dev tapc8ad2cec-90 vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:07:65:47 dev tap86e2c412-24 master brqd0084ac0-f7 permanent
  fe:16:3e:07:65:47 dev tap86e2c412-24 vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:3c:42:64 dev tap37bcb70e-9e master brqd0084ac0-f7 permanent
  fe:16:3e:3c:42:64 dev tap37bcb70e-9e vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:2b:2f:f0 dev tap40f6be7c-2d vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:2b:2f:f0 dev tap40f6be7c-2d master brqd0084ac0-f7 permanent
  fe:16:3e:b3:30:00 dev tap6548bacb-c0 vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:b3:30:00 dev tap6548bacb-c0 master brqd0084ac0-f7 permanent
  fe:16:3e:5c:a6:6c dev

[Yahoo-eng-team] [Bug 1837252] Re: IFLA_BR_AGEING_TIME of 0 causes flooding across bridges

2019-07-25 Thread sean mooney
** Also affects: os-vif/stein
   Importance: Undecided
   Status: New

** Also affects: os-vif/trunk
   Importance: High
 Assignee: sean mooney (sean-k-mooney)
   Status: In Progress

** Changed in: os-vif/stein
   Status: New => Confirmed

** Changed in: os-vif/stein
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

** Changed in: os-vif/stein
   Importance: Undecided => High

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1837252

Title:
  IFLA_BR_AGEING_TIME of 0 causes flooding across bridges

Status in neutron:
  Invalid
Status in OpenStack Compute (nova):
  Invalid
Status in os-vif:
  In Progress
Status in os-vif stein series:
  Confirmed
Status in os-vif trunk series:
  In Progress
Status in OpenStack Security Advisory:
  Incomplete

Bug description:
  Release: OpenStack Stein
  Driver: LinuxBridge

  Using Stein w/ the LinuxBridge mech driver/agent, we have found that
  traffic is being flooded across bridges. Using tcpdump inside an
  instance, you can see unicast traffic for other instances.

  We have confirmed the macs table shows the aging timer set to 0 for
  permanent entries, and the bridge is NOT learning new MACs:

  root@lab-compute01:~# brctl showmacs brqd0084ac0-f7
  port no   mac addris local?   ageing timer
5   24:be:05:a3:1f:e1   yes0.00
5   24:be:05:a3:1f:e1   yes0.00
1   fe:16:3e:02:62:18   yes0.00
1   fe:16:3e:02:62:18   yes0.00
7   fe:16:3e:07:65:47   yes0.00
7   fe:16:3e:07:65:47   yes0.00
4   fe:16:3e:1d:d6:33   yes0.00
4   fe:16:3e:1d:d6:33   yes0.00
9   fe:16:3e:2b:2f:f0   yes0.00
9   fe:16:3e:2b:2f:f0   yes0.00
8   fe:16:3e:3c:42:64   yes0.00
8   fe:16:3e:3c:42:64   yes0.00
   10   fe:16:3e:5c:a6:6c   yes0.00
   10   fe:16:3e:5c:a6:6c   yes0.00
2   fe:16:3e:86:9c:dd   yes0.00
2   fe:16:3e:86:9c:dd   yes0.00
6   fe:16:3e:91:9b:45   yes0.00
6   fe:16:3e:91:9b:45   yes0.00
   11   fe:16:3e:b3:30:00   yes0.00
   11   fe:16:3e:b3:30:00   yes0.00
3   fe:16:3e:dc:c3:3e   yes0.00
3   fe:16:3e:dc:c3:3e   yes0.00

  root@lab-compute01:~# bridge fdb show | grep brqd0084ac0-f7
  01:00:5e:00:00:01 dev brqd0084ac0-f7 self permanent
  fe:16:3e:02:62:18 dev tap74af38f9-2e master brqd0084ac0-f7 permanent
  fe:16:3e:02:62:18 dev tap74af38f9-2e vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:86:9c:dd dev tapb00b3c18-b3 master brqd0084ac0-f7 permanent
  fe:16:3e:86:9c:dd dev tapb00b3c18-b3 vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:dc:c3:3e dev tap7284d235-2b master brqd0084ac0-f7 permanent
  fe:16:3e:dc:c3:3e dev tap7284d235-2b vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:1d:d6:33 dev tapbeb9441a-99 vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:1d:d6:33 dev tapbeb9441a-99 master brqd0084ac0-f7 permanent
  24:be:05:a3:1f:e1 dev eno1.102 vlan 1 master brqd0084ac0-f7 permanent
  24:be:05:a3:1f:e1 dev eno1.102 master brqd0084ac0-f7 permanent
  fe:16:3e:91:9b:45 dev tapc8ad2cec-90 master brqd0084ac0-f7 permanent
  fe:16:3e:91:9b:45 dev tapc8ad2cec-90 vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:07:65:47 dev tap86e2c412-24 master brqd0084ac0-f7 permanent
  fe:16:3e:07:65:47 dev tap86e2c412-24 vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:3c:42:64 dev tap37bcb70e-9e master brqd0084ac0-f7 permanent
  fe:16:3e:3c:42:64 dev tap37bcb70e-9e vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:2b:2f:f0 dev tap40f6be7c-2d vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:2b:2f:f0 dev tap40f6be7c-2d master brqd0084ac0-f7 permanent
  fe:16:3e:b3:30:00 dev tap6548bacb-c0 vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:b3:30:00 dev tap6548bacb-c0 master brqd0084ac0-f7 permanent
  fe:16:3e:5c:a6:6c dev tap61107236-1e vlan 1 master brqd0084ac0-f7 permanent
  fe:16:3e:5c:a6:6c dev tap61107236-1e master brqd0084ac0-f7 permanent

  The ageing time for the bridge is set to 0:

  root@lab-compute01:~# brctl showstp brqd0084ac0-f7
  brqd0084ac0-f7
   bridge id8000.24be05a31fe1
   designated root  8000.24be05a31fe1
   root port   0path cost  0
   max age20.00 bridge max age20.00
   hello time  2.00 bridge hello time  2.00
   forward delay   0.00 bridge forward delay  0.00
   ageing time 0.00
   hello timer
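
  As a stopgap until the os-vif fix lands, restoring a non-zero ageing time
  re-enables MAC learning and stops the flooding. A minimal workaround sketch,
  assuming the kernel default of 300 seconds and the affected bridge name from
  above (iproute2 takes the value in centiseconds):

    # Workaround sketch only, not the os-vif fix: reset the bridge ageing
    # time so the kernel resumes learning MACs and stops flooding unicast.
    import subprocess

    def reset_bridge_ageing(bridge, ageing_seconds=300):
        centisecs = ageing_seconds * 100  # iproute2 expects centiseconds
        subprocess.check_call(
            ["ip", "link", "set", "dev", bridge,
             "type", "bridge", "ageing_time", str(centisecs)])

    reset_bridge_ageing("brqd0084ac0-f7")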

[Yahoo-eng-team] [Bug 1825584] Re: eventlet monkey-patching breaks AMQP heartbeat on uWSGI

2020-07-08 Thread sean mooney
** Also affects: nova/train
   Importance: Undecided
   Status: New

** Also affects: nova/stein
   Importance: Undecided
   Status: New

** Also affects: nova/ussuri
   Importance: Undecided
   Status: New

** Changed in: nova/ussuri
   Status: New => Fix Released

** Changed in: nova/train
   Status: New => In Progress

** Changed in: nova/train
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

** Changed in: nova/ussuri
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

** Changed in: nova/stein
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

** Changed in: nova/stein
   Status: New => In Progress

** Changed in: nova
   Importance: Undecided => Low

** Changed in: nova/stein
   Importance: Undecided => Low

** Changed in: nova/train
   Importance: Undecided => Low

** Changed in: nova/ussuri
   Importance: Undecided => Low

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1825584

Title:
  eventlet monkey-patching breaks AMQP heartbeat on uWSGI

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) stein series:
  In Progress
Status in OpenStack Compute (nova) train series:
  In Progress
Status in OpenStack Compute (nova) ussuri series:
  Fix Released

Bug description:
  Stein nova-api running under uWSGI presents an AMQP issue. The first
  API call that requires RPC creates an AMQP connection and successfully
  completes. Normally regular heartbeats would be sent from this point
  on, to maintain the connection. This is not happening. After a few
  minutes, the AMQP server (rabbitmq, in my case) notices that there
  have been no heartbeats, and drops the connection. A later nova API
  call that requires RPC tries to use the old connection, and throws a
  "connection reset by peer" exception and the API call fails. A
  mailing-list response suggests that this is affecting mod_wsgi also:

  http://lists.openstack.org/pipermail/openstack-
  discuss/2019-April/005310.html

  I've discovered that this problem seems to be caused by eventlet
  monkey-patching, which was introduced in:

  
https://github.com/openstack/nova/commit/23ba1c690652832c655d57476630f02c268c87ae

  It was later rearranged in:

  
https://github.com/openstack/nova/commit/3c5e2b0e9fac985294a949852bb8c83d4ed77e04

  but this problem remains.

  If I comment out the import of nova.monkey_patch in
  nova/api/openstack/__init__.py the problem goes away.

  Seems that eventlet monkey-patching and uWSGI are not getting along
  for some reason...
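
  A quick way to see which case a given API process is in (a minimal sketch,
  assuming eventlet is importable in that process) is to ask eventlet whether
  the thread module has been monkey-patched, which is the condition this
  report points at:

    # Minimal check: has eventlet green-ified the threading module here?
    import threading

    import eventlet.patcher

    if eventlet.patcher.is_monkey_patched('thread'):
        print('threading is monkey-patched (green):', threading.current_thread)
    else:
        print('threading is native')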

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1825584/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1879878] Re: VM become Error after confirming resize with Error info CPUUnpinningInvalid on source node

2020-07-08 Thread sean mooney
http://paste.openstack.org/show/795679/
i was able to reproduce this once on master but not reliably yet,
so i'm moving this to confirmed.

we also have a downstream report of this on train
https://bugzilla.redhat.com/show_bug.cgi?id=1850400
i'll add that to the affected versions.

i am setting the importance to medium as this seems to be quite hard to
trigger: all but one of the 10-12 attempts i made failed to reproduce it,
so i think this will be hit rarely.

when this happens the vm is left in a running state on the target
host. stopping the vm and starting it restores it to an active state.

** Bug watch added: Red Hat Bugzilla #1850400
   https://bugzilla.redhat.com/show_bug.cgi?id=1850400

** Changed in: nova
   Importance: Undecided => Medium

** Changed in: nova
   Status: Incomplete => Confirmed

** Also affects: nova/train
   Importance: Undecided
   Status: New

** Also affects: nova/ussuri
   Importance: Undecided
   Status: New

** Changed in: nova/ussuri
   Status: New => Confirmed

** Changed in: nova/train
   Status: New => Confirmed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1879878

Title:
  VM become Error after confirming resize with Error info
  CPUUnpinningInvalid on source node

Status in OpenStack Compute (nova):
  Confirmed
Status in OpenStack Compute (nova) train series:
  Confirmed
Status in OpenStack Compute (nova) ussuri series:
  Confirmed

Bug description:
  Description
  ===

  In my environment, it can take some time to clean up the VM on the source
  node while confirming a resize.
  During the confirm-resize process, the periodic task
  update_available_resource may update resource usage at the same time.
  That can cause an error like:
  CPUUnpinningInvalid: CPU set to unpin [1, 2, 18, 17] must be a subset of
  pinned CPU set []
  during the confirm-resize process.
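
  For context, the invariant behind that error is simply that you can only
  unpin CPUs that are currently tracked as pinned; a simplified illustration
  (not nova's actual code) of why an empty pinned set fails:

    # Simplified illustration of the CPUUnpinningInvalid check.
    class Cell(object):
        def __init__(self, pinned_cpus):
            self.pinned_cpus = set(pinned_cpus)

        def unpin(self, cpus):
            cpus = set(cpus)
            if not cpus.issubset(self.pinned_cpus):
                raise ValueError(
                    "CPU set to unpin %s must be a subset of pinned CPU "
                    "set %s" % (sorted(cpus), sorted(self.pinned_cpus)))
            self.pinned_cpus -= cpus

    # If the periodic task has already refreshed usage and dropped the
    # pinning, the confirm-resize unpin hits an empty set and raises.
    Cell([]).unpin([1, 2, 17, 18])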

  
   

  Steps to reproduce
  ==
  * Set /etc/nova/nova.conf "update_resources_interval" to a small value, let's
say 30 seconds, on the compute nodes. This step will increase the probability
of hitting the error.

  * create a "dedicated" VM, the flavor can be
  ++--+
  | Property   | Value|
  ++--+
  | OS-FLV-DISABLED:disabled   | False|
  | OS-FLV-EXT-DATA:ephemeral  | 0|
  | disk   | 80   |
  | extra_specs| {"hw:cpu_policy": "dedicated"}   |
  | id | 2be0f830-c215-4018-a96a-bee3e60b5eb1 |
  | name   | 4vcpu.4mem.80ssd.0eph.numa   |
  | os-flavor-access:is_public | True |
  | ram| 4096 |
  | rxtx_factor| 1.0  |
  | swap   |  |
  | vcpus  | 4|
  ++--+

  * Resize the VM with a new flavor to another node.

  * Confirm resize. 
  Make sure it takes some time to undefine the vm on the source node; 30
seconds will make the failure inevitable.

  * Then you will see the ERROR notice on dashboard, And the VM become
  ERROR

  
  Expected result
  ===
  VM resized successfully, vm state is active

  
  Actual result
  =

  * VM become ERROR

  * On dashboard you can see this notice:
  Please try again later [Error: CPU set to unpin [1, 2, 18, 17] must be a 
subset of pinned CPU set []].


  Environment
  ===
  1. Exact version of OpenStack you are running.

    Newton version with patch https://review.opendev.org/#/c/641806/21
    I am sure it will also happen on newer versions with
    https://review.opendev.org/#/c/641806/21
    such as Train and Ussuri

  2. Which hypervisor did you use?
 Libvirt + KVM

  3. Which storage type did you use?
 local disk

  4. Which networking type did you use?
 Neutron with OpenVSwitch

  Logs & Configs
  ==

  ERROR log on source node
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager 
[req-364606bb-9fa6-41db-a20e-6df9ff779934 b0887a73f3c1441686bf78944ee284d0 
95262f1f45f14170b91cd8054bb36512 - - -] [instance: 
993138d6-4b80-4b19-81c1-a16dbc6e196c] Setting instance vm_state to ERROR
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 
993138d6-4b80-4b19-81c1-a16dbc6e196c] Traceback (most recent call last):
  2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 
993138d6-4b80-4b19-81c1-a16dbc6e196c]   File 
"/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6661, in 
_error_out_instance_on_

[Yahoo-eng-team] [Bug 1887377] [NEW] nova does not load-balance assignment of resources on a host based on availability of pci devices, hugepages or pcpus.

2020-07-13 Thread sean mooney
Public bug reported:

Nova has supported hugepages, cpu pinning and pci numa affinity for a very
long time. since their introduction the advice has always been to create
flavors that mimic your typical hardware topology, i.e. if all your compute
hosts have 2 numa nodes then you should create flavors that request 2 numa
nodes. for a long time operators have ignored this advice and continued to
create single numa node flavors, citing that after 5+ years of hardware
vendors working with VNF vendors to make their products numa aware, vnfs
often still do not optimize properly for a multi numa environment.

as a result many operators still deploy single numa vms, although that is
becoming less common over time. when you deploy a vm with a single numa
node today we more or less iterate over the host numa nodes in order and
assign the vm to the first numa node where it fits. on a host without
any pci devices whitelisted for openstack management this behaviour
results in numa nodes being filled linearly from numa 0 to numa n. that
means if a host had 100G of hugepages on both numa nodes 0 and 1 and you
scheduled 101 1G single numa vms to the host, 100 vms would spawn on numa 0
and 1 vm would spawn on numa node 1.

that means that the first 100 vms would all contend for cpu resources on
the first numa node while the last vm had all of the second numa node
to itself.

the correct behaviour would be for nova to round-robin assign the vms,
attempting to keep the resource availability balanced. this will
maximise performance for individual vms while pessimising the scheduling
of large vms on a host.

to this end a new numa balancing config option (unset, pack or spread)
should be added and we should sort numa nodes in descending (spread) or
ascending (pack) order based on pMEM, pCPUs, mempages and pci devices, in
that sequence.

in a future release, when numa is modelled in placement, this sorting will
need to be done in a weigher that sorts the allocation candidates based on
the same pack/spread criteria.

i am filing this as a bug rather than a feature as this will have a
significant impact for existing deployments that either expected
https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/reserve-numa-with-pci.html
to implement this logic already or do not follow our existing guidance on
creating flavors that align to the host topology.
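
A rough sketch of the proposed sorting (illustrative only; the cell attribute
names here are assumptions for the example, not nova's real field names):

    def sort_cells(host_cells, policy):
        # policy is the proposed config option: unset, 'pack' or 'spread'.
        if policy not in ('pack', 'spread'):
            return host_cells  # unset: keep today's behaviour
        spread = (policy == 'spread')
        # Python sorts are stable, so apply the least significant key first
        # and the most significant key (pMEM) last to combine them.
        cells = sorted(host_cells, key=lambda c: c.avail_pci_devices, reverse=spread)
        cells = sorted(cells, key=lambda c: c.avail_mempages, reverse=spread)
        cells = sorted(cells, key=lambda c: c.avail_pcpus, reverse=spread)
        cells = sorted(cells, key=lambda c: c.avail_pmem, reverse=spread)
        return cells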

** Affects: nova
 Importance: Undecided
 Assignee: sean mooney (sean-k-mooney)
 Status: New


** Tags: numa

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1887377

Title:
  nova does not load-balance assignment of resources on a host based on
  availability of pci devices, hugepages or pcpus.

Status in OpenStack Compute (nova):
  New

Bug description:
  Nova has supported hugepages, cpu pinning and pci numa affinity for a very
  long time. since their introduction the advice has always been to create
  flavors that mimic your typical hardware topology, i.e. if all your compute
  hosts have 2 numa nodes then you should create flavors that request 2 numa
  nodes. for a long time operators have ignored this advice and continued to
  create single numa node flavors, citing that after 5+ years of hardware
  vendors working with VNF vendors to make their products numa aware, vnfs
  often still do not optimize properly for a multi numa environment.

  as a result many operators still deploy single numa vms, although that
  is becoming less common over time. when you deploy a vm with a single
  numa node today we more or less iterate over the host numa nodes in
  order and assign the vm to the first numa node where it fits. on a
  host without any pci devices whitelisted for openstack management this
  behaviour results in numa nodes being filled linearly from numa 0 to
  numa n. that means if a host had 100G of hugepages on both numa nodes 0
  and 1 and you scheduled 101 1G single numa vms to the host, 100 vms
  would spawn on numa 0 and 1 vm would spawn on numa node 1.

  that means that the first 100 vms would all contend for cpu resources
  on the first numa node while the last vm had all of the second numa
  node to itself.

  the correct behaviour would be for nova to round-robin assign the vms,
  attempting to keep the resource availability balanced. this will
  maximise performance for individual vms while pessimising the
  scheduling of large vms on a host.

  to this end a new numa balancing config option (unset, pack or spread)
  should be added and we should sort numa nodes in descending (spread) or
  ascending (pack) order based on pMEM, pCPUs, mempages and pci devices,
  in that sequence.

  in a future release, when numa is modelled in placement, this sorting
  will need to be done in a weigher that sorts the allocation candidates
  based on the same pack/spread criteria.

  i am filing this as a bug rather than a feature as this will have a
  significant impact for existing deployments that either

[Yahoo-eng-team] [Bug 1885558] Re: sriov: instance with macvtap vnic_type live migration failed

2020-07-27 Thread sean mooney
** Changed in: nova
   Importance: Undecided => Medium

** Also affects: nova/train
   Importance: Undecided
   Status: New

** Also affects: nova/ussuri
   Importance: Undecided
   Status: New

** Changed in: nova/train
   Importance: Undecided => Medium

** Changed in: nova/train
   Status: New => Triaged

** Changed in: nova/ussuri
   Importance: Undecided => Medium

** Changed in: nova/ussuri
   Status: New => Triaged

** Tags added: live-migration pci

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1885558

Title:
  sriov:  instance with macvtap vnic_type live migration failed

Status in OpenStack Compute (nova):
  In Progress
Status in OpenStack Compute (nova) train series:
  Triaged
Status in OpenStack Compute (nova) ussuri series:
  Triaged

Bug description:

  Instance with the vnic_type macvtap port live migration failed.

  My env configuration follow the document:
  https://docs.openstack.org/neutron/latest/admin/config-sriov.html

  
  # VFs on source comptue
  84:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ 
Network Connection (rev 01)
  84:10.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
  84:10.3 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
  84:10.5 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
  84:10.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
  84:11.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
  84:11.3 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
  84:11.5 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
  84:11.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)

  # VFs on dest compute
  81:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ 
Network Connection (rev 01)
  81:10.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
  81:10.3 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
  81:10.5 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
  81:10.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
  81:11.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
  81:11.3 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
  81:11.5 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
  81:11.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)


  # port create CLI
  openstack port create --network $net_id --vnic-type macvtap macvtap01

  # boot instance with macvtap port
  nova boot --flavor 2C4G --nic port-id=$(neutron port-show macvtap01 -f value 
-c id) --block-device 
source=image,id=${image},dest=volume,size=100,shutdown=preserve,bootindex=0 
--availability-zone nova:ctrl01.srvio.dev vm01_ctrl_macvtap

  # live migration failed

  nova live-migration vm01_ctrl_macvtap

  Source compute node log:

  /var/log/nova-compute.log
  2020-06-28 03:51:52.446 806489 DEBUG nova.virt.libvirt.vif [-] 
vif_type=hw_veb 
instance=Instance(access_ip_v4=None,access_ip_v6=None,architecture=None,auto_disk_config=False,availability_zone='nova',cell_name=None,cleaned=False,config_drive='True',created_at=2020-06-23T03:35:55Z,default_ephemeral_device=None,default_swap_device=None,deleted=False,deleted_at=None,device_metadata=,disable_terminate=False,display_description=None,display_name='vm01_ctrl_macvtap',ec2_ids=,ephemeral_gb=0,ephemeral_key_uuid=None,fault=,flavor=Flavor(1),host='ctrl01.srvio.dev',hostname='vm01-ctrl-macvtap',id=91,image_ref='',info_cache=InstanceInfoCache,instance_type_id=1,kernel_id='',key_data=None,key_name=None,keypairs=,launch_index=0,launched_at=2020-06-23T03:36:15Z,launched_on='ctrl01.srvio.dev',locked=False,locked_by=None,memory_mb=4096,metadata={},migration_context=,new_flavor=None,node='ctrl01.srvio.dev',numa_topology=None,old_flavor=None,os_type=None,pci_devices=,pci_requests=InstancePCIRequests,power_state=1,progress=0,project_id='b36d8472f55e4fe88f8af98fe2c0ad8c',ramdisk_id='',reservation_id='r-j7a6v3fv',root_device_name='/dev/vda',root_gb=50,security_groups=SecurityGroupList,services=,shutdown_terminate=False,system_metadata={boot_roles='reader,member,admin',image_base_image_ref='',image_container_format='bare',image_disk_format='raw',image_hw_qemu_guest_agent='yes',image_min_disk='50',image_min_ram='0',owner_project_name='admin',owner_user_name='admin'},tags=,task_state='migrating',terminat

[Yahoo-eng-team] [Bug 1889633] Re: Pinned instance with thread policy can consume VCPU

2020-07-30 Thread sean mooney
this has a significant upgrade impact so i think this is important to fix and
backport.
i have reproduced this locally too so i am moving it to triaged.

** Changed in: nova
   Importance: Undecided => High

** Changed in: nova
   Status: New => Triaged

** Also affects: nova/ussuri
   Importance: Undecided
   Status: New

** Also affects: nova/train
   Importance: Undecided
   Status: New

** Changed in: nova/train
   Importance: Undecided => High

** Changed in: nova/train
   Status: New => Triaged

** Changed in: nova/ussuri
   Importance: Undecided => High

** Changed in: nova/ussuri
   Status: New => Triaged

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1889633

Title:
  Pinned instance with thread policy can consume VCPU

Status in OpenStack Compute (nova):
  Triaged
Status in OpenStack Compute (nova) train series:
  Triaged
Status in OpenStack Compute (nova) ussuri series:
  Triaged

Bug description:
  In Train, we introduced the concept of the 'PCPU' resource type to
  track pinned instance CPU usage. The '[compute] cpu_dedicated_set' is
  used to indicate which host cores should be used by pinned instances
  and, once this config option was set, nova would start reporting
  'PCPU' resource types in addition to (or entirely instead of, if
  'cpu_shared_set' was unset) 'VCPU'. Requests for pinned instances (via
  the 'hw:cpu_policy=dedicated' flavor extra spec or equivalent image
  metadata property) would result in a query for 'PCPU' inventory rather
  than 'VCPU', as previously done.

  We anticipated some upgrade issues with this change, whereby there
  could be a period during an upgrade in which some hosts would have the
  new configuration, meaning they'd be reporting PCPU, but the remainder
  would still be on legacy config and therefore would continue reporting
  just VCPU. An instance could be reasonably expected to land on any
  host, but since only the hosts with the new configuration were
  reporting 'PCPU' inventory and the 'hw:cpu_policy=dedicated' extra
  spec was resulting in a request for 'PCPU', the hosts with legacy
  configuration would never be consumed.

  We worked around this issue by adding support for a fallback placement
  query, enabled by default, which would make a second request using
  'VCPU' inventory instead of 'PCPU'. The idea behind this was that the
  hosts with 'PCPU' inventory would be preferred, meaning we'd only try
  the 'VCPU' allocation if the preferred path failed. Crucially, we
  anticipated that if a host with new style configuration was picked up
  by this second 'VCPU' query, an instance would never actually be able
  to build there. This is because the new-style configuration would be
  reflected in the 'numa_topology' blob of the 'ComputeNode' object,
  specifically via the 'cpuset' (for cores allocated to 'VCPU') and
  'pcpuset' (for cores allocated to 'PCPU') fields. With new-style
  configuration, both of these are set to unique values. If the
  scheduler had determined that there wasn't enough 'PCPU' inventory
  available for the instance, that would implicitly mean there weren't
  enough of the cores listed in the 'pcpuset' field still available.

  Turns out there's a gap in this thinking: thread policies. The
  'isolate' CPU thread policy previously meant "give me a host with no
  hyperthreads, else a host with hyperthreads but mark the thread
  siblings of the cores used by the instance as reserved". This didn't
  translate to a new 'PCPU' world where we needed to know how many cores
  we were consuming up front before landing on the host. To work around
  this, we removed support for the latter case and instead relied on a
  trait, 'HW_CPU_HYPERTHREADING', to indicate whether a host had
  hyperthread support or not. Using the 'isolate' policy meant that
  trait could not be defined on the host, or the trait was "forbidden".
  The gap comes via a combination of this trait request and the fallback
  query. If we request the isolate thread policy, hosts with new-style
  configuration and sufficient PCPU inventory would nonetheless be
  rejected if they reported the 'HW_CPU_HYPERTHREADING' trait. However,
  these could get picked up in the fallback query and the instance would
  not fail to build on the host because of lack of 'PCPU' inventory.
  This means we end up with a pinned instance on a host using new-style
  configuration that is consuming 'VCPU' inventory. Boo.
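
  Roughly, the two queries described above can be pictured like this (a
  simplified sketch of the placement request parameters only; the real
  requests also carry memory, disk and other resources):

    # Simplified sketch of the primary and fallback allocation candidate
    # queries for a pinned instance with hw:cpu_thread_policy=isolate.
    def build_queries(vcpus, isolate=True):
        primary = {'resources': 'PCPU:%d' % vcpus}
        if isolate:
            # 'isolate' is modelled as a forbidden trait on the primary query.
            primary['required'] = '!HW_CPU_HYPERTHREADING'
        # The fallback re-asks using VCPU, which is how a host with new-style
        # configuration and the hyperthreading trait can still be selected.
        fallback = {'resources': 'VCPU:%d' % vcpus}
        return primary, fallback

    print(build_queries(4))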

  # Steps to reproduce

  1. Using a host with hyperthreading support enabled, configure both
  '[compute] cpu_dedicated_set' and '[compute] cpu_shared_set'

  2. Boot an instance with the 'hw:cpu_thread_policy=isolate' extra
  spec.

  # Expected result

  Instance should not boot since the host has hyperthreads.

  # Actual result

  Instance boots.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1889633/+

[Yahoo-eng-team] [Bug 1883671] Re: [SRIOV] When a VF is bound to a VM, Nova can't retrieve the PCI info

2020-08-06 Thread sean mooney
reading the nic feature flags was introduced in pike
https://github.com/openstack/nova/commit/e6829f872aca03af6181557260637c8b601e476a

but this only seems to happen on modern versions of libvirt so i am setting
the older branches to won't fix. it can be backported if someone hits the
issue and cares to do so.

** Also affects: nova/pike
   Importance: Undecided
   Status: New

** Also affects: nova/queens
   Importance: Undecided
   Status: New

** Also affects: nova/train
   Importance: Undecided
   Status: New

** Also affects: nova/stein
   Importance: Undecided
   Status: New

** Also affects: nova/ussuri
   Importance: Undecided
   Status: New

** Also affects: nova/rocky
   Importance: Undecided
   Status: New

** Changed in: nova/pike
   Status: New => Won't Fix

** Changed in: nova/queens
   Status: New => Won't Fix

** Changed in: nova/rocky
   Status: New => Won't Fix

** Changed in: nova/stein
   Status: New => Triaged

** Changed in: nova/stein
   Status: Triaged => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1883671

Title:
  [SRIOV] When a VF is bound to a VM, Nova can't retrieve the PCI info

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) pike series:
  Won't Fix
Status in OpenStack Compute (nova) queens series:
  Won't Fix
Status in OpenStack Compute (nova) rocky series:
  Won't Fix
Status in OpenStack Compute (nova) stein series:
  Won't Fix
Status in OpenStack Compute (nova) train series:
  Triaged
Status in OpenStack Compute (nova) ussuri series:
  Triaged

Bug description:
  Nova periodically updates the available resources per hypervisor [1].
  That implies the reporting of the PCI devices [2]->[3].

  In [4], a new feature was introduced to read from libvirt the NIC
  capabilities (gso, tso, tx, etc.). But when the NIC interface is bound
  to the VM and the MAC address is not the one assigned by the driver
  (Nova changes the MAC address according to the info provided by
  Neutron), libvirt fails reading the non-existing device:
  http://paste.openstack.org/show/794799/.

  This command should be avoided or, at least, if the execution fails
  the exception could be hidden.
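
  A minimal sketch of the second option (the exact call site and handling here
  are assumptions for illustration, not the actual patch):

    # Treat "cannot read the VF's capabilities" as "no capabilities to
    # report" instead of letting the periodic update blow up.
    import libvirt

    def get_nic_capabilities_xml(node_dev):
        # node_dev is a libvirt virNodeDevice for the VF's network device;
        # reading it can fail once the VF has been handed to a guest.
        try:
            return node_dev.XMLDesc(0)
        except libvirt.libvirtError:
            return None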

  
  [1]https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L9642
  
[2]https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L6980
  
[3]https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L6898
  [4]Ia5b6abbbf4e5f762e0df04167c32c6135781d305

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1883671/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1888395] Re: shared live migration of a vm with a vif is broken in train

2020-08-19 Thread sean mooney
** Also affects: networking-opencontrail
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1888395

Title:
  shared live migration of a vm with a vif is broken in train

Status in networking-opencontrail:
  New
Status in OpenStack Compute (nova):
  Incomplete

Bug description:
  it was working in queens but fails in train. nova compute at the
  target aborts with the exception:

  Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 
165, in _process_incoming
  res = self.dispatcher.dispatch(message)
File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", 
line 274, in dispatch
  return self._do_dispatch(endpoint, method, ctxt, args)
File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", 
line 194, in _do_dispatch
  result = func(ctxt, **new_args)
File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 79, 
in wrapped
  function_name, call_dict, binary, tb)
File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, 
in __exit__
  self.force_reraise()
File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, 
in force_reraise
  six.reraise(self.type_, self.value, self.tb)
File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 69, 
in wrapped
  return f(self, context, *args, **kw)
File "/usr/lib/python2.7/site-packages/nova/compute/utils.py", line 1372, 
in decorated_function
  return function(self, context, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 219, 
in decorated_function
  kwargs['instance'], e, sys.exc_info())
File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, 
in __exit__self.force_reraise()
File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, 
in force_reraise
  six.reraise(self.type_, self.value, self.tb)  File 
"/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 207, in 
decorated_function
  return function(self, context, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7007, 
in pre_live_migration
  bdm.save()
File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, 
in __exit__
  self.force_reraise()
File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, 
in force_reraise
  six.reraise(self.type_, self.value, self.tb)
File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6972, 
in pre_live_migration
  migrate_data)
File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 
9190, in pre_live_migration
  instance, network_info, migrate_data)
File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 
9071, in _pre_live_migration_plug_vifs
  vif_plug_nw_info.append(migrate_vif.get_dest_vif())
File "/usr/lib/python2.7/site-packages/nova/objects/migrate_data.py", line 
90, in get_dest_vif
  vif['type'] = self.vif_type
File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 
67, in getter
  self.obj_load_attr(name)
File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 
603, in obj_load_attr
  _("Cannot load '%s' in the base class") % attrname)
  NotImplementedError: Cannot load 'vif_type' in the base class

  
  steps to reproduce:
  - train centos 7 based deployment: 1 controller, 2 computes, libvirt + 
qemu-kvm, ceph shared storage, neutron with contrail vrouter virtual network;
  - create and start a vm;
  - live migrate it between computes.

  expected result: vm migrates successfully.

  
  rpm -qa | grep nova:

  python2-novaclient-15.1.1-1.el7.noarch
  openstack-nova-common-20.3.0-1.el7.noarch
  python2-nova-20.3.0-1.el7.noarch
  openstack-nova-compute-20.3.0-1.el7.noarch

To manage notifications about this bug go to:
https://bugs.launchpad.net/networking-opencontrail/+bug/1888395/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1888395] Re: shared live migration of a vm with a vif is broken in train

2020-08-20 Thread sean mooney
moving this to triaged and setting this to high
the regression was introduced in train by
https://opendev.org/openstack/nova/commit/fd8fdc934530fb49497bc6deaa72adfa51c8783a
specifically 
https://github.com/openstack/nova/blob/b8ca3ce31ca15ddaa18512271c2de76835f908bb/nova/compute/manager.py#L7654-L7656

adding

  migrate_data.vifs = \
      migrate_data_obj.VIFMigrateData.create_skeleton_migrate_vifs(
          instance.get_network_info())

unconditionally activates the code path that requires multiple port bindings,
as when support for multiple port bindings was added in rocky it used
migrate_data.vifs as a sentinel for the new workflow,
e.g. if it is populated the new migration workflow should be used.

  migrate_data.vifs = \
      migrate_data_obj.VIFMigrateData.create_skeleton_migrate_vifs(
          instance.get_network_info())

should be

  if self.network_api.supports_port_binding_extension(ctxt):
      migrate_data.vifs = \
          migrate_data_obj.VIFMigrateData.create_skeleton_migrate_vifs(
              instance.get_network_info())

this bug prevents live migration with any neutron backend that does not
support multiple port bindings from train onwards, so i am setting this to
high.

** Changed in: nova
   Importance: Undecided => High

** Changed in: nova
   Status: Incomplete => Triaged

** Also affects: nova/train
   Importance: Undecided
   Status: New

** Also affects: nova/ussuri
   Importance: Undecided
   Status: New

** Changed in: nova/train
   Status: New => Triaged

** Changed in: nova/train
   Importance: Undecided => High

** Changed in: nova/ussuri
   Status: New => Triaged

** Changed in: nova/ussuri
   Importance: Undecided => High

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1888395

Title:
  shared live migration of a vm with a vif is broken in train

Status in networking-opencontrail:
  New
Status in OpenStack Compute (nova):
  Triaged
Status in OpenStack Compute (nova) train series:
  Triaged
Status in OpenStack Compute (nova) ussuri series:
  Triaged

Bug description:
  it was working in queens but fails in train. nova compute at the
  target aborts with the exception:

  Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 
165, in _process_incoming
  res = self.dispatcher.dispatch(message)
File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", 
line 274, in dispatch
  return self._do_dispatch(endpoint, method, ctxt, args)
File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", 
line 194, in _do_dispatch
  result = func(ctxt, **new_args)
File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 79, 
in wrapped
  function_name, call_dict, binary, tb)
File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, 
in __exit__
  self.force_reraise()
File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, 
in force_reraise
  six.reraise(self.type_, self.value, self.tb)
File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 69, 
in wrapped
  return f(self, context, *args, **kw)
File "/usr/lib/python2.7/site-packages/nova/compute/utils.py", line 1372, 
in decorated_function
  return function(self, context, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 219, 
in decorated_function
  kwargs['instance'], e, sys.exc_info())
File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, 
in __exit__self.force_reraise()
File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, 
in force_reraise
  six.reraise(self.type_, self.value, self.tb)  File 
"/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 207, in 
decorated_function
  return function(self, context, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7007, 
in pre_live_migration
  bdm.save()
File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, 
in __exit__
  self.force_reraise()
File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, 
in force_reraise
  six.reraise(self.type_, self.value, self.tb)
File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6972, 
in pre_live_migration
  migrate_data)
File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 
9190, in pre_live_migration
  instance, network_info, migrate_data)
File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 
9071, in _pre_live_migration_plug_vifs
  vif_plug_nw_info.append(migrate_vif.get_dest_vif())
File "/usr/lib/python2.7/site-packages/nova/objects/migrate_data.py", line 
90, in get_dest_vif
  vif['type']

[Yahoo-eng-team] [Bug 1893121] [NEW] nova does not balance vms across numa nodes or prefer the numa node with a pci device when one is requested

2020-08-26 Thread sean mooney
Public bug reported:

the current implementation of numa has evolved over the years to support
pci affinity policies and numa affinity for other devices like pmem.

when numa was first introduced the recommendation was to match the virtual
numa topology of a guest to the numa topology of the host for best
performance.

in such a configuration the guest cpus and memory are evenly distributed
across the host numa nodes, meaning that the memory controllers and
physical cpus are consumed evenly, i.e. all vms do not use the cores from
one host numa node.

if you create a vm with only hw:numa_nodes set and no other numa
requests, however, then due to how we currently iterate over host numa
cells in a deterministic order all vms will be placed on numa node 0.

if other vms also request numa resources like pinned cpus
(hw:cpu_policy=dedicated) or an explicit page size (hw:mem_page_size),
then the consumption of those resources will eventually
cause those vms to load-balance onto the other numa nodes.

as a result the current behaviour is to fill the first numa node before
ever using resources from the rest for numa vms using cpu pinning or
hugepages, but numa vms that only request hw:numa_nodes won't be
load-balanced.

in both cases this is suboptimal as it results in lower utilisation of the
host hardware: the second and subsequent numa nodes will not be used until
the first numa node is full when using pinning and huge pages, and will
never be used for numa instances that don't request other numa resources.

in a similar vein

https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/reserve-numa-with-pci.html
partly implemented a preferential sorting of hosts with pci devices.

if the vm did not request a pci device we weight host numa cells with pci
devices lower than cells without them.

https://github.com/openstack/nova/blob/20459e3e88cb8382d450c7fdb042e2016d5560c5/nova/virt/hardware.py#L2268-L2275


a full implementation would, on the selected host, also prefer putting the
vm on the numa nodes that have a pci device.

as a result, if a host has 2 numa nodes and the vm requests a pci device
and 1 numa node, and the vm will fit on the first numa node (node0), then
with the preferred policy for pci affinity we won't check or use the
second numa node (node1).


the fix for this is trivial: add an else clause

    # If PCI device(s) are not required, prefer host cells that don't have
    # devices attached. Presence of a given numa_node in a PCI pool is
    # indicative of a PCI device being associated with that node
    if not pci_requests and pci_stats:
        # TODO(stephenfin): pci_stats can't be None here but mypy can't figure
        # that out for some reason
        host_cells = sorted(host_cells, key=lambda cell: cell.id in [
            pool['numa_node'] for pool in pci_stats.pools])  # type: ignore

becomes

    # If PCI device(s) are not required, prefer host cells that don't have
    # devices attached. Presence of a given numa_node in a PCI pool is
    # indicative of a PCI device being associated with that node
    if not pci_requests and pci_stats:
        # TODO(stephenfin): pci_stats can't be None here but mypy can't figure
        # that out for some reason
        host_cells = sorted(host_cells, key=lambda cell: cell.id in [
            pool['numa_node'] for pool in pci_stats.pools])  # type: ignore
    elif pci_stats:
        # a PCI device was requested: prefer the cells that do have devices.
        host_cells = sorted(host_cells, key=lambda cell: cell.id in [
            pool['numa_node'] for pool in pci_stats.pools], reverse=True)
 
or more compactly


    # Presence of a given numa_node in a PCI pool is indicative of a PCI
    # device being associated with that node. If no PCI device is requested,
    # prefer host cells without devices attached; if one is requested,
    # prefer the cells that do have devices.
    if pci_stats:
        reverse = bool(pci_requests)
        host_cells = sorted(host_cells, key=lambda cell: cell.id in [
            pool['numa_node'] for pool in pci_stats.pools], reverse=reverse)


since python supports stable sort ordering, complex sorts can be achieved
by chaining multiple stable sorts

https://docs.python.org/3/howto/sorting.html#sort-stability-and-complex-sorts

so we can also address the numa balancing issue by
first sorting by instances per numa node,
then sorting by free memory per numa node,
then by cpus per numa node and finally
by pci devices per numa node.
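
a rough sketch of that chained sort (the attribute names are assumptions for
illustration, not nova's real fields):

    # Python's sort is stable, so each later sort becomes the more significant
    # key and the earlier ordering survives only as a tie-break among equal
    # keys; this follows the ordering listed above literally, but which key
    # should dominate is a design choice for the actual fix.
    def balance_cells(host_cells):
        cells = sorted(host_cells, key=lambda c: c.num_instances)
        cells = sorted(cells, key=lambda c: c.free_memory_mb, reverse=True)
        cells = sorted(cells, key=lambda c: c.free_cpus, reverse=True)
        cells = sorted(cells, key=lambda c: c.num_pci_devices, reverse=True)
        return cells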

this will allow nova to evenly distribute vms across numa nodes and
also fully support the preference aspect of the preferred sriov numa
affinity policy, which currently only selects a host that is capable of
providing numa affinity but does not actually prefer the numa node when
we boot the vm.

this bug applies to all currently supported releases of nova.

** Affects: nova
 Importance: Undecided
 Assignee: sean mooney (sean-k-mooney)
 Status: Confirmed


** T

[Yahoo-eng-team] [Bug 1893148] [NEW] libvirt.libvirtError: Domain not found: no domain with matching uuid

2020-08-26 Thread sean mooney
Public bug reported:


seen in upstream ci in the grenade multi-node job

as part of a live migration.
in this case the error happens on the destination host.

Traceback (most recent call last):

   File "/opt/stack/old/nova/nova/virt/libvirt/host.py", line 605, in
_get_domain

 return conn.lookupByUUIDString(instance.uuid)

   File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line
190, in doit

 result = proxy_call(self._autowrap, f, *args, **kwargs)

   File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line
148, in proxy_call

 rv = execute(f, *args, **kwargs)

   File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line
129, in execute

 six.reraise(c, e, tb)

   File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in
reraise

 raise value

   File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line
83, in tworker

 rv = meth(*args, **kwargs)

   File "/usr/local/lib/python3.6/dist-packages/libvirt.py", line 4151,
in lookupByUUIDString

 if ret is None:raise libvirtError('virDomainLookupByUUIDString()
failed', conn=self)

 libvirt.libvirtError: Domain not found: no domain with matching uuid
'386113bf-cca1-438a-9ab5-4714c147bbfc'


 During handling of the above exception, another exception occurred:


 Traceback (most recent call last):

   File "/opt/stack/old/nova/nova/compute/manager.py", line 8005, in
_do_pre_live_migration_from_source

 instance, block_device_info=block_device_info)

   File "/opt/stack/old/nova/nova/virt/libvirt/driver.py", line 9934, in
get_instance_disk_info

 self._get_instance_disk_info(instance, block_device_info))

   File "/opt/stack/old/nova/nova/virt/libvirt/driver.py", line 9915, in
_get_instance_disk_info

 guest = self._host.get_guest(instance)

   File "/opt/stack/old/nova/nova/virt/libvirt/host.py", line 589, in
get_guest

 return libvirt_guest.Guest(self._get_domain(instance))

   File "/opt/stack/old/nova/nova/virt/libvirt/host.py", line 609, in
_get_domain

 raise exception.InstanceNotFound(instance_id=instance.uuid)

 nova.exception.InstanceNotFound: Instance 386113bf-cca1-438a-
9ab5-4714c147bbfc could not be found.


this seems similar to
https://bugs.launchpad.net/nova/+bug/1662626 but it's not really clear why this
fails

see 
https://zuul.opendev.org/t/openstack/build/cfda29fa579544e481c803c4c5de51fb/log/logs/subnode-2/screen-n-cpu.txt#9697-9729
of https://zuul.opendev.org/t/openstack/build/cfda29fa579544e481c803c4c5de51fb/ 
for full logs

** Affects: nova
 Importance: Medium
 Status: Triaged


** Tags: libvirt live-migration

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1893148

Title:
  libvirt.libvirtError: Domain not found: no domain with matching uuid

Status in OpenStack Compute (nova):
  Triaged

Bug description:
  
  seen in upstream ci in the grenade multi-node job

  as part of a live migration.
  in this case the error happens on the destination host.

  Traceback (most recent call last):

 File "/opt/stack/old/nova/nova/virt/libvirt/host.py", line 605, in
  _get_domain

   return conn.lookupByUUIDString(instance.uuid)

 File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py",
  line 190, in doit

   result = proxy_call(self._autowrap, f, *args, **kwargs)

 File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py",
  line 148, in proxy_call

   rv = execute(f, *args, **kwargs)

 File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py",
  line 129, in execute

   six.reraise(c, e, tb)

 File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in
  reraise

   raise value

 File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py",
  line 83, in tworker

   rv = meth(*args, **kwargs)

 File "/usr/local/lib/python3.6/dist-packages/libvirt.py", line
  4151, in lookupByUUIDString

   if ret is None:raise libvirtError('virDomainLookupByUUIDString()
  failed', conn=self)

   libvirt.libvirtError: Domain not found: no domain with matching uuid
  '386113bf-cca1-438a-9ab5-4714c147bbfc'


   During handling of the above exception, another exception occurred:

  
   Traceback (most recent call last):

 File "/opt/stack/old/nova/nova/compute/manager.py", line 8005, in
  _do_pre_live_migration_from_source

   instance, block_device_info=block_device_info)

 File "/opt/stack/old/nova/nova/virt/libvirt/driver.py", line 9934,
  in get_instance_disk_info

   self._get_instance_disk_info(instance, block_device_info))

 File "/opt/stack/old/nova/nova/virt/libvirt/driver.py", line 9915,
  in _get_instance_disk_info

   guest = self._host.get_guest(instance)

 File "/opt/stack/old/nova/nova/virt/libvirt/host.py", line 589, in
  get_guest

   return libvirt_guest.Guest(self._get_domain(instance))

 File "/o

[Yahoo-eng-team] [Bug 1895063] Re: Allow rescue volume backed instance

2020-09-10 Thread sean mooney
This is a feature, not a bug.

there is already a blueprint open for this so i am marking this as invalid.

https://blueprints.launchpad.net/nova/+spec/volume-backed-server-rebuild


** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1895063

Title:
  Allow rescue volume backed instance

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Should we offer support for volume backed instances?

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1895063/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1892361] Re: SRIOV instance gets type-PF interface, libvirt kvm fails

2020-09-11 Thread sean mooney
** Also affects: nova/queens
   Importance: Undecided
   Status: New

** Also affects: nova/ussuri
   Importance: Undecided
   Status: New

** Also affects: nova/rocky
   Importance: Undecided
   Status: New

** Also affects: nova/train
   Importance: Undecided
   Status: New

** Also affects: nova/stein
   Importance: Undecided
   Status: New

** Changed in: nova/queens
   Status: New => Confirmed

** Changed in: nova/queens
   Importance: Undecided => Medium

** Changed in: nova/rocky
   Importance: Undecided => Medium

** Changed in: nova/rocky
   Status: New => Triaged

** Changed in: nova/queens
   Status: Confirmed => Triaged

** Changed in: nova/stein
   Importance: Undecided => Medium

** Changed in: nova/stein
   Status: New => Triaged

** Changed in: nova/train
   Importance: Undecided => Medium

** Changed in: nova/train
   Status: New => Triaged

** Changed in: nova/ussuri
   Importance: Undecided => Medium

** Changed in: nova/ussuri
   Status: New => Triaged

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1892361

Title:
  SRIOV instance gets type-PF interface, libvirt kvm fails

Status in OpenStack Compute (nova):
  In Progress
Status in OpenStack Compute (nova) queens series:
  Triaged
Status in OpenStack Compute (nova) rocky series:
  Triaged
Status in OpenStack Compute (nova) stein series:
  Triaged
Status in OpenStack Compute (nova) train series:
  Triaged
Status in OpenStack Compute (nova) ussuri series:
  Triaged

Bug description:
  When spawning an SR-IOV enabled instance on a newly deployed host,
  nova attempts to spawn it with an type-PF pci device. This fails with
  the below stack trace.

  After restarting neutron-sriov-agent and nova-compute services on the
  compute node and spawning an SR-IOV instance again, a type-VF pci
  device is selected, and instance spawning succeeds.

  Stack trace:
  2020-08-20 08:29:09.558 7624 DEBUG oslo_messaging._drivers.amqpdriver [-] 
received reply msg_id: 6db8011e6ecd4fd0aaa53c8f89f08b1b __call__ 
/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:400
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager 
[req-e3e49d07-24c6-4c62-916e-f830f70983a2 ddcfb3640535428798aa3c8545362bd4 
dd99e7950a5b46b5b924ccd1720b6257 - 015e4fd7db304665ab5378caa691bb8b 
015e4fd7db304665ab5378caa691bb8b] [insta
  nce: 9498ea75-fe88-4020-9a9e-f4c437c6de11] Instance failed to spawn: 
libvirtError: unsupported configuration: Interface type hostdev is currently 
supported on SR-IOV Virtual Functions only
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 
9498ea75-fe88-4020-9a9e-f4c437c6de11] Traceback (most recent call last):
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 
9498ea75-fe88-4020-9a9e-f4c437c6de11]   File 
"/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2274, in 
_build_resources
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 
9498ea75-fe88-4020-9a9e-f4c437c6de11] yield resources
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 
9498ea75-fe88-4020-9a9e-f4c437c6de11]   File 
"/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2054, in 
_build_and_run_instance
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 
9498ea75-fe88-4020-9a9e-f4c437c6de11] block_device_info=block_device_info)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 
9498ea75-fe88-4020-9a9e-f4c437c6de11]   File 
"/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 3147, in 
spawn
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 
9498ea75-fe88-4020-9a9e-f4c437c6de11] destroy_disks_on_failure=True)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 
9498ea75-fe88-4020-9a9e-f4c437c6de11]   File 
"/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5651, in 
_create_domain_and_network
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 
9498ea75-fe88-4020-9a9e-f4c437c6de11] destroy_disks_on_failure)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 
9498ea75-fe88-4020-9a9e-f4c437c6de11]   File 
"/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 
9498ea75-fe88-4020-9a9e-f4c437c6de11] self.force_reraise()
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 
9498ea75-fe88-4020-9a9e-f4c437c6de11]   File 
"/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in 
force_reraise
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 
9498ea75-fe88-4020-9a9e-f4c437c6de11] six.reraise(self.type_, self.value, 
self.tb)
  2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 
9498ea75-fe8

[Yahoo-eng-team] [Bug 1896226] Re: The vnics are disappearing in the vm

2020-09-18 Thread sean mooney
** Also affects: neutron
   Importance: Undecided
   Status: New

** Tags added: libvirt neutron ovs

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1896226

Title:
  The vnics are disappearing in the vm

Status in neutron:
  New
Status in OpenStack Compute (nova):
  New

Bug description:
  Hi,

  We have a rocky OSA setup of branch 18.1.9.

  When we create a vm from a particular image, the vm comes up with two of
  its four vnics missing; the four vnics are provisioned from four dhcp
  tenant networks. The plugin is OVS and the firewall driver is conntrack.
  It was working earlier, until we noticed this issue.

  If we reboot the VM, the vnics appear for a short time and then, after a
  few seconds, they disappear again. One of the disappearing vnics inside
  the vm has a floating ip associated with it, and hence the vm becomes
  unpingable. If we reboot the vm again, it comes up for a short time and
  then vanishes again. At the moment the pinging of the vm stops, we do
  not notice any messages in the neutron server logs for that port, except
  when we try to manually do a vm reboot:


  LOGS
  -
  2020-09-18 08:41:50.197 17808 INFO neutron.wsgi 
[req-ff4ad70c-568d-4625-9bea-2a472351d00a 
dae4d1b704943b11cb10287e984f9367070915c18f7cadd48f915af92d4b4d03 
35391b98793b4c09bf87c91006d123c2 - f7834cb0083b4f8f81184b6595b46b34 
f7834cb0083b4f8f81184b6595b46b34] 172.29.236.183,172.29.236.21 "GET 
/v2.0/floatingips?tenant_id=35391b98793b4c09bf87c91006d123c2&port_id=095edf83-2d8d-494d-b820-ef7540aefa7c&port_id=0da3066c-a0f5-49df-a10a-20919595a5b8&port_id=298916ee-91a2-428f-86aa-c5ed5f034563&port_id=3e6d29fe-7bee-47d4-bc98-00d934fc5764&port_id=6d456cbe-425a-4d17-86f6-2b77ab88a42f&port_id=853adf00-92ac--8487-32a77e3efb66&port_id=86ec85de-609a-403d-b027-3097ac597e0c&port_id=86f1034d-837d-4e67-ad5e-63d9642a0b2a&port_id=947c5610-b0f4-439a-abf8-51ba3dc8d212&port_id=b00f8eae-18fa-44c4-92e6-9ee75c7c599c&port_id=b8ba1a0c-ebd9-4278-bc83-9b60e1036f63&port_id=c6ce1a3c-b8b3-493d-b052-2810efacbf5e&port_id=c7facefa-565f-46be-8048-a333505ee177&port_id=d5e1fc5f-1a84-4366-bf36-548f2bdc0366
 HTTP/1.1" status: 200  len: 4363 time: 0.0851929
  2020-09-18 08:41:50.585 17814 INFO neutron.wsgi 
[req-349c0d49-9671-43a4-a541-f04facff2ee7 c42abde21dee4c848dc653df8ec429aa 
e02428b1700247b98ad1d563133f6174 - default default] 172.29.236.57,172.29.236.21 
"GET 
/v2.0/floatingips?fixed_ip_address=172.16.1.19&port_id=86f1034d-837d-4e67-ad5e-63d9642a0b2a
 HTTP/1.1" status: 200  len: 1042 time: 0.0781569
  2020-09-18 08:41:59.951 17808 INFO neutron.wsgi 
[req-4bc06b83-cb44-469a-9ef1-0ce9a4fa0753 
dae4d1b704943b11cb10287e984f9367070915c18f7cadd48f915af92d4b4d03 
35391b98793b4c09bf87c91006d123c2 - f7834cb0083b4f8f81184b6595b46b34 
f7834cb0083b4f8f81184b6595b46b34] 172.29.236.183,172.29.236.21 "GET 
/v2.0/floatingips?tenant_id=35391b98793b4c09bf87c91006d123c2&port_id=298916ee-91a2-428f-86aa-c5ed5f034563&port_id=6d456cbe-425a-4d17-86f6-2b77ab88a42f&port_id=86f1034d-837d-4e67-ad5e-63d9642a0b2a&port_id=947c5610-b0f4-439a-abf8-51ba3dc8d212
 HTTP/1.1" status: 200  len: 1042 time: 0.0917962
  2020-09-18 08:42:06.094 17817 DEBUG neutron.plugins.ml2.rpc 
[req-9cc5b8b2-a616-466d-9ed8-eae2b9b6056b - - - - -] Device 
86f1034d-837d-4e67-ad5e-63d9642a0b2a up at agent ovs-agent-b7w update_device_up 
/openstack/venvs/neutron-18.1.9/lib/python2.7/site-packages/neutron/plugins/ml2/rpc.py:256
  2020-09-18 08:42:06.151 17817 DEBUG neutron.db.provisioning_blocks 
[req-9cc5b8b2-a616-466d-9ed8-eae2b9b6056b - - - - -] Provisioning complete for 
port 86f1034d-837d-4e67-ad5e-63d9642a0b2a triggered by entity L2. 
provisioning_complete 
/openstack/venvs/neutron-18.1.9/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:138
  -

  The port we are talking about is "86f1034d-837d-4e67-ad5e-
  63d9642a0b2a" in the above logs.

  
  
  1.9/lib/python2.7/site-packages/neutron/notifiers/nova.py:242
  2020-09-18 08:41:56.744 16567 DEBUG novaclient.v2.client [-] REQ: curl -g -i 
-X POST http://wtl-int.sandvine.cloud:8774/v2.1/os-server-external-events -H 
"Accept: application/json" -H "Content-Type: application/json" -H "User-Agent: 
python-novaclient" -H "X-Auth-Token: 
{SHA1}c8de3bb0dae419214d99c02879f89bd3a6a4dd78" -H 
"X-OpenStack-Nova-API-Version: 2.1" -d '{"events": [{"status": "completed", 
"tag": "6d456cbe-425a-4d17-86f6-2b77ab88a42f", "name": "network-vif-unplugged", 
"server_uuid": "713376a9-c354-4fb7-946c-e926c1cd9412"}, {"status": "completed", 
"tag": "298916ee-91a2-428f-86aa-c5ed5f034563", "name": "network-vif-unplugged", 
"server_uuid": "713376a9-c354-4fb7-946c-e926c1cd9412"}, {"status": "completed", 
"tag": "86f1034d-837d-4e67-ad5e-63d9642a0b2a", "name": "network-vif-unplugged", 
"server_uuid": "713376a9-c354-4fb7-946c-e926c1cd9412"}]}' _http_

[Yahoo-eng-team] [Bug 1896463] Re: evacuation failed: Port update failed : Unable to correlate PCI slot

2020-09-24 Thread sean mooney
Just adding the previously filed downstream Red Hat bug
https://bugzilla.redhat.com/show_bug.cgi?id=1852110

For context, this can happen in queens, so when we root cause the issue
and fix it, it should likely be backported to queens. There are other older
bugs from newton that look similar, related to unshelve, so it is possible
that the same issue is affecting multiple move operations.

** Bug watch added: Red Hat Bugzilla #1852110
   https://bugzilla.redhat.com/show_bug.cgi?id=1852110

** Also affects: nova/train
   Importance: Undecided
   Status: New

** Also affects: nova/stein
   Importance: Undecided
   Status: New

** Also affects: nova/ussuri
   Importance: Undecided
   Status: New

** Also affects: nova/queens
   Importance: Undecided
   Status: New

** Also affects: nova/victoria
   Importance: Low
 Assignee: Balazs Gibizer (balazs-gibizer)
   Status: Confirmed

** Also affects: nova/rocky
   Importance: Undecided
   Status: New

** Changed in: nova/ussuri
   Importance: Undecided => Low

** Changed in: nova/ussuri
   Status: New => Triaged

** Changed in: nova/train
   Importance: Undecided => Low

** Changed in: nova/train
   Status: New => Triaged

** Changed in: nova/stein
   Importance: Undecided => Low

** Changed in: nova/stein
   Status: New => Triaged

** Changed in: nova/rocky
   Importance: Undecided => Low

** Changed in: nova/rocky
   Status: New => Triaged

** Changed in: nova/queens
   Importance: Undecided => Low

** Changed in: nova/queens
   Status: New => Triaged

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1896463

Title:
  evacuation failed: Port update failed : Unable to correlate PCI slot

Status in OpenStack Compute (nova):
  Confirmed
Status in OpenStack Compute (nova) queens series:
  Triaged
Status in OpenStack Compute (nova) rocky series:
  Triaged
Status in OpenStack Compute (nova) stein series:
  Triaged
Status in OpenStack Compute (nova) train series:
  Triaged
Status in OpenStack Compute (nova) ussuri series:
  Triaged
Status in OpenStack Compute (nova) victoria series:
  Confirmed

Bug description:
  Description
  ===
  if the _update_available_resource() of the resource_tracker is called between 
_do_rebuild_instance_with_claim() and instance.save() when evacuating VM 
instances on the destination host,

  nova/compute/manager.py

  2931 def rebuild_instance(self, context, instance, orig_image_ref, 
image_ref,
  2932 +-- 84 lines: injected_files, new_pass, 
orig_sys_metadata,---
  3016 claim_ctxt = rebuild_claim(
  3017 context, instance, scheduled_node,
  3018 limits=limits, image_meta=image_meta,
  3019 migration=migration)
  3020 self._do_rebuild_instance_with_claim(
  3021 +-- 47 lines: claim_ctxt, context, instance, 
orig_image_ref,-
  3068 instance.apply_migration_context()
  3069 # NOTE (ndipanov): This save will now update the host 
and node
  3070 # attributes making sure that next RT pass is consistent 
since
  3071 # it will be based on the instance and not the migration 
DB
  3072 # entry.
  3073 instance.host = self.host
  3074 instance.node = scheduled_node
  3075 instance.save()
  3076 instance.drop_migration_context()

  the instance is not handled as a managed instance of the destination
  host because it has not been updated in the DB yet.

  2020-09-19 07:27:36.321 8 WARNING nova.compute.resource_tracker [req-
  b35d5b9a-0786-4809-bd81-ad306cdda8d5 - - - - -] Instance
  22f6ca0e-f964-4467-83a3-f2bf12bb05ae is not being actively managed by
  this compute host but has allocations referencing this compute host:
  {u'resources': {u'MEMORY_MB': 12288, u'VCPU': 2, u'DISK_GB': 10}}.
  Skipping heal of allocation because we do not know what to do.

  And so the SRIOV ports (PCI devices) were freed by clean_usage()
  even though the VM still had the VF ports attached.

   743 def _update_available_resource(self, context, resources):
   744 +-- 45 lines: # initialize the compute node object, creating 
it--
   789 self.pci_tracker.clean_usage(instances, migrations, orphans)
   790 dev_pools_obj = self.pci_tracker.stats.to_device_pools_obj()

  After that, when we evacuated this VM to another compute host again, we got
  an error like the one below.


  Steps to reproduce
  ==
  1. create a VM on com1 with SRIOV VF ports.
  2. stop and disable nova-compute service on com1
  3. wait 60 sec (nova-compute reporting interval)
  4. evacuate the VM to com2
  5. wait until the VM is active

[Yahoo-eng-team] [Bug 1581977] Re: Invalid input for dns_name when spawning instance with .number at the end

2020-11-27 Thread sean mooney
Personally I think we should close this as invalid.

This is either a feature request to allow setting a hostname different
from the display name as part of nova boot, or a request to expand the
allowed set of VM names to allow '.', which is currently not allowed, and
transform it to some other value to generate a valid hostname.


This has never been supported, and it is a well-known requirement of the nova API 
that the VM name has to be a valid hostname, meaning it may not contain a '.'.

So I don't think this is a valid bug.

We could improve documentation around this or make the API stricter to
reject the request earlier, but anything beyond that would require a spec
and an API microversion bump as it would be a new feature.

Given the age of this bug I'm going to update the triage status.

** Changed in: nova
   Importance: Low => Wishlist

** Changed in: nova
   Status: Triaged => Opinion

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1581977

Title:
  Invalid input for dns_name when spawning instance with .number at the
  end

Status in OpenStack Compute (nova):
  Opinion

Bug description:
  When attempting to deploy an instance with a name which ends in a dot
  followed by a number (e.g. .123, as in an all-numeric TLD), or simply a
  name that, after conversion to dns_name, ends that way, nova-conductor
  fails with the following error:

  2016-05-15 13:15:04.824 ERROR nova.scheduler.utils [req-4ce865cd-e75b-
  4de8-889a-ed7fc7fece18 admin demo] [instance:
  c4333432-f0f8-4413-82e8-7f12cdf3b5c8] Error from last host:
  silpixa00394065 (node silpixa00394065): [u'Traceback (most recent call
  last):\n', u'  File "/opt/stack/nova/nova/compute/manager.py", line
  1926, in _do_build_and_run_instance\nfilter_properties)\n', u'
  File "/opt/stack/nova/nova/compute/manager.py", line 2116, in
  _build_and_run_instance\ninstance_uuid=instance.uuid,
  reason=six.text_type(e))\n', u"RescheduledException: Build of instance
  c4333432-f0f8-4413-82e8-7f12cdf3b5c8 was re-scheduled: Invalid input
  for dns_name. Reason: 'networking-ovn-ubuntu-16.04' not a valid PQDN
  or FQDN. Reason: TLD '04' must not be all numeric.\nNeutron server
  returns request_ids: ['req-7317c3e3-2875-4073-8076-40e944845b69']\n"]

  This throws one instance of the infamous Horizon message: Error: No
  valid host was found. There are not enough hosts available.

  
  This issue was observed using stable/mitaka via DevStack (nova commit 
fb3f1706c68ea5b58f05ea810c6339f2449959de).

  In the above example, the instance name is "networking-ovn (Ubuntu
  16.04)", which resulted in an attempted dns_name="networking-ovn-
  ubuntu-16.04", where the 04 was interpreted as a TLD and,
  consequently, an invalid TLD.
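
  A rough sketch of the failure mode (hypothetical, simplified helpers; this
  is not nova's actual sanitize_hostname() or neutron's validator):

    import re

    def to_dns_name(display_name):
        # Simplified illustration: lowercase, turn runs of invalid characters
        # into hyphens, keep dots, trim leading/trailing separators.
        name = re.sub(r'[^a-z0-9.-]+', '-', display_name.lower())
        return name.strip('-.')

    def tld_is_valid(dns_name):
        # Mimics the PQDN/FQDN rule that rejects an all-numeric TLD.
        return not dns_name.rsplit('.', 1)[-1].isdigit()

    name = to_dns_name("networking-ovn (Ubuntu 16.04)")
    print(name)                # networking-ovn-ubuntu-16.04
    print(tld_is_valid(name))  # False -> neutron rejects dns_name, boot is re-scheduled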

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1581977/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1901707] Re: race condition on port binding vs instance being resumed for live-migrations

2020-12-16 Thread sean mooney
Adding nova as there is a nova element that needs to be fixed also.

Because nova was observing the network-vif-plugged event from the DHCP
agent, we were not filtering our wait condition on live migrate to only
wait for backends that send plug-time events.

So once this is fixed by Rodolfo's patch it actually breaks live migration,
because we are waiting for an event that will never come until
https://review.opendev.org/c/openstack/nova/+/602432 is merged.

For backporting reasons I am working on a separate trivial patch to only
wait for backends that send plug-time events. That patch will be
backported first, allowing Rodolfo's patch to be backported before
https://review.opendev.org/c/openstack/nova/+/602432

I have 1 unit test left to update in the plug-time patch and then I'll
push it and reference this bug.
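
A rough sketch of the plug-time filtering idea (hypothetical helper and VIF
field name, not the actual nova patch): only register a wait for
network-vif-plugged on VIFs whose backend emits the event at plug time, so
the migration does not block on an event that will never arrive.

    def events_to_wait_for(network_info):
        # Hypothetical: 'plug_time_events' marks backends that emit
        # network-vif-plugged when the VIF is plugged rather than only
        # when the port is bound.
        return [('network-vif-plugged', vif['id'])
                for vif in network_info
                if vif.get('plug_time_events', False)]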

** Also affects: nova
   Importance: Undecided
   Status: New

** Changed in: nova
   Status: New => Triaged

** Changed in: nova
   Importance: Undecided => High

** Changed in: nova
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1901707

Title:
  race condition on port binding vs instance being resumed for live-
  migrations

Status in neutron:
  In Progress
Status in OpenStack Compute (nova):
  Triaged

Bug description:
  This is a separation from the discussion in this bug
  https://bugs.launchpad.net/neutron/+bug/1815989

  There comment https://bugs.launchpad.net/neutron/+bug/1815989/comments/52 
goes through in
  detail the flow on a Train deployment using neutron 15.1.0 (controller) and 
15.3.0 (compute) and nova 20.4.0

  There is a race condition where nova live-migration will wait for
  neutron to send the network-vif-plugged event, but once nova receives
  that event the live migration completes faster than the OVS L2 agent can
  bind the port on the destination compute node.

  This causes the RARP frames sent out to update the switches' ARP tables
  to be lost, leaving the instance completely unreachable after a
  live migration unless these RARP frames are sent again or egress traffic
  is initiated from the instance.

  See Sean's comments below for the view from the Nova side. The correct
  behavior should be that the port is ready for use when nova gets the
  external event, but maybe that is not possible from the neutron side;
  again, see the comments in the other bug.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1901707/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1909972] Re: a number of tests fail under ppc64el arch

2021-01-08 Thread sean mooney
I'm reopening this and marking it as triaged.

ppc64le has been supported with third-party integration testing provided by the
IBM PowerKVM CI on ppc64el for years; here is an example test run 
https://oplab9.parqtec.unicamp.br/pub/ppc64el/openstack/nova/68/767368/1/check/tempest-dsvm-full-focal-py3/ef10362/


Red Hat also ships versions of nova for ppc64el in our downstream product
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html/release_notes/chap-introduction#Content_Delivery_Network_CDN_Channels

that use libvirt KVM on ppc64el, starting with POWER8 in OSP 13 I think,
and now supporting POWER9.

Stephen is correct that we have no first-party CI that covers ppc64el, but
I think it's still OK to fix the unit tests; they are more likely to
regress, yes, but perhaps we can work with infra to see if we can use QEMU
to emulate ppc or see if any of our providers have ppc available.

I know Rackspace used to run a large amount of their cloud on ppc.

** Changed in: nova
   Status: Won't Fix => Triaged

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1909972

Title:
  a number of tests fail under ppc64el arch

Status in OpenStack Compute (nova):
  Triaged

Bug description:
  Hi,

  As per this Debian bug entry:
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=976954

  a number of unit tests are failing under ppc64el arch. Please fix
  these or exclude the tests on this arch.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1909972/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1852437] Re: Allow ability to disable individual CPU features via `cpu_model_extra_flags`

2021-02-05 Thread sean mooney
Setting this back to invalid, as Matt Riedemann said this is a feature, not a bug 
fix.
It is tracked as a blueprint, 
https://blueprints.launchpad.net/nova/+spec/allow-disabling-cpu-flags, and we 
should use that to track it,
not this bug.

** Changed in: nova
   Status: Triaged => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1852437

Title:
  Allow ability to disable individual CPU features via
  `cpu_model_extra_flags`

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  What?
  -

  When using a custom CPU model, Nova currently allows enabling
  individual CPU flags/features via the config attribute,
  `cpu_model_extra_flags`:

  [libvirt]
  cpu_mode=custom
  cpu_model=IvyBridge
  cpu_model_extra_flags="pcid,ssbd, md-clear"

  The above only lets you enable the CPU features.  This RFE is to also
  allow _disabling_ individual CPU features.

  
  Why?
  ---

  A couple of reasons:

- An Operator wants to generate a baseline CPU config (that facilitates
  live migration) across their Compute node pool.  However, a certain
  CPU flag is causing an intolerable performance issue for their
  guest workloads.  If the Operator isolated the problem to _that_
  specific CPU flag, then they would like to disable the flag.

- More importantly, a specific CPU flag might trigger a CPU
  vulnerability.  In such a case, the mitigation for it could be to
  simply _disable_ the offending CPU flag.

  Allowing disabling of individual CPU flags via Nova would enable the
  above use cases.

  
  How?
  

  By allowing the notion of '+' / '-' to indicate whether to enable or
  disable a given CPU flag.

  E.g. if you specify the below in 'nova.conf' (on the Compute nodes):

  [libvirt]
  cpu_mode=custom
  cpu_model=IvyBridge
  cpu_model_extra_flags="+pcid,-mtrr,ssbd"

  Then, when you start an instance, Nova should generate the below XML:

  <cpu mode='custom'>
    <model>IvyBridge</model>
    <vendor>Intel</vendor>
    <feature policy='require' name='pcid'/>
    <feature policy='disable' name='mtrr'/>
    <feature policy='require' name='ssbd'/>
  </cpu>
  
  Note that the requirement to specify '+' / '-' for individual flags
  should be optional.  If neither is specified, then we should assume '+',
  and enable the feature (as shown above for the 'ssbd' flag).

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1852437/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1915055] Re: launched_at's reset when resizing/reverting and unshelving impacts "openstack usage show"

2021-02-08 Thread sean mooney
** Changed in: nova
   Status: New => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1915055

Title:
  launched_at's reset when resizing/reverting and unshelving impacts
  "openstack usage show"

Status in OpenStack Compute (nova):
  Won't Fix

Bug description:
  environment: devstack-master stacked Jan 28th 2021

  The "openstack usage show" commands provide metrics related to run time of 
the instance, as seen below:
  +---+--+
  | Field | Value|
  +---+--+
  | CPU Hours | 260.68   |
  | Disk GB-Hours | 260.68   |
  | RAM MB-Hours  | 66733.63 |
  | Servers   | 3|
  +---+--+

  The logic in [0] determines how those values are calculated. They are
  based on the launched_at and terminated_at fields.

  Some operations, such as resize and unshelve, reset the launched_at
  field. Therefore, for a given instance, the run time information is
  wiped, as if it had never run before.
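
  As a rough sketch of the calculation (hypothetical helper, not the actual
  simple_tenant_usage code at [0]), the reported hours are just the span from
  launched_at to terminated_at (or to "now" for a running instance), which is
  why resetting launched_at makes the reported usage drop:

    from datetime import datetime, timedelta, timezone

    def usage_hours(launched_at, terminated_at=None, now=None):
        # Hours between launch and termination (or the current time).
        now = now or datetime.now(timezone.utc)
        end = terminated_at or now
        return max((end - launched_at).total_seconds(), 0) / 3600.0

    now = datetime.now(timezone.utc)
    print(usage_hours(now - timedelta(hours=100), now=now))  # ~100.0
    # After a resize/unshelve resets launched_at:
    print(usage_hours(now - timedelta(minutes=5), now=now))  # ~0.08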

  Steps to reproduce:
  1. Create an instance.
  2. Wait a few minutes, start monitoring usage with "watch openstack usage 
show --project admin" on a separate tab.
  3. Either shelve and unshelve an instance, or resize the instance and revert 
the resize.
  4. Notice how the "openstack usage show" statistics suddenly drops to a lower 
value and then continues to increase.

  Expected result:
  Statistics would not drop, should continue measuring.

  Some possible solutions:
  1. Stop resetting the launched_at field
  2. Change the field used for calculation at [0] to something else (maybe 
created_at?)

  [0]
  
https://github.com/openstack/nova/blob/6c0ceda3659405149b7c0b5c283275ef0a896269/nova/api/openstack/compute/simple_tenant_usage.py#L74

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1915055/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1913641] Re: Incorrect Shelved_offloaded instance metrics on openstack usage show output

2021-02-08 Thread sean mooney
** Changed in: nova
   Status: In Progress => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1913641

Title:
  Incorrect Shelved_offloaded instance metrics on openstack usage show
  output

Status in OpenStack Compute (nova):
  Won't Fix

Bug description:
  env: bionic-ussuri and bionic-wallaby (devstack)

  When running "openstack usage show --project ", having only
  shelved_offloaded instances in the project, it continues to track
  metrics as if the instance was running, even though it is not. See
  output below:

  $ openstack server list
  
+--+--+---++--+---+
  | ID   | Name | Status| Networks  
 | Image| 
Flavor|
  
+--+--+---++--+---+
  | a8d3fbb6-1734-4e3f-81db-b1c42a462bf7 | ins1 | SHELVED_OFFLOADED | 
private=10.0.0.30, fd6b:5cf:38bb:0:f816:3eff:fe66:c5b0 | 
cirros-0.5.1-x86_64-disk | cirros256 |
  
+--+--+---++--+---+

  $ openstack usage show --project admin

  Usage from 2020-12-31 to 2021-01-29 on project 
1bfc9c13d7da4a4183c0b16cfa80020f:
  +---+---+
  | Field | Value |
  +---+---+
  | CPU Hours | 0.04  |
  | Disk GB-Hours | 0.04  |
  | RAM MB-Hours  | 9.43  |
  | Servers   | 1 |
  +---+---+

  
  $ openstack server show ins1

  
+-+-+
  | Field   | Value 
  |
  
+-+-+
  | OS-DCF:diskConfig   | MANUAL
  |
  | OS-EXT-AZ:availability_zone |   
  |
  | OS-EXT-SRV-ATTR:host| None  
  |
  | OS-EXT-SRV-ATTR:hypervisor_hostname | None  
  |
  | OS-EXT-SRV-ATTR:instance_name   | instance-0001 
  |
  | OS-EXT-STS:power_state  | Shutdown  
  |
  | OS-EXT-STS:task_state   | None  
  |
  | OS-EXT-STS:vm_state | shelved_offloaded 
  |
  | OS-SRV-USG:launched_at  | 2021-01-28T19:33:34.00
  |
  | OS-SRV-USG:terminated_at| None  
  |
  | accessIPv4  |   
  |
  | accessIPv6  |   
  |
  | addresses   | private=10.0.0.30, 
fd6b:5cf:38bb:0:f816:3eff:fe66:c5b0  |
  | config_drive|   
  |
  | created | 2021-01-28T19:33:25Z  
  |
  | flavor  | cirros256 (c1)
  |
  | hostId  |   
  |
  | id  | a8d3fbb6-1734-4e3f-81db-b1c42a462bf7  
  |
  | image   | cirros-0.5.1-x86_64-disk 
(9e09f573-99f7-4f7c-bf16-47d475320207) |
  | key_name| None  
  |
  | name| ins1  
  |
  | project_id  | 1bfc9c13d7da4a4183c0b16cfa80020f  
  |
  | properties  |   
  |
  | security_groups | name='default'
  |
  | status  | SHELVED_OFFLOADED 
  |
  | updated 

[Yahoo-eng-team] [Bug 1915255] Re: [Victoria] nova-compute won't start on aarch64 - raises PciDeviceNotFoundById

2021-02-11 Thread sean mooney
This is a real issue because the Cavium ThunderX hardware violates an assumption
we have that PFs have netdevs if their VFs do.
We just need to re-add the try/except that was removed:
https://review.opendev.org/c/openstack/nova/+/739131/12/nova/virt/libvirt/driver.py#b6957

It was originally removed as we are only looking at the subset of VFs that are NICs,
but since the Cavium ThunderX does not assign a PF to all VFs,
per https://bugs.launchpad.net/charm-nova-compute/+bug/1771662,
we need to catch the exception in this case as we did before.

This means that minimum-bandwidth-based QoS cannot be implemented on
this hardware, as we rely on the PF netdev name to correlate the
bandwidth between nova and neutron, but other functionality should work.
The only way to support min bandwidth QoS on this hardware would be to
alter the NIC driver or enhance nova/neutron to support using the PF
PCI address instead of the parent netdev name.
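
A minimal sketch of the kind of guard being re-added (assumed shape, not the
exact patch): tolerate VFs whose parent PF exposes no netdev instead of
failing the whole PCI device enumeration at start-up.

    from nova import exception
    from nova.pci import utils as pci_utils

    def parent_ifname_or_none(pci_address):
        try:
            return pci_utils.get_ifname_by_pci_address(
                pci_address, pf_interface=True)
        except exception.PciDeviceNotFoundById:
            # e.g. Cavium ThunderX VFs whose PF has no netdev: skip the
            # netdev-based metadata (and with it min-bandwidth QoS) rather
            # than abort nova-compute start-up.
            return None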



** Changed in: nova
   Importance: Undecided => Medium

** Changed in: nova
   Status: New => Triaged

** Also affects: nova/victoria
   Importance: Undecided
   Status: New

** Changed in: nova/victoria
   Status: New => Triaged

** Changed in: nova/victoria
   Importance: Undecided => Medium

** Changed in: nova
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1915255

Title:
  [Victoria] nova-compute won't start on aarch64 - raises
  PciDeviceNotFoundById

Status in OpenStack Compute (nova):
  Triaged
Status in OpenStack Compute (nova) victoria series:
  Triaged

Bug description:
  Description
  ===

  When deploying OpenStack Victoria on Ubuntu 20.04 (Focal) on
  arm64/aarch64, nova-compute 22.0.1 fails to start with (nova-
  compute.log):

  --
  Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/nova/pci/utils.py", line 156, in 
get_ifname_by_pci_address
  dev_info = os.listdir(dev_path)
  FileNotFoundError: [Errno 2] No such file or directory: 
'/sys/bus/pci/devices/0002:01:00.1/physfn/net'

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 9823, 
in _update_available_resource_for_node
  self.rt.update_available_resource(context, nodename,
File "/usr/lib/python3/dist-packages/nova/compute/resource_tracker.py", 
line 880, in update_available_resource
  resources = self.driver.get_available_resource(nodename)
File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 
8473, in get_available_resource
  data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()
File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 
7223, in _get_pci_passthrough_devices
  pci_info = [self._get_pcidev_info(name, dev, net_devs) for name, dev
File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 
7223, in 
  pci_info = [self._get_pcidev_info(name, dev, net_devs) for name, dev
File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 
7199, in _get_pcidev_info
  device.update(_get_device_type(cfgdev, address, dev, net_devs))
File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 
7154, in _get_device_type
  parent_ifname = pci_utils.get_ifname_by_pci_address(
File "/usr/lib/python3/dist-packages/nova/pci/utils.py", line 159, in 
get_ifname_by_pci_address
  raise exception.PciDeviceNotFoundById(id=pci_addr)
  nova.exception.PciDeviceNotFoundById: PCI device 0002:01:00.1 not found
  --

  This results in an empty `openstack hypervisor list`.

  This does not happen with OpenStack Ussuri (nova-compute 21.1.0). We
  also haven't seen this on other architectures (yet?). This code
  actually appeared between Ussuri and Victoria, [0] i.e. the first
  version having it is 22.0.0.

  $ lspci | grep 0002:01:00.1
  0002:01:00.1 Ethernet controller: Cavium, Inc. THUNDERX Network Interface 
Controller virtual function (rev 09)

  Indeed /sys/bus/pci/devices/0002:01:00.1/physfn/ doesn't contain `net`
  but I'm not sure if that's really a problem or if nova-compute should
  just catch the exception and move on?

  A similar issue in the past [1] shows that this might be an issue
  specific to the Cavium Thunder X NIC.

  Related issue: [2]

  Steps to reproduce
  ==

  Install and run nova >= 22.0.0 on an aarch64 machine (with a Cavium
  Thunder X NIC if possible). I personally use Juju [3] for deploying an
  entire OpenStack Victoria setup to a lab:

  $ git clone https://github.com/openstack-charmers/openstack-bun

[Yahoo-eng-team] [Bug 1798904] Re: tenant isolation is bypassed if port admin-state-up=false

2021-02-18 Thread sean mooney
I'm going to move the os-vif bug to Fix Released as
https://github.com/openstack/os-vif/commit/d291213f1ea62f93008deef5224506fb5ea5ee0d
fixes what can be fixed by os-vif alone. This was part of
https://github.com/openstack/os-vif/releases/tag/1.13.0

I am going to leave the nova bug as is until this has been tested end to end,
as I believe
https://review.opendev.org/c/openstack/nova/+/602432 is still required for nova.


** Changed in: os-vif
   Status: Confirmed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1798904

Title:
  tenant isolation is bypassed if port admin-state-up=false

Status in neutron:
  New
Status in OpenStack Compute (nova):
  Confirmed
Status in os-vif:
  Fix Released
Status in OpenStack Security Advisory:
  Incomplete

Bug description:
  This bug is a second variant of
  https://bugs.launchpad.net/neutron/+bug/1734320

  The original bug, which is now public, was limited to the case where a VM is
  live migrated, resulting in a short window where the tenant instance could
  receive VLAN-tagged traffic on the destination node before the neutron ml2
  agent wires up the port on the OVS bridge.

  Note that while the original bug implied that the VM was only able to
  eavesdrop on traffic, it was also possible for the VM to send traffic to a
  different tenant network by creating a VLAN subport which corresponded to a
  VLAN in use for tenant isolation on the br-int.

  The original bug was determined to be a result of the fact that, during live
  migration, if the vif-type was ovs and ovs_hybrid_plug=false, the VIF was
  plugged into the OVS bridge by the hypervisor when the VM was started on the
  destination node, instead of pre-plugging it and waiting for neutron to
  signal that it had completed wiring up the port before migrating the
  instance.

  Since live migration is an admin-only operation unless intentionally changed
  by the operator, the scope of this initial vector was limited.

  The second vector to create a running VM with an untagged port does not
  require admin privileges.

  If a user creates a neutron port and sets the admin-state-up field to
  False

  openstack port create --disable --network <my network> <port name>

  and then either boots a VM with this port

  openstack server create --flavor <flavor> --image <image> --port
  <port> <server name>

  or attaches the port to an existing VM

  openstack server add port <server> <port>

  This will similarly create a window where the port is attached to the guest
  but neutron has not yet wired up the interface.

  Note that this was reported to me for queens with ml2/ovs and the iptables
  firewall. I have not personally validated how to recreate it, but I intend
  to reproduce this on master next week and report back.

  I believe there are a few ways that this can be mitigated.
  The mitigations for the live migration variant will narrow the window
  in which this variant will be viable, and in general may be sufficient in
  the cases where the neutron agent is running correctly.

  But a more complete fix would involve modifications to nova, neutron and
  os-vif.

  From a neutron perspective we could extend the neutron port bindings to
  contain 2 additional fields.

  ml2_driver_names:
  an ordered comma-separated list of the agents that bound this port.
  Note: this will be used by os-vif to determine if it should perform
  additional actions such as tagging the port, or setting its tx/rx queues
  down, to mitigate this issue.

  ml2_port_events
  a list of the times at which port state events are emitted by an ml2 driver,
  or an enum.
  Note: currently ml2/ovs signals nova that it has completed wiring
  up the port only when the agent has configured the vswitch, but odl sends the
  notification when the port is bound in the ml2 driver, before the vswitch is
  configured. To be able to use these more effectively within nova we need
  to be able to know if the event is sent only

  Additionally, changes to os-vif and nova will be required to process
  this new info.

  On the nova side, if we know that a backend will send an event when the port
  is wired up on the vswitch, we may be able to make attach wait until that has
  been done.

  If os-vif knows the ovs plugin is being used with ml2/ovs and the ovs l2
  agent, it could also conditionally wait for the interface to be tagged by
  neutron. This could be done via a config option, however since the plugin is
  shared with sdn controllers that manage ovs such as odl, ovn, onos and
  dragonflow, it would have to default to not waiting, as these other backends
  do not use vlans for tenant isolation.

  Similarly, instead of waiting we could have os-vif apply a drop rule and vlan
  4095 based on a config option. Again this would have to default to false or
  insecure so as not to break sdn-based deployments.

  If we combine one of the config options with the ml2_driver_names change

[Yahoo-eng-team] [Bug 1909120] Re: n-api should reject requests to detach a volume when the compute is down

2021-02-24 Thread sean mooney
Updating this to Fix Released since
https://review.opendev.org/c/openstack/nova/+/768352 is merged on master.

Backports have been proposed to ussuri and victoria so I have added those, and I
also added train since I assume we want this in Train downstream?

If this needs to go back further then feel free to add those too, but I
tried to pick a reasonable set of branches.

** Also affects: nova/ussuri
   Importance: Undecided
   Status: New

** Also affects: nova/victoria
   Importance: Undecided
   Status: New

** Also affects: nova/train
   Importance: Undecided
   Status: New

** Changed in: nova
   Status: Confirmed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1909120

Title:
  n-api should reject requests to detach a volume when the compute is
  down

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) train series:
  New
Status in OpenStack Compute (nova) ussuri series:
  New
Status in OpenStack Compute (nova) victoria series:
  New

Bug description:
  Description
  ===
  At present requests to detach volumes from instances on down computes are 
accepted by n-api but will never be acted upon if the n-cpu service hosting the 
instance is down.

  n-api should reject such requests with a simple 409 HTTP conflict.
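
  A minimal sketch of such a guard (a simplified assumption, not necessarily
  the change merged in https://review.opendev.org/c/openstack/nova/+/768352):
  look up the compute service hosting the instance and return 409 if it is
  down.

    from webob import exc

    def assert_compute_is_up(context, servicegroup_api, host_api, instance):
        # servicegroup_api / host_api stand in for nova's existing helpers.
        service = host_api.service_get_by_compute_host(context, instance.host)
        if not servicegroup_api.service_is_up(service):
            raise exc.HTTPConflict(
                explanation="Compute service on %s is down; cannot detach "
                            "the volume" % instance.host)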

  Steps to reproduce
  ==
  * Attempt to detach a volume from an instance residing on a down compute.

  Expected result
  ===
  Request rejected by n-api

  Actual result
  =
  Request accepted but never completed

  Environment
  ===
  1. Exact version of OpenStack you are running. See the following
list for all releases: http://docs.openstack.org/releases/

 Master

  2. Which hypervisor did you use?
 (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
 What's the version of that?

 libvirt + QEMU/KVM

  2. Which storage type did you use?
 (For example: Ceph, LVM, GPFS, ...)
 What's the version of that?

 N/A

  3. Which networking type did you use?
 (For example: nova-network, Neutron with OpenVSwitch, ...)

 N/A

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1909120/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1918419] Re: vCPU resource max_unit is hardcoded

2021-03-12 Thread sean mooney
In general I don't feel like this is a valid bug.
It is perhaps a feature request, which could be accomplished by an extension to
provider.yaml to allow standard resource class inventories to be updated by the
operator.

In general, what you are asking for is intentionally not allowed.

max_unit must not exceed total, to prevent oversubscription of a single
allocation against itself.

e.g. if total was 4 and max_unit was 8, then we could not actually allocate
8 to a VM without the VM oversubscribing against itself.

That would be invalid, therefore changing max_unit in this way would be
incorrect.

The supported way to address your current problem would be to resize your
impacted VMs before moving them, perhaps to flavors with 2 NUMA nodes,
e.g. hw:numa_nodes=2 hw:mem_page_size=small.
Note: hw:mem_page_size should always be set if you use hw:numa_nodes.

I'm going to mark this as invalid for now, but we could discuss this at the PTG.
Realistically though I don't see a clean way to resolve this while also keeping
the VMs alive; resize would work, but the live requirement is what makes that
unpalatable.
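
A sketch of the inventory involved (illustrative numbers, not values taken
from nova's source): usable capacity scales with allocation_ratio, but a
single allocation can never exceed total, which is what max_unit expresses.

    inventory = {
        'VCPU': {
            'total': 16,              # physical threads with SMT disabled
            'reserved': 0,
            'min_unit': 1,
            'max_unit': 16,           # reported as total, allocation_ratio ignored
            'step_size': 1,
            'allocation_ratio': 4.0,  # 64 vCPUs of aggregate capacity...
        },
    }
    # ...but a single 32-vCPU instance still cannot fit, because one
    # allocation of 32 would oversubscribe against itself (32 > 16).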



** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1918419

Title:
  vCPU resource max_unit is hardcoded

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Because of the spectre/meltdown vulnerabilities (2018) we needed to
  disable SMT in all public-facing compute nodes. As a result the number
  of available cores was reduced by half.

  We had flavors available with 32 vCPUs that couldn't be used anymore
  because the placement max_unit for vCPUs is hardcoded to be the total
  number of CPUs regardless of the allocation_ratio.

  To me it's a sensible default, but it doesn't offer any flexibility for
  operators.

  See the IRC discussion at that time:
  
http://eavesdrop.openstack.org/irclogs/%23openstack-placement/%23openstack-placement.2018-09-20.log.html

  
  As conclusion, we informed the users that we couldn't offer those flavors 
anymore. The old VMs (that were created before disabling SMT) continued to run 
without any issue.

  So... after ~2 year I'm hitting again this problem :)

  These compute nodes need now to be retired and we are live migrating
  all the instances to the replacement hardware.

  When trying to live migrate these instances (vCPUs > max_unit) it
  fails, because the migration allocation can't be created against the
  source compute node. For the new hardware (dest_compute) the vCPUs <
  max_unit, so there is no issue for the new allocation.

  I'm working around this problem (to live migrate the instances),
  patching the code to have a higher max_unit for vCPUs in the compute
  nodes hosting these instances.

  I feel that this issue should be discussed again and consider the
  possibility to configure the max_unit value.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1918419/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2017358] Re: VM doesn't boot after qemu-img convert from VMDK to RAW/QCOW2

2023-04-24 Thread sean mooney
** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2017358

Title:
  VM doesn't boot after qemu-img convert from VMDK to RAW/QCOW2

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  I'm trying to migrate a Windows Server (2016/2019) VM from
  vSphere/VMWare to OpenStack (KVM-QEMU). I followed these instructions:
  https://platform9.com/docs/openstack/tutorials-migrate-windows-vm-
  vmware-kvm without success.

  After downloading the VMDK file from vCenter/vSphere to an Ubuntu Server
  with a GUI installed (a server used for this purpose), I used this
  command:

  
  ```
  ~# qemu-img convert -O qcow2 win2016-copy-flat.vmdk win2016.qcow2

  ~# qemu-img convert -O raw win2016-copy-flat.vmdk win2016.qcow2

  ```

  I tried both formats, RAW and QCOW2, and then imported the image into my
  controller node with the following commands:

  ```
  ~# openstack image create --insecure --container-format bare "win2016-raw" 
--disk-format raw --file /tmp/win2016.qcow2

  ~# openstack image create --insecure --container-format bare
  "win2016-qcow2" --disk-format qcow2 --file /tmp/win2016.qcow2

  ```

  
  Finally, I tested creating a new instance and I obtained this error message:

  Booting from Hard Disk...
  Boot failed: not a bootable disk

  No bootable device.

  
  (Exactly like this issue: 
https://github.com/cloudbase/windows-imaging-tools/issues/324)

  After a lot of googling and a couple of days, I tried another approach:
  changing the chipset of the image from i440fx to q35, and also enabling
  the boot menu and secure boot, as in this link:
  https://bugzilla.redhat.com/show_bug.cgi?id=1663212 following the
  documentation about the properties of images
  (https://docs.openstack.org/ocata/cli-reference/glance-property-keys.html).

  
  Then my instance still does not boot, with a different message but
with the same result, something like this:
https://github.com/ipxe/pipxe/issues/14 and a similar screenshot to this thread
https://forums.freebsd.org/threads/i-got-error-bdsdxe-failed-to-load-boot0001-when-i-boot-kali-linux-vm-via-uefi-firmware.82773/

  Also, I explored the possibility of the partition table being
  corrupted and I tried to repair it with the `gdisk` command, with the same
  result. So, what other approach can I try?

  For context, I have my OpenStack services running on an Ubuntu Server
  cluster with 3 nodes and 1 controller, deployed with kolla-ansible
  on Docker for high availability, and Ceph as storage, configured
  with rbd (rados) to work with Glance/Cinder.

  I have tested different Windows Server editions from scratch,
  installing the OS locally with KVM and virt-manager, then uploading
  the QCOW2 disk to OpenStack, and it works fine, as do other Linux
  distributions. But this specific scenario, migrating Windows
  Server from vSphere to OpenStack, fails at that point, on the
  bootable device.

  
  Thank you for reading and for your time.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2017358/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2018318] Re: 'openstack server resize --flavor' should not migrate VMs to another AZ

2023-05-02 Thread sean mooney
** Changed in: nova
   Status: In Progress => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2018318

Title:
  'openstack server resize --flavor' should not migrate VMs to another
  AZ

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Before I start, let me describe the agents involved in the process
  migration and/or resize flow of OpenStack (in this case, Nova
  component). These are the mapping and interpretation I created while
  troubleshooting the reported problem.

  - Nova-API: the agent responsible for receiving the HTTP requests 
(create/resize/migrate) from the OpenStack end-user. It does some basic 
validation, and then sends a message with the requested command via RPC call to 
other agents.
  - Nova-conductor: the agent responsible for "conducting/guiding" the workflow. 
Nova-conductor will read the commands from the RPC queue and then process the 
request from Nova-API. It does some extra validation, and for every command 
(create/resize/migrate), it asks the scheduler to define the target host 
for the operation (if the target host was not defined by the user).
  - Nova-scheduler: the agent responsible for "scheduling" VMs on hosts. It 
defines where a VM must reside. It receives the "select host request", and 
processes the algorithms to determine where the VM can be allocated. Before 
applying the scheduling algorithms, it calls/queries the Placement system to 
get the possible hosts where VMs might be allocated, i.e. hosts that fit the 
requested parameters, such as being in a given Cell or availability zone (AZ) and 
having available/free computing resources to support the VM. The call from 
Nova-scheduler to Placement is an HTTP request.
  - Placement: behaves as an inventory system. It tracks where resources are 
allocated, their characteristics, and the providers (hosts/storage/network systems) 
where resources are (or can be) allocated. It also has some functions to return 
the possible hosts where a "request spec" can be fulfilled.
  - Nova: the agent responsible for executing/processing the commands and 
implementing actions in the hypervisor.

  
  Then, we have the following workflow from the different processes. 

  - migrate: Nova API ->(via RPC call --
  nova.conductor.manager.ComputeTaskManager.live_migrate_instance) Nova
  Conductor (loads request spec) -> (via RPC call) Nova scheduler ->
  (via HTTP) Placement -> (after the placement return) Nova scheduler
  executes the filtering of the hosts, based on active filters. - >
  (return for the other processes in conductor) -> (via RPC call) Nova
  to execute the migration.

  - resize: Nova API ->(via RPC call --
  nova.conductor.manager.ComputeTaskManager.migrate_server --
  _cold_migrate) Nova Conductor (loads request spec) -> (via RPC call)
  Nova scheduler -> (via HTTP) Placement -> (after the placement return)
  nova scheduler executes the filtering of the hosts, based on active
  filters - > (return for the other processes), in Nova conductor ->
  (RPC call) Nova to execute the cold migration and start the VM again
  with the new computing resource definition

  As a side note, this mapping also explains why the "resize" was not
  executing the CPU compatibility check that the "migration" is
  executing (this is something else that I was checking, but it is worth
  mentioning here). The resize is basically a cold migration to a new
  host, where a new flavor (definition of the VM) is applied; thus, it
  does not need to evaluate CPU feature set compatibility.

  The problem we are reporting happens with both "migrate" and "resize"
  operations. Therefore, I had to add some logs to see what was going on
  there (that whole process is/was "logless"). The issue happens because
  Placement always returns all hosts of the environment for a given VM
  being migrated (resize is a migration process); this only happens if
  the VM is deployed without defining its availability zone in the
  request spec.

  To be more precise,  Nova-conductor in
  
`nova.conductor.tasks.live_migrate.LiveMigrationTask._get_request_spec_for_select_destinations`
  
(https://github.com/openstack/nova/blob/3d83bb3356e10355437851919e161f258cebf761/nova/conductor/tasks/live_migrate.py#L460)
  always uses the original request specification, used to deploy the VM,
  to find a new host to migrate it to. Therefore, if the VM is deployed
  to a specific AZ, it will always send this AZ to Placement (because
  the AZ is in the request spec), and Placement will filter out hosts
  that are not from that AZ. However, if the VM is deployed without
  defining the AZ, Nova will select a host (from an AZ) to deploy it
  (the VM), and when migrating the VM, Nova is not trying to find
  another host in the same AZ where the VM is already running. It is
  always behaving as a new deployment process to select the host.
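
  A sketch of the behaviour being asked for (hypothetical helper, not nova's
  implementation): before scheduling a migration or resize, pin the reused
  request spec to the AZ the instance currently runs in when the original
  boot request did not specify one.

    def pin_request_spec_to_current_az(request_spec, instance_az):
        # Hypothetical: only pin when the user did not ask for an AZ at
        # boot, so an explicit choice is still honoured as today.
        if getattr(request_spec, 'availability_zone', None) is None and instance_az:
            request_spec.availability_zone = instance_az
        return request_spec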

  That raised a que

[Yahoo-eng-team] [Bug 2020215] [NEW] ml2/ovn refuses to bind port due to dead agent randomly in the nova-live-migrate ci job

2023-05-19 Thread sean mooney
Public bug reported:

we have seen random failures of

test_volume_backed_live_migration[id-5071cf17-3004-4257-ae61-73a84e28badd,multinode,volume]

in the nova-live-migration job with the following error

Details: {'code': 400, 'message': 'Migration pre-check error: Binding
failed for port e3308a61-39ff-4064-abb2-76de0d2139dc, please check
neutron logs for more information.'}


Looking at the neutron log we see

May 09 00:10:26.714817 np0033982852 neutron-server[78010]: WARNING
neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-25d762eb-
ffb1-45df-badb-6e02f89e0152 req-f0c9ff35-90a0-49e5-8005-93f3c2bb3ab4
service neutron] Refusing to bind port
e3308a61-39ff-4064-abb2-76de0d2139dc to dead agent:


May 09 00:10:26.716243 np0033982852 neutron-server[78010]: ERROR
neutron.plugins.ml2.managers [req-25d762eb-ffb1-45df-badb-6e02f89e0152
req-f0c9ff35-90a0-49e5-8005-93f3c2bb3ab4 service neutron] Failed to bind
port e3308a61-39ff-4064-abb2-76de0d2139dc on host np0033982853 for
vnic_type normal using segments [{'id':
'1770965e-ddf9-4519-96b1-943912334f78', 'network_type': 'geneve',
'physical_network': None, 'segmentation_id': 525, 'network_id':
'745f0724-2779-4d60-845c-8f673d567d0d'}]


and the following in the neutron-ovn-metadata-agent on the host that the VM is
migrating to.

May 09 00:10:23.765529 np0033982853 neutron-ovn-metadata-agent[38857]:
DEBUG neutron.agent.ovn.metadata.agent [-] Delaying updating chassis
table for 10 seconds {{(pid=38857) run
/opt/stack/neutron/neutron/agent/ovn/metadata/agent.py:243}}

This looks like it might be related to

https://github.com/openstack/neutron/commit/628442aed7400251f12809a45605bd717f494c4e

This modified the code to add some randomness due to
https://bugs.launchpad.net/neutron/+bug/1991817

but that seems to negatively impact the stability of the agent.

To fix this I will propose a patch to change the interval from

interval = randint(0, cfg.CONF.agent_down_time // 2)

to

interval = randint(0, cfg.CONF.agent_down_time // 3)

to increase the likelihood that we send the heartbeat in time.

When we are making calls to privsep and OVS, the logs stop for multiple
seconds while those operations are happening, and if that happens at the
wrong time I believe this leads to us missing the heartbeat interval.
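
A small sketch of the timing involved (assuming an agent_down_time of 75
seconds, the neutron default; simplified, not the actual agent loop): the
chassis heartbeat is delayed by a random interval, and a smaller divisor
means a smaller worst-case delay that slow privsep/OVS calls are less likely
to push past the dead-agent threshold.

    from random import randint

    AGENT_DOWN_TIME = 75  # stands in for cfg.CONF.agent_down_time

    def heartbeat_delay(divisor):
        # randint(0, agent_down_time // divisor): the proposed change moves
        # the divisor from 2 to 3 to shrink the worst-case delay.
        return randint(0, AGENT_DOWN_TIME // divisor)

    print(max(heartbeat_delay(2) for _ in range(1000)))  # up to 37 s of delay
    print(max(heartbeat_delay(3) for _ in range(1000)))  # up to 25 s of delay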

** Affects: neutron
 Importance: Undecided
 Assignee: sean mooney (sean-k-mooney)
 Status: New

** Changed in: neutron
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2020215

Title:
  ml2/ovn refuses to bind port due to dead agent randomly in the nova-
  live-migrate ci job

Status in neutron:
  New

Bug description:
  we have seen random failures of

  
test_volume_backed_live_migration[id-5071cf17-3004-4257-ae61-73a84e28badd,multinode,volume]

  in the nova-live-migration job with the following error

  Details: {'code': 400, 'message': 'Migration pre-check error: Binding
  failed for port e3308a61-39ff-4064-abb2-76de0d2139dc, please check
  neutron logs for more information.'}

  
  Looking at the neutron log we see

  May 09 00:10:26.714817 np0033982852 neutron-server[78010]: WARNING
  neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-25d762eb-
  ffb1-45df-badb-6e02f89e0152 req-f0c9ff35-90a0-49e5-8005-93f3c2bb3ab4
  service neutron] Refusing to bind port
  e3308a61-39ff-4064-abb2-76de0d2139dc to dead agent:
  

  May 09 00:10:26.716243 np0033982852 neutron-server[78010]: ERROR
  neutron.plugins.ml2.managers [req-25d762eb-ffb1-45df-badb-6e02f89e0152
  req-f0c9ff35-90a0-49e5-8005-93f3c2bb3ab4 service neutron] Failed to
  bind port e3308a61-39ff-4064-abb2-76de0d2139dc on host np0033982853
  for vnic_type normal using segments [{'id':
  '1770965e-ddf9-4519-96b1-943912334f78', 'network_type': 'geneve',
  'physical_network': None, 'segmentation_id': 525, 'network_id':
  '745f0724-2779-4d60-845c-8f673d567d0d'}]

  
  and the following in the neutron-ovn-metadata-agent on the host that the VM
is migrating to.

  May 09 00:10:23.765529 np0033982853 neutron-ovn-metadata-agent[38857]:
  DEBUG neutron.agent.ovn.metadata.agent [-] Delaying updating chassis
  table for 10 seconds {{(pid=38857) run
  /opt/stack/neutron/neutron/agent/ovn/metadata/agent.py:243}}

  This looks like it might be related to

  
https://github.com/openstack/neutron/commit/628442aed7400251f12809a45605bd717f494c4e

  This modified the code to add some randomness due to
  https://bugs.launchpad.net/neutron/+bug/1991817

  but that seems to negatively impact the stability of the agent.

  To fix this I will propose a patch to change the interval from

  interval = randint(0, 

[Yahoo-eng-team] [Bug 2020028] Re: evacuate an instance on non-shared storage succeeded and boot image is rebuilt

2023-05-22 Thread sean mooney
This is the expected behavior.

Evacuate of image-backed VMs rebuilds the root disk because that is the
expected behaviour in a cloud environment, where the instance root disk should
not contain any valuable data.

This is functioning precisely how the API was designed to work.

In fact, the preservation of the disk for BFV or instances with shared
storage is perhaps the more surprising aspect.

Evacuate should be assumed to be destructive unless you are using boot
from volume.

It may or may not be destructive, depending on the storage configuration
of the compute nodes, when used with non-boot-from-volume instances.


** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2020028

Title:
  evacuate an instance on non-shared storage succeeded and boot image is
  rebuilt

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Description
  ===

  evacuate an instance on non-shared storage succeeded and boot image is
  rebuilt

  Steps to reproduce
  ==

  1. Create a two compute nodes cluster without shared storage
  2. boot a image backed virtual machine
  3. shutdown down the compute node where vm is running
  4. evacuate instance to another node

  Expected:

  evacuate failed

  Real:

  evacuate succeeded and boot image is rebuilt.

  Version
  ===

  Using nova victoria version

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2020028/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2025813] Re: test_rebuild_volume_backed_server failing 100% on nova-lvm job

2023-07-05 Thread sean mooney
https://review.opendev.org/q/Ia198f712e2ad277743aed08e27e480208f463ac7

** Also affects: nova/antelope
   Importance: Undecided
   Status: New

** Also affects: nova/zed
   Importance: Undecided
   Status: New

** Also affects: nova/yoga
   Importance: Undecided
   Status: New

** Changed in: nova/antelope
   Status: New => In Progress

** Changed in: nova/antelope
   Importance: Undecided => Critical

** Changed in: nova/antelope
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

** Changed in: nova/yoga
   Status: New => Triaged

** Changed in: nova/yoga
   Importance: Undecided => Critical

** Changed in: nova/zed
   Status: New => Triaged

** Changed in: nova/zed
   Importance: Undecided => Critical

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2025813

Title:
  test_rebuild_volume_backed_server failing 100% on nova-lvm job

Status in OpenStack Compute (nova):
  In Progress
Status in OpenStack Compute (nova) antelope series:
  In Progress
Status in OpenStack Compute (nova) yoga series:
  Triaged
Status in OpenStack Compute (nova) zed series:
  Triaged

Bug description:
  After the tempest patch was merged [1] nova-lvm job started to fail
  with the following error in test_rebuild_volume_backed_server:

  
  Traceback (most recent call last):
File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 70, in 
wrapper
  return f(*func_args, **func_kwargs)
File 
"/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 
868, in test_rebuild_volume_backed_server
  self.get_server_ip(server, validation_resources),
File "/opt/stack/tempest/tempest/api/compute/base.py", line 519, in 
get_server_ip
  return compute.get_server_ip(
File "/opt/stack/tempest/tempest/common/compute.py", line 76, in 
get_server_ip
  raise lib_exc.InvalidParam(invalid_param=msg)
  tempest.lib.exceptions.InvalidParam: Invalid Parameter passed: When 
validation.connect_method equals floating, validation_resources cannot be None

  As discussed on IRC with Sean [2], the SSH validation is mandatory now,
  but it is disabled in the job config [3].

  [1] https://review.opendev.org/c/openstack/tempest/+/831018
  [2] 
https://meetings.opendev.org/irclogs/%23openstack-nova/%23openstack-nova.2023-07-04.log.html#t2023-07-04T15:33:38
  [3] 
https://opendev.org/openstack/nova/src/commit/4b454febf73cdd7b5be0a2dad272c1d7685fac9e/.zuul.yaml#L266-L267

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2025813/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2025813] Re: test_rebuild_volume_backed_server failing 100% on nova-lvm job

2023-07-07 Thread sean mooney
This is a bug in devstack-plugin-ceph-multinode-tempest-py3; we need to backport
https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/882987

** Also affects: devstack-plugin-ceph
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2025813

Title:
  test_rebuild_volume_backed_server failing 100% on nova-lvm job

Status in devstack-plugin-ceph:
  New
Status in OpenStack Compute (nova):
  In Progress
Status in OpenStack Compute (nova) antelope series:
  In Progress
Status in OpenStack Compute (nova) yoga series:
  Triaged
Status in OpenStack Compute (nova) zed series:
  Triaged

Bug description:
  After the tempest patch was merged [1] nova-lvm job started to fail
  with the following error in test_rebuild_volume_backed_server:

  
  Traceback (most recent call last):
File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 70, in 
wrapper
  return f(*func_args, **func_kwargs)
File 
"/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 
868, in test_rebuild_volume_backed_server
  self.get_server_ip(server, validation_resources),
File "/opt/stack/tempest/tempest/api/compute/base.py", line 519, in 
get_server_ip
  return compute.get_server_ip(
File "/opt/stack/tempest/tempest/common/compute.py", line 76, in 
get_server_ip
  raise lib_exc.InvalidParam(invalid_param=msg)
  tempest.lib.exceptions.InvalidParam: Invalid Parameter passed: When 
validation.connect_method equals floating, validation_resources cannot be None

  As discussed on IRC with Sean [2], the SSH validation is mandatory now,
  but it is disabled in the job config [3].

  [1] https://review.opendev.org/c/openstack/tempest/+/831018
  [2] 
https://meetings.opendev.org/irclogs/%23openstack-nova/%23openstack-nova.2023-07-04.log.html#t2023-07-04T15:33:38
  [3] 
https://opendev.org/openstack/nova/src/commit/4b454febf73cdd7b5be0a2dad272c1d7685fac9e/.zuul.yaml#L266-L267

To manage notifications about this bug go to:
https://bugs.launchpad.net/devstack-plugin-ceph/+bug/2025813/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2028851] [NEW] Console output was empty in test_get_console_output_server_id_in_shutoff_status

2023-07-27 Thread sean mooney
Public bug reported:

test_get_console_output_server_id_in_shutoff_status

https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/api/compute/servers/test_server_actions.py#L713

is failing consistently in the nova-lvm job starting on July 24 with 132
failures in the last 3 days. https://tinyurl.com/kvcc9289


Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", 
line 728, in test_get_console_output_server_id_in_shutoff_status
self.wait_for(self._get_output)
  File "/opt/stack/tempest/tempest/api/compute/base.py", line 340, in wait_for
condition()
  File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", 
line 213, in _get_output
self.assertTrue(output, "Console output was empty.")
  File "/usr/lib/python3.10/unittest/case.py", line 687, in assertTrue
raise self.failureException(msg)
AssertionError: '' is not true : Console output was empty.

It's not clear why this has started failing. It may be a regression or a
latent race in the test that we are only now hitting.

def test_get_console_output_server_id_in_shutoff_status(self):
    """Test getting console output for a server in SHUTOFF status

    Should be able to GET the console output for a given server_id
    in SHUTOFF status.
    """

    # NOTE: SHUTOFF is irregular status. To avoid test instability,
    #   one server is created only for this test without using
    #   the server that was created in setUpClass.
    server = self.create_test_server(wait_until='ACTIVE')
    temp_server_id = server['id']

    self.client.stop_server(temp_server_id)
    waiters.wait_for_server_status(self.client, temp_server_id, 'SHUTOFF')
    self.wait_for(self._get_output)

The test does not wait for the VM to be SSHable, so it's possible that we
are shutting off the VM before it is fully booted and before any output has
been written to the console.

This failure has happened on multiple providers but only in the nova-lvm job.
The console behavior is unrelated to the storage backend, but the nova-lvm job
is, I believe, using LVM on a loopback file, so the storage performance is
likely slower than raw/qcow.

So perhaps the boot is taking longer and no output has been written yet.
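
One way to close that race, sketched as a hedged variant of the quoted test
(the names come from the test above; whether tempest wants this exact ordering
is an assumption, not a merged change):

    def test_get_console_output_server_id_in_shutoff_status(self):
        server = self.create_test_server(wait_until='ACTIVE')
        temp_server_id = server['id']

        # Assumption: wait until the guest has written something to the
        # console before shutting it off, so the later check has data.
        self.wait_for(self._get_output)

        self.client.stop_server(temp_server_id)
        waiters.wait_for_server_status(self.client, temp_server_id, 'SHUTOFF')
        self.wait_for(self._get_output)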

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: gate-failure

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2028851

Title:
   Console output was empty in
  test_get_console_output_server_id_in_shutoff_status

Status in OpenStack Compute (nova):
  New

Bug description:
  test_get_console_output_server_id_in_shutoff_status

  
https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/api/compute/servers/test_server_actions.py#L713

  is failing consistently in the nova-lvm job starting on July 24 with
  132 failures in the last 3 days. https://tinyurl.com/kvcc9289

  
  Traceback (most recent call last):
File 
"/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 
728, in test_get_console_output_server_id_in_shutoff_status
  self.wait_for(self._get_output)
File "/opt/stack/tempest/tempest/api/compute/base.py", line 340, in wait_for
  condition()
File 
"/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 
213, in _get_output
  self.assertTrue(output, "Console output was empty.")
File "/usr/lib/python3.10/unittest/case.py", line 687, in assertTrue
  raise self.failureException(msg)
  AssertionError: '' is not true : Console output was empty.

  It's not clear why this has started failing. It may be a regression or
  a latent race in the test that we are only now hitting.

  def test_get_console_output_server_id_in_shutoff_status(self):
      """Test getting console output for a server in SHUTOFF status

      Should be able to GET the console output for a given server_id
      in SHUTOFF status.
      """

      # NOTE: SHUTOFF is irregular status. To avoid test instability,
      #   one server is created only for this test without using
      #   the server that was created in setUpClass.
      server = self.create_test_server(wait_until='ACTIVE')
      temp_server_id = server['id']

      self.client.stop_server(temp_server_id)
      waiters.wait_for_server_status(self.client, temp_server_id, 'SHUTOFF')
      self.wait_for(self._get_output)

  The test does not wait for the VM to be SSHable, so it's possible that
  we are shutting off the VM before it is fully booted and before any output
  has been written to the console.

  This failure has happened on multiple providers but only in the nova-lvm job.
  The console behavior is unrelated to the storage backend, but the nova-lvm job
  is, I believe, using
  lvm on a loopback fil

[Yahoo-eng-team] [Bug 2026831] Re: Table nova/pci_devices is not updated after removing attached SRIOV port

2023-08-30 Thread sean mooney
This is not a bug; this is intentional behavior added by
https://github.com/openstack/nova/commit/26c41eccade6412f61f9a8721d853b545061adcc
to address https://bugs.launchpad.net/nova/+bug/1633120

** Changed in: nova
   Status: New => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2026831

Title:
  Table nova/pci_devices is not updated after removing attached SRIOV
  port

Status in OpenStack Compute (nova):
  Won't Fix

Bug description:
  Description
  ===

  When I create an SRIOV port and attach it to an instance then
  Nova/pci_devices db table for the VF is correctly updated and status
  of the VF is changed from "available" to "allocated". If I detach the
  port from the instance, the VF's status is also correctly reverted
  back to "available".

  But in case the port is deleted before it is detached from the
  instance, the VF's status stays "allocated" (in the db
  Nova/pci_devices) and it makes this VF unusable.


  Steps to reproduce
  ==

  
  1) create an SRIOV port in Openstack (VNIC type = Direct) and attach it to a 
VM
  2) delete the SRIOV port from Openstack without detaching it from the VM 
first  


  Expected result
  ===

  1) VF detached from the VM
  2) VF's status in database (Nova/pci_devices) changed to "available"

  
  Actual result
  =

  1) VF detached from the VM
  2) VF's status in database (Nova/pci_devices) IS NOT changed to "available", 
it stays "allocated"


  Environment
  ===
  1. Openstack version: Yoga

rpm -qa | grep nova
python3-novaclient-17.7.0-1.el8.noarch
openstack-nova-conductor-25.2.0-1.el8.noarch
python3-nova-25.2.0-1.el8.noarch
openstack-nova-common-25.2.0-1.el8.noarch
openstack-nova-scheduler-25.2.0-1.el8.noarch
openstack-nova-api-25.2.0-1.el8.noarch
openstack-nova-novncproxy-25.2.0-1.el8.noarch

  
  2. Which hypervisor did you use?

Libvirt + KVM

What's the version of that?

libvirt-7.6.0-6.el8.x86_64
qemu-kvm-6.0.0-33.el8.x86_64
 


  2. Which storage type did you use?

This issue is storage independent.


  3. Which networking type did you use?
 Neutron + openvswitch + sriovnicswitch

  Logs & Configs
  ==

  (hypervisor) nova-compute.log:

  Before:

  PciDevicePool(count=16,numa_node=0,product_id='XXX',tags={dev_type='type-
  
VF',parent_ifname='XXX',physical_network='XXX',remote_managed='false'},vendor_id='XXX')

  
  After:

  PciDevicePool(count=15,numa_node=0,product_id='XXX',tags={dev_type='type-
  
VF',parent_ifname='XXX',physical_network='XXX',remote_managed='false'},vendor_id='XXX')

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2026831/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1983863] Re: Can't log within tpool.execute

2023-09-11 Thread sean mooney
Adding nova, as the change to fix this is breaking our unit tests.
https://review.opendev.org/c/openstack/nova/+/894538 corrects this.
Setting as critical, as this is blocking the bump of upper constraints to include
oslo.log 5.3.0.

I don't think there is any real-world impact beyond that.

** Also affects: nova
   Importance: Undecided
   Status: New

** Changed in: nova
   Status: New => In Progress

** Changed in: nova
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

** Changed in: nova
   Importance: Undecided => Critical

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1983863

Title:
  Can't log within tpool.execute

Status in OpenStack Compute (nova):
  In Progress
Status in oslo.log:
  Fix Released

Bug description:
  There is a bug in eventlet where logging within a native thread can
  lead to a deadlock situation:
  https://github.com/eventlet/eventlet/issues/432

  When encountered with this issue some projects in OpenStack using
  oslo.log, eg. Cinder, resolve them by removing any logging withing
  native threads.

  There is actually a better approach.  The Swift team came up with a
  solution a long time ago, and it would be great if oslo.log could use
  this workaround automaticaly:
  
https://opendev.org/openstack/swift/commit/69c715c505cf9e5df29dc1dff2fa1a4847471cb6

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1983863/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2039381] Re: Regarding Nova's inability to delete the Cinder volume for creating virtual machines (version Y)

2023-11-08 Thread sean mooney
Reviewing the steps Rene performed and the initial bug description, this
workflow is not supported.

Nova has never supported attaching a volume to a guest via the Cinder API,
and detaching it has been explicitly blocked due to the CVE exposures.

So for nova I believe this is invalid.

Cinder likely should prevent normal users from creating attachments for a
nova instance, with the same mitigation as the detach case.

Creating a volume attachment for a nova instance should require a service token
with the service role,
just as delete does.

** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2039381

Title:
  Regarding Nova's inability to delete the Cinder volume for creating
  virtual machines (version Y)

Status in Cinder:
  Confirmed
Status in OpenStack Compute (nova):
  Invalid

Bug description:
  When creating a virtual machine in the dashboard, create a volume and
  choose to delete the virtual machine while also deleting the volume.
  When deleting the virtual machine, there is no normal uninstallation
  of the volume and the volume is not deleted.

  The relevant error logs are shown in the image, but the openstack CLI
  can delete its volume. The specific commands are as follows.

  CLI:

  source /etc/keystone/admin-openrc.sh (Verify password file)
  openstack volume set --detached 191e555c-3947-4928-be46-9f09e2190877(volumeID)
  openstack volume delete  191e555c-3947-4928-be46-9f09e2190877(volumeID)

  It seems that Nova is unable to interact with the Cinder API to
  delete(or detached) commands, but I am not very professional. I don't
  know if it's a bug?

  This bug tracker is for documentation errors; please use the following as a template, deleting or adding fields as needed. Change [ ] to [x] to tick a checkbox:

  - [ ] This documentation is inaccurate in this way: __
  - [ ] This is a documentation addition request.
  - [ ] I have a fix for the documentation that I can paste below, including an example: input and output.

  If you have a troubleshooting or support question, please use the following resources:

  - Mailing list: https://lists.openstack.org
   - IRC: the "openstack" channel on OFTC

  ---
  Release: 25.2.2.dev1 on 2019-10-08 11:20:05
  SHA: fd0d336ab5be71917ef9bd94dda51774a697eca8
  Source: https://opendev.org/openstack/nova/src/doc/source/install/index.rst
  URL: https://docs.openstack.org/nova/yoga/install/

To manage notifications about this bug go to:
https://bugs.launchpad.net/cinder/+bug/2039381/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2051108] Re: Support for the "bring your own keys" approach for Cinder

2024-01-31 Thread sean mooney
For cinder this would likely require a spec, as it is an API change to be
able to pass the Barbican secret, I believe.

For nova this might be a specless blueprint if the changes were minor
enough and we could capture the details in the cinder spec; otherwise we
would need a spec for nova as well.

In either case this is not a bug in the scope of nova, so I'll mark the
nova part as invalid from a paperwork perspective, since this would be
tracked as a nova blueprint in Launchpad, with or without a spec, not as
a bug.

** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2051108

Title:
  Support for the "bring your own keys" approach for Cinder

Status in Cinder:
  New
Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Description
  ===
  Cinder currently lacks support in the API to create a volume with a predefined
(e.g. already stored in Barbican) encryption key. This feature would be useful
for use cases where end users should be able to store keys that are later used to
encrypt volumes.

  The workflow would be as follows:
  1. End user creates a new key and stores it in OpenStack Barbican
  2. User requests a new volume with volume type "LUKS" and gives an
"encryption_reference_key_id" (or just "key_id").
  3. Internally the key is copied (like in
volume_utils.clone_encryption_key_()) and a new "encryption_key_id" is generated.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cinder/+bug/2051108/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2007968] Re: Flavors may not meet the image minimum requirement when resize

2024-02-01 Thread sean mooney
** Also affects: nova
   Importance: Undecided
   Status: New

** Changed in: horizon
   Status: New => Invalid

** Changed in: nova
   Status: New => Triaged

** Changed in: nova
   Importance: Undecided => Medium

** Changed in: nova
 Assignee: (unassigned) => zhou zhong (zhouzhongg)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2007968

Title:
  Flavors may not meet the image minimum requirement when resize

Status in OpenStack Dashboard (Horizon):
  Invalid
Status in OpenStack Compute (nova):
  Triaged

Bug description:
  Description
  ===
  When resizing an instance, the flavors returned may not meet the image's minimum
memory requirement. The resize ignores the minimum memory limit of the image, which
may allow the resize to succeed while the instance fails to start because the
memory is too small to run the system.
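
  A hedged sketch of the missing filtering step (illustrative Python; the helper
  name is made up, not horizon or nova code):

      # Drop flavors that do not satisfy the image's minimum RAM/disk before
      # offering them for a resize.
      def flavors_meeting_image_minimums(flavors, image):
          min_ram = image.get('min_ram', 0)    # MB
          min_disk = image.get('min_disk', 0)  # GB
          return [f for f in flavors
                  if f['ram'] >= min_ram and f['disk'] >= min_disk]

      flavors = [{'name': 'small', 'ram': 2048, 'disk': 20},
                 {'name': 'large', 'ram': 8192, 'disk': 80}]
      image = {'min_ram': 4096, 'min_disk': 10}
      print(flavors_meeting_image_minimums(flavors, image))  # only 'large' remains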

  Steps to reproduce
  ==
  1.create an instance with image min_ram 4096
  2.resize the instance
  3.watch the returned flavors

  Expected result
  ===
  Flavors whose memory is less than 4096 MB should not be included.

  Actual result
  =
  All of the visible flavors are returned.

  Environment
  ===

  Logs & Configs
  ==

To manage notifications about this bug go to:
https://bugs.launchpad.net/horizon/+bug/2007968/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2052718] Re: Nova Compute Service status goes up and down abnormally

2024-02-08 Thread sean mooney
I don't believe this is in the scope of nova to fix.

The requirement to have consistent time synchronisation is well known, and
it strongly feels like a problem that should be addressed by an
installation tool, not in code.

We mention that the controllers should be running a shared service like NTP in the
docs
https://docs.openstack.org/nova/latest/install/overview.html#controller


If you have not ensured your clocks are in sync as part of the installation
process, via NTP, PTP or another method, then I would not consider OpenStack to
be correctly installed.

** Changed in: nova
   Status: New => Opinion

** Changed in: nova
   Importance: Undecided => Wishlist

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2052718

Title:
  Compute service status still up with negative elapsed time

Status in OpenStack Compute (nova):
  Opinion

Bug description:
  Hi community,

  When you type:
  $ openstack compute service list

  you will see the "up" status, but the logic is actually wrong because the
  elapsed time is a negative number. This is caused by the abs(elapsed) call
  turning it into a positive value.

  Around the abs(elapsed) line of code ->
  
https://github.com/openstack/nova/blob/stable/2023.2/nova/servicegroup/drivers/db.py

  ...
  ...
  def is_up(self, service_ref):
      ...
      # Timestamps in DB are UTC.
      elapsed = timeutils.delta_seconds(last_heartbeat, timeutils.utcnow())
      is_up = abs(elapsed) <= self.service_down_time
      if not is_up:
          LOG.debug('Seems service %(binary)s on host %(host)s is down. '
                    'Last heartbeat was %(lhb)s. Elapsed time is %(el)s',
                    {'binary': service_ref.get('binary'),
                     'host': service_ref.get('host'),
                     'lhb': str(last_heartbeat), 'el': str(elapsed)})
      return is_up
  ...
  ...

  service_down_time (threshold): 60s
  https://github.com/openstack/nova/blob/stable/2023.2/nova/conf/service.py#L40

  === Bad result ===

  Example (1) bug:

  last_heartbeat: 10:00:00 AM
  now: 9:59:30 AM
  elapsed: -30(s)
  abs(-30s) < 60s
  ===> result: up

  Example (2) bug:

  last_heartbeat: 10:01:00 AM
  now: 9:59:58 AM
  elapsed: -62(s)
  abs(-62s) > 60s
  ===> result: down

  === Expected result ===

  Example (1) good expectations:

  last_heartbeat: 10:00:00 AM
  now: 9:59:30 AM
  elapsed: -30(s) < 0
  ===> result: logging error and down

  Example (2) good expectations:

  last_heartbeat: 10:01:00 AM
  now: 9:59:58 AM
  elapsed: -62(s) < 0
  ===> result: logging error and down
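
  A hedged sketch of the expected check (an assumption, not the actual nova
  change): a heartbeat that is newer than "now" means the clocks are out of
  sync, so the service should be reported as down instead of taking the
  absolute value.

      import datetime

      from oslo_utils import timeutils

      def is_up(last_heartbeat, service_down_time=60):
          elapsed = timeutils.delta_seconds(last_heartbeat, timeutils.utcnow())
          if elapsed < 0:
              # Negative elapsed time means clock skew; log and report down.
              return False
          return elapsed <= service_down_time

      # Heartbeat 30 seconds in the "future" relative to this node's clock.
      print(is_up(timeutils.utcnow() + datetime.timedelta(seconds=30)))  # False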

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2052718/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2052937] Re: Policy: binding operations are prohibited for service role

2024-02-22 Thread sean mooney
Nova has a job that was using a post hook for some extra sanity checks:
https://review.opendev.org/c/openstack/nova/+/909859
I have removed that, but until that merges nova-next is blocked.

** Also affects: nova
   Importance: Undecided
   Status: New

** Changed in: nova
   Status: New => In Progress

** Changed in: nova
   Importance: Undecided => Critical

** Changed in: nova
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

** Tags added: gate-failure

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2052937

Title:
  Policy: binding operations are prohibited for service role

Status in neutron:
  Fix Released
Status in OpenStack Compute (nova):
  In Progress

Bug description:
  The create/update port binding:* policies are admin only, which prevents,
  for example, the ironic service user with the service role from managing
  baremetal ports:

  
  "http://192.0.2.10:9292";, "region": "RegionOne"}], "id": 
"e6e42ef4fc984e71b575150e59a92704", "type": "image", "name": "glance"}]}} 
get_auth_ref 
/var/lib/kolla/venv/lib64/python3.9/site-packages/keystoneauth1/identity/v3/base.py:189
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron [None 
req-6737aef3-c823-4f7c-95ec-1c9f38b14faa a4dbb0dc59024c199843cea86603308b 
9fd64a4cbd774756869cb3968de2e9b6 - - default default] Unable to clear binding 
profile for neutron port 291dbb7b-5cc8-480d-b39d-eb849bcb4a64. Error: 
ForbiddenException: 403: Client Error for url: 
http://192.0.2.10:9696/v2.0/ports/291dbb7b-5cc8-480d-b39d-eb849bcb4a64, 
((rule:update_port and rule:update_port:binding:host_id) and 
rule:update_port:binding:profile) is disallowed by policy: 
openstack.exceptions.ForbiddenException: ForbiddenException: 403: Client Error 
for url: 
http://192.0.2.10:9696/v2.0/ports/291dbb7b-5cc8-480d-b39d-eb849bcb4a64, 
((rule:update_port and rule:update_port:binding:host_id) and 
rule:update_port:binding:profile) is disallowed by policy
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron Traceback (most recent 
call last):
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron   File 
"/var/lib/kolla/venv/lib64/python3.9/site-packages/ironic/common/neutron.py", 
line 130, in unbind_neutron_port
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron 
update_neutron_port(context, port_id, attrs_unbind, client)
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron   File 
"/var/lib/kolla/venv/lib64/python3.9/site-packages/ironic/common/neutron.py", 
line 109, in update_neutron_port
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron return 
client.update_port(port_id, **attrs)
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron   File 
"/var/lib/kolla/venv/lib64/python3.9/site-packages/openstack/network/v2/_proxy.py",
 line 2992, in update_port
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron return 
self._update(_port.Port, port, if_revision=if_revision, **attrs)
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron   File 
"/var/lib/kolla/venv/lib64/python3.9/site-packages/openstack/proxy.py", line 
61, in check
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron return method(self, 
expected, actual, *args, **kwargs)
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron   File 
"/var/lib/kolla/venv/lib64/python3.9/site-packages/openstack/network/v2/_proxy.py",
 line 202, in _update
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron return 
res.commit(self, base_path=base_path, if_revision=if_revision)
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron   File 
"/var/lib/kolla/venv/lib64/python3.9/site-packages/openstack/resource.py", line 
1803, in commit
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron return self._commit(
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron   File 
"/var/lib/kolla/venv/lib64/python3.9/site-packages/openstack/resource.py", line 
1848, in _commit
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron 
self._translate_response(response, has_body=has_body)
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron   File 
"/var/lib/kolla/venv/lib64/python3.9/site-packages/openstack/resource.py", line 
1287, in _translate_response
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron 
exceptions.raise_from_response(response, error_message=error_message)
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron   File 
"/var/lib/kolla/venv/lib64/python3.9/site-packages/openstack/exceptions.py", 
line 250, in raise_from_response
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron raise cls(
  2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron 
openstack.exceptions.ForbiddenException: ForbiddenException: 403: Client Error 
for url: 
htt

[Yahoo-eng-team] [Bug 2054797] Re: Unshelve can cause quota over-consumption

2024-02-23 Thread sean mooney
*** This bug is a duplicate of bug 2003991 ***
https://bugs.launchpad.net/bugs/2003991

This sounds like you have count_usage_from_placement=true

https://docs.openstack.org/nova/latest/configuration/config.html#quota.count_usage_from_placement

in which case this is not a bug and is the intended behavior.

There was a bug related to this which I believe was fixed recently;
it may or may not be backported to Yoga.

** This bug has been marked a duplicate of bug 2003991
   Quota not properly enforced during unshelve when 
[quota]count_usage_from_placement = True

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2054797

Title:
  Unshelve can cause quota over-consumption

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===
  Unshelving a VM can cause an over consumption of a project's quota. I'm not 
sure if this is a bug or if it's actually intended behaviour, but in my opinion 
this should not be possible since this will allow users to potentially use a 
lot more resources than their intended quota.

  Steps to reproduce
  ==
  * Create a project with a quota of i.e. 4 CPUs and 4GB of RAM
  * Create server1 with 2 CPUs and 2GB RAM, and shelve it after it successfully 
spawns
  * When server1 in shelved, create server2 with 4 CPUs and 4GB of RAM 
(effectively using up the entire CPU and RAM quota of the project)
  * Unshelve server1

  Expected result
  ===
  I would then expect that unshelving server1 would fail, since the quota was 
used up by server2

  Actual result
  =
  Unshelving server1 is completed, and I have now used 6 of 4 CPUs and 6 of 4GB 
RAM on my project's quota. FWIW this also works if at the time of unshelving 
the quota is already used up.

  Environment
  ===
  Openstack Yoga
  nova-api 3:25.1.1-0ubuntu1~cloud0
  nova-scheduler 3:25.1.1-0ubuntu1~cloud0

  Running KVM/libvirt on Ubuntu 20.04 and Ceph 17.x

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2054797/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2003991] Re: Quota not properly enforced during unshelve when [quota]count_usage_from_placement = True

2024-02-23 Thread sean mooney
https://review.opendev.org/q/topic:%22bug/2003991%22

Note there were backports filed back to Train, but those branches are
now unsupported.

** Also affects: nova/yoga
   Importance: Undecided
   Status: New

** Also affects: nova/xena
   Importance: Undecided
   Status: New

** Also affects: nova/zed
   Importance: Undecided
   Status: New

** Also affects: nova/antelope
   Importance: Undecided
   Status: New

** Also affects: nova/wallaby
   Importance: Undecided
   Status: New

** Also affects: nova/train
   Importance: Undecided
   Status: New

** Also affects: nova/victoria
   Importance: Undecided
   Status: New

** Also affects: nova/ussuri
   Importance: Undecided
   Status: New

** Changed in: nova/antelope
   Status: New => Fix Released

** Changed in: nova
   Importance: Undecided => Medium

** Changed in: nova/antelope
   Importance: Undecided => Medium

** Changed in: nova/train
   Status: New => Won't Fix

** Changed in: nova/ussuri
   Status: New => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2003991

Title:
  Quota not properly enforced during unshelve when
  [quota]count_usage_from_placement = True

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) antelope series:
  Fix Released
Status in OpenStack Compute (nova) train series:
  Won't Fix
Status in OpenStack Compute (nova) ussuri series:
  Won't Fix
Status in OpenStack Compute (nova) victoria series:
  New
Status in OpenStack Compute (nova) wallaby series:
  New
Status in OpenStack Compute (nova) xena series:
  New
Status in OpenStack Compute (nova) yoga series:
  New
Status in OpenStack Compute (nova) zed series:
  New

Bug description:
  When nova is configured to count quota usage from placement [1], there
  are some behaviors that are different from the legacy quota resource
  counting.

  With legacy quotas, all of an instance's resources remained consumed
  from a quota perspective while the instance was SHELVED_OFFLOADED.
  Because of this, there was no need to check quota when doing an
  unshelve and an unshelve request could not be blocked for quota
  related reasons. The quota usage remained the same whether the
  instance was SHELVED_OFFLOADED or not.

  With counting quota usage from placement, cores and ram resource usage
  is counted from placement while instances are counted from the API
  database. And when an instance is SHELVED_OFFLOADED, it does not have
  any resource allocations in placement for cores and ram during that
  time. Because of this, it is possible to go over cores and ram quota
  after unshelving an instance as new resources will be allocated in
  placement for the unshelved instance.

  The unshelve quota scenario is currently not being properly enforced
  because there are no quota checks in the scheduling code path, so when
  the unshelving instance goes through the scheduling process, it is not
  validated against quota. There needs to be a dedicated quota check for
  unshelve.
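
  A hedged sketch of what such a dedicated check could look like (illustrative
  only; the names and the point where it runs are assumptions, not the merged
  fix), using a project with a 4-core / 4096 MB quota whose quota is already
  consumed while a 2-core / 2048 MB server is shelved:

      # Re-count cores/ram before restoring the allocation of an offloaded server.
      def can_unshelve(used_cores, used_ram_mb, flavor, limits):
          return (used_cores + flavor['vcpus'] <= limits['cores']
                  and used_ram_mb + flavor['ram'] <= limits['ram'])

      limits = {'cores': 4, 'ram': 4096}
      shelved_flavor = {'vcpus': 2, 'ram': 2048}
      print(can_unshelve(4, 4096, shelved_flavor, limits))  # False: deny unshelve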

  [1] https://docs.openstack.org/nova/latest/admin/quotas.html#quota-
  usage-from-placement

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2003991/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2055245] Re: DHCP Option is not passed to VM via Cloud-init

2024-02-28 Thread sean mooney
This is a neutron bug, not a nova one.

The behavior should not change between using the DHCP agent and native
DHCP.

** Also affects: neutron
   Importance: Undecided
   Status: New

** Changed in: nova
   Status: In Progress => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2055245

Title:
  DHCP Option is not passed to VM via Cloud-init

Status in neutron:
  New
Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Description
  ===

  The Nova metadata API doesn't provide the ipv4_dhcp type for OVN (native OVN
  DHCP feature, no DHCP agents) networks with dhcp_enabled but no
  default gateway.

  Problem seems to be in
  
https://opendev.org/openstack/nova/src/branch/master/nova/network/neutron.py#L3617

  There is just an exception to networks without device_owner:
  network:dhcp where default gateway is used, which doesn't cover this
  case.

  Steps to reproduce
  ==

  Create a OVN network in an environment where native DHCP feature is
  provided by ovn (no ml2/ovs DHCP Agents). In addition this network
  needs to have no default gateway enabled.

  Create VM in this network and observe the cloud-init process
  (network_data.json)

  Expected result
  ===

  network_data.json
  (http://169.254.169.254/openstack/2018-08-27/network_data.json) should
  return something like:

  {
"links": [
  {
"id": "tapddc91085-96",
"vif_id": "ddc91085-9650-4b7b-ad9d-b475bac8ec8b",
"type": "ovs",
"mtu": 1442,
"ethernet_mac_address": "fa:16:3e:93:49:fa"
  }
],
"networks": [
  {
"id": "network0",
"type": "ipv4_dhcp",
"link": "tapddc91085-96",
"network_id": "9f61a3a7-26d3-4013-b61d-12880b325ea9"
  }
],
"services": []
  }

  Actual result
  =

  {
"links": [
  {
"id": "tapddc91085-96",
"vif_id": "ddc91085-9650-4b7b-ad9d-b475bac8ec8b",
"type": "ovs",
"mtu": 1442,
"ethernet_mac_address": "fa:16:3e:93:49:fa"
  }
],
"networks": [
  {
"id": "network0",
"type": "ipv4",
"link": "tapddc91085-96",
"ip_address": "10.0.0.40",
"netmask": "255.255.255.0",
"routes": [],
"network_id": "9f61a3a7-26d3-4013-b61d-12880b325ea9",
"services": []
  }
],
"services": []
  }

  Environment
  ===

  Openstack Zed with Neutron OVN feature enabled

  Nova: 26.2.1

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2055245/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2058928] [NEW] instance action events are b0rked AF

2024-03-25 Thread sean mooney
Public bug reported:

Long version: instance actions are meant to have a start and an ending,

ideally one of success or failure, with the option to have intermediate
events for complex operations like resize.


For at least interface attach and detach that does not happen, and possibly for
all instance actions that are casts…

We are sending notifications for attach/detach start, end and failure
referencing the instance action type in the notification, i.e.
interface_attach.start and interface_attach.end.
But we are not recording any finish events, which means today there is no
non-racy way to poll the detach action for its completion without resorting to
instance show and parsing the address field to see the IP go away…
Note that that is also cached, so it won't happen until the network info cache
for the instance is updated, and that only works for events that have a visible
side effect observable on the instance object.

We should fix this for all instance actions, add functional test
coverage, and assert that when they complete, with error or success, we
have actually updated the db with the event completion.

Today in the functional tests we use the notifications or fields on the
server object to know if an action is complete but never check the instance
action events in the db.

We already have test helpers to poll for the completion of instance
action events (IAEs)
https://github.com/openstack/nova/blob/master/nova/tests/functional/integrated_helpers.py#L173-L206

But we don't use them for volume detach, for example
https://github.com/openstack/nova/blob/master/nova/tests/functional/integrated_helpers.py#L223-L238
because we don't complete the action by sending the event. We have test helpers
for most of the instance actions in this file
https://github.com/openstack/nova/blob/master/nova/tests/functional/integrated_helpers.py#L223-L238
and they all either wait for state changes on the server or notifications to
know when the action has completed, because we are missing the
code to complete the event…
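
A hedged illustration of the kind of polling such functional coverage could do
(the api.get_instance_actions and api.get_instance_action_details calls below
are illustrative stand-ins, not nova's existing helpers or client methods):

    import time

    def wait_for_action_event(api, server_id, action_name, event_name,
                              timeout=60, interval=1):
        deadline = time.time() + timeout
        while time.time() < deadline:
            for action in api.get_instance_actions(server_id):
                if action['action'] != action_name:
                    continue
                details = api.get_instance_action_details(
                    server_id, action['request_id'])
                for event in details.get('events', []):
                    # Only a recorded finish_time proves the event completed.
                    if event['event'] == event_name and event.get('finish_time'):
                        return event  # caller asserts event['result'] == 'Success'
            time.sleep(interval)
        raise AssertionError('%s/%s never recorded a finish event'
                             % (action_name, event_name))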

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: api compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2058928

Title:
  instance action events are b0rked AF

Status in OpenStack Compute (nova):
  New

Bug description:
  Long version: instance actions are meant to have a start and an ending,

  ideally one of success or failure, with the option to have intermediate
  events for complex operations like resize.


  For at least interface attach and detach that does not happen, and possibly
  for all instance actions that are casts…

  We are sending notifications for attach/detach start, end and failure
  referencing the instance action type in the notification, i.e.
  interface_attach.start and interface_attach.end.
  But we are not recording any finish events, which means today there is no
  non-racy way to poll the detach action for its completion without resorting
  to instance show and parsing the address field to see the IP go away…
  Note that that is also cached, so it won't happen until the network info
  cache for the instance is updated, and that only works for events that have
  a visible side effect observable on the instance object.

  We should fix this for all instance actions, add functional test
  coverage, and assert that when they complete, with error or success, we
  have actually updated the db with the event completion.

  Today in the functional tests we use the notifications or fields on the
  server object to know if an action is complete but never check the instance
  action events in the db.

  We already have test helpers to poll for the completion of instance
  action events (IAEs)
  https://github.com/openstack/nova/blob/master/nova/tests/functional/integrated_helpers.py#L173-L206

  But we don't use them for volume detach, for example
  https://github.com/openstack/nova/blob/master/nova/tests/functional/integrated_helpers.py#L223-L238
  because we don't complete the action by sending the event. We have test
  helpers for most of the instance actions in this file
  https://github.com/openstack/nova/blob/master/nova/tests/functional/integrated_helpers.py#L223-L238
  and they all either wait for state changes on the server or notifications to
  know when the action has completed, because we are missing the
  code to complete the event…

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2058928/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2002400] Re: When adding ironic compute host to an aggregate, only one ironic compute node is added to placement aggregate

2024-04-15 Thread sean mooney
** Changed in: nova
   Status: In Progress => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2002400

Title:
  When adding ironic compute host to an aggregate, only one ironic
  compute node is added to placement aggregate

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  The reason seems to be this line
  
https://opendev.org/openstack/nova/src/commit/ba9d4c909beff4e9ab86911a35dd5db8d8ce08d6/nova/compute/api.py#L6646

  nodes = objects.ComputeNodeList.get_all_by_host(context, host_name)
  node_name = nodes[0].hypervisor_hostname

  
  While OK for libvirt and such, this is not OK for compute services that 
manage many 'nodes/hypervisors' - e.g. ironic virt driver.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2002400/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1542491] Re: Scheduler update_aggregates race causes incorrect aggregate information

2024-04-24 Thread sean mooney
** Changed in: nova
   Status: Confirmed => Opinion

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1542491

Title:
  Scheduler update_aggregates race causes incorrect aggregate
  information

Status in OpenStack Compute (nova):
  Opinion
Status in Ubuntu:
  Invalid

Bug description:
  It appears that if nova-api receives simultaneous requests to add a
  server to a host aggregate, then a race occurs that can lead to nova-
  scheduler having incorrect aggregate information in memory.

  One observed effect of this is that sometimes nova-scheduler will
  think a smaller number of hosts are a member of the aggregate than is
  in the nova database and will filter out a host that should not be
  filtered.

  Restarting nova-scheduler fixes the issue, as it reloads the aggregate
  information on startup.

  Nova package versions: 1:2015.1.2-0ubuntu2~cloud0

  Reproduce steps:

  Create a new os-aggregate and then populate an os-aggregate with
  simultaneous API POSTs, note timestamps:

  2016-02-04 20:17:08.538 13648 INFO nova.osapi_compute.wsgi.server 
[req-d07a006e-134a-46d8-9815-6becec5b185c 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.3 "POST 
/v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates HTTP/1.1" status: 200 len: 
439 time: 0.1865470
  2016-02-04 20:17:09.204 13648 INFO nova.osapi_compute.wsgi.server 
[req-a0402297-9337-46d6-96d2-066e230e45e1 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST 
/v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 
200 len: 506 time: 0.2995598
  2016-02-04 20:17:09.243 13648 INFO nova.osapi_compute.wsgi.server 
[req-0f543525-c34e-418a-91a9-894d714ee95b 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST 
/v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 
200 len: 519 time: 0.3140590
  2016-02-04 20:17:09.273 13649 INFO nova.osapi_compute.wsgi.server 
[req-2f8d80b0-726f-4126-a8ab-a2eae3f1a385 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST 
/v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 
200 len: 506 time: 0.3759601
  2016-02-04 20:17:09.275 13649 INFO nova.osapi_compute.wsgi.server 
[req-80ab6c86-e521-4bf0-ab67-4de9d0eccdd3 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.1 "POST 
/v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 
200 len: 506 time: 0.3433032

  Schedule a VM

  Expected Result:
  nova-scheduler Availability Zone filter returns all members of the aggregate

  Actual Result:
  nova-scheduler believes there is only one hypervisor in the aggregate. The 
number will vary as it is a race:

  2016-02-05 07:48:04.411 13600 DEBUG nova.filters 
[req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] Starting with 4 host(s) 
get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:70
  2016-02-05 07:48:04.411 13600 DEBUG nova.filters 
[req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] Filter RetryFilter returned 4 host(s) 
get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:84
  2016-02-05 07:48:04.412 13600 DEBUG 
nova.scheduler.filters.availability_zone_filter 
[req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. 
(oshv0, oshv0) ram:122691 disk:13404160 io_ops:0 instances:0 has AZs: nova 
host_passes 
/usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62
  2016-02-05 07:48:04.412 13600 DEBUG 
nova.scheduler.filters.availability_zone_filter 
[req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. 
(oshv2, oshv2) ram:122691 disk:13403136 io_ops:0 instances:0 has AZs: nova 
host_passes 
/usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62
  2016-02-05 07:48:04.413 13600 DEBUG 
nova.scheduler.filters.availability_zone_filter 
[req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. 
(oshv1, oshv1) ram:122691 disk:13404160 io_ops:0 instances:0 has AZs: nova 
host_passes 
/usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62
  2016-02-05 07:48:04.413 13600 DEBUG nova.filters 
[req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] Filter AvailabilityZoneFilter returned 
1 host(s) get_filtered_objects 
/usr/lib/python2.7/dist-pack

[Yahoo-eng-team] [Bug 1542491] Re: Scheduler update_aggregates race causes incorrect aggregate information

2024-04-29 Thread sean mooney
Setting this to medium severity.

There is an existing race in how the cache is updated.
The workaround is to periodically restart the scheduler to clear the cache.

This looks like it affects all stable releases of OpenStack;
however, it's unlikely but not impossible that a fix for this can be backported.

Given the above I'm marking this as medium, as there is a relatively simple
workaround even if the detection of the
issue is not trivial.
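
A hedged illustration of the lost-update pattern behind that race (plain
Python, not nova's scheduler code): two "simultaneous" updates each start from
the same snapshot of the cached membership, so the second write discards the
first.

    cache = {'agg1': set()}

    snapshot_a = set(cache['agg1'])   # request A reads the cache
    snapshot_b = set(cache['agg1'])   # request B reads the cache concurrently

    snapshot_a.add('oshv0')
    cache['agg1'] = snapshot_a        # A writes back {'oshv0'}

    snapshot_b.add('oshv1')
    cache['agg1'] = snapshot_b        # B writes back {'oshv1'}, losing 'oshv0'

    print(cache['agg1'])              # {'oshv1'} -- one host is now missing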


** Changed in: nova
   Importance: Undecided => Medium

** Changed in: nova
   Status: Opinion => Triaged

** Changed in: nova
 Assignee: jingtao (liang888) => (unassigned)

** Tags added: api

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1542491

Title:
  Scheduler update_aggregates race causes incorrect aggregate
  information

Status in OpenStack Compute (nova):
  Triaged
Status in Ubuntu:
  Invalid

Bug description:
  It appears that if nova-api receives simultaneous requests to add a
  server to a host aggregate, then a race occurs that can lead to nova-
  scheduler having incorrect aggregate information in memory.

  One observed effect of this is that sometimes nova-scheduler will
  think a smaller number of hosts are a member of the aggregate than is
  in the nova database and will filter out a host that should not be
  filtered.

  Restarting nova-scheduler fixes the issue, as it reloads the aggregate
  information on startup.

  Nova package versions: 1:2015.1.2-0ubuntu2~cloud0

  Reproduce steps:

  Create a new os-aggregate and then populate an os-aggregate with
  simultaneous API POSTs, note timestamps:

  2016-02-04 20:17:08.538 13648 INFO nova.osapi_compute.wsgi.server 
[req-d07a006e-134a-46d8-9815-6becec5b185c 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.3 "POST 
/v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates HTTP/1.1" status: 200 len: 
439 time: 0.1865470
  2016-02-04 20:17:09.204 13648 INFO nova.osapi_compute.wsgi.server 
[req-a0402297-9337-46d6-96d2-066e230e45e1 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST 
/v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 
200 len: 506 time: 0.2995598
  2016-02-04 20:17:09.243 13648 INFO nova.osapi_compute.wsgi.server 
[req-0f543525-c34e-418a-91a9-894d714ee95b 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST 
/v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 
200 len: 519 time: 0.3140590
  2016-02-04 20:17:09.273 13649 INFO nova.osapi_compute.wsgi.server 
[req-2f8d80b0-726f-4126-a8ab-a2eae3f1a385 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST 
/v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 
200 len: 506 time: 0.3759601
  2016-02-04 20:17:09.275 13649 INFO nova.osapi_compute.wsgi.server 
[req-80ab6c86-e521-4bf0-ab67-4de9d0eccdd3 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.1 "POST 
/v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 
200 len: 506 time: 0.3433032

  Schedule a VM

  Expected Result:
  nova-scheduler Availability Zone filter returns all members of the aggregate

  Actual Result:
  nova-scheduler believes there is only one hypervisor in the aggregate. The 
number will vary as it is a race:

  2016-02-05 07:48:04.411 13600 DEBUG nova.filters 
[req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] Starting with 4 host(s) 
get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:70
  2016-02-05 07:48:04.411 13600 DEBUG nova.filters 
[req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] Filter RetryFilter returned 4 host(s) 
get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:84
  2016-02-05 07:48:04.412 13600 DEBUG 
nova.scheduler.filters.availability_zone_filter 
[req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. 
(oshv0, oshv0) ram:122691 disk:13404160 io_ops:0 instances:0 has AZs: nova 
host_passes 
/usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62
  2016-02-05 07:48:04.412 13600 DEBUG 
nova.scheduler.filters.availability_zone_filter 
[req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 
326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. 
(oshv2, oshv2) ram:122691 disk:13403136 io_ops:0 instances:0 has AZs: nova 
host_passes 
/usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62
  2016-02-05 07:48:04.413 13600 DEBUG 
nova.scheduler.filters.availability_zone_filter 
[req-c

[Yahoo-eng-team] [Bug 2073862] Re: test_vmdk_bad_descriptor_mem_limit and test_vmdk_bad_descriptor_mem_limit_stream_optimized fail if qemu-img binary is missing

2024-07-30 Thread sean mooney
** Also affects: nova/bobcat
   Importance: Undecided
   Status: New

** Also affects: nova/antelope
   Importance: Undecided
   Status: New

** Also affects: nova/2024.1
   Importance: Undecided
   Status: New

** Changed in: nova
   Importance: Undecided => Low

** Changed in: nova/antelope
   Importance: Undecided => Low

** Changed in: nova/antelope
   Status: New => Triaged

** Changed in: nova/2024.1
   Status: New => Triaged

** Changed in: nova/bobcat
   Status: New => Triaged

** Changed in: nova/2024.1
   Importance: Undecided => Low

** Changed in: nova/bobcat
   Importance: Undecided => Low

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2073862

Title:
  test_vmdk_bad_descriptor_mem_limit and
  test_vmdk_bad_descriptor_mem_limit_stream_optimized fail if qemu-img
  binary is missing

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) 2024.1 series:
  In Progress
Status in OpenStack Compute (nova) antelope series:
  Triaged
Status in OpenStack Compute (nova) bobcat series:
  Triaged

Bug description:
  When the qemu-img binary is not present on the system, these tests fail,
  as we can see in these logs:

  ==
  ERROR: 
nova.tests.unit.image.test_format_inspector.TestFormatInspectors.test_vmdk_bad_descriptor_mem_limit
  --
  pythonlogging:'': {{{
  2024-07-23 11:44:54,011 WARNING [oslo_policy.policy] JSON formatted 
policy_file support is deprecated since Victoria release. You need to use YAML 
format which will be default in future. You can use 
``oslopolicy-convert-json-to-yaml`` tool to convert existing JSON-formatted 
policy file to YAML-formatted in backward compatible way: 
https://docs.openstack.org/oslo.policy/latest/cli/oslopolicy-convert-json-to-yaml.html.
  2024-07-23 11:44:54,012 WARNING [oslo_policy.policy] JSON formatted 
policy_file support is deprecated since Victoria release. You need to use YAML 
format which will be default in future. You can use 
``oslopolicy-convert-json-to-yaml`` tool to convert existing JSON-formatted 
policy file to YAML-formatted in backward compatible way: 
https://docs.openstack.org/oslo.policy/latest/cli/oslopolicy-convert-json-to-yaml.html.
  2024-07-23 11:44:54,015 WARNING [oslo_policy.policy] Policy Rules 
['os_compute_api:extensions', 'os_compute_api:os-floating-ip-pools', 
'os_compute_api:os-quota-sets:defaults', 
'os_compute_api:os-availability-zone:list', 'os_compute_api:limits', 
'project_member_api', 'project_reader_api', 'project_member_or_admin', 
'project_reader_or_admin', 'os_compute_api:limits:other_project', 
'os_compute_api:os-lock-server:unlock:unlock_override', 
'os_compute_api:servers:create:zero_disk_flavor', 
'compute:servers:resize:cross_cell', 
'os_compute_api:os-shelve:unshelve_to_host'] specified in policy files are the 
same as the defaults provided by the service. You can remove these rules from 
policy files which will make maintenance easier. You can detect these redundant 
rules by ``oslopolicy-list-redundant`` tool also.
  }}}

  Traceback (most recent call last):
File 
"/home/jlejeune/dev/pci_repos/stash/nova/nova/tests/unit/image/test_format_inspector.py",
 line 408, in test_vmdk_bad_descriptor_mem_limit
  self._test_vmdk_bad_descriptor_mem_limit()
File 
"/home/jlejeune/dev/pci_repos/stash/nova/nova/tests/unit/image/test_format_inspector.py",
 line 382, in _test_vmdk_bad_descriptor_mem_limit
  img = self._create_allocated_vmdk(image_size // units.Mi,
File 
"/home/jlejeune/dev/pci_repos/stash/nova/nova/tests/unit/image/test_format_inspector.py",
 line 183, in _create_allocated_vmdk
  subprocess.check_output(
File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
  return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib/python3.10/subprocess.py", line 526, in run
  raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command 'qemu-img convert -f raw -O vmdk -o 
subformat=monolithicSparse -S 0 
/tmp/tmpw0q0ibvj/nova-unittest-formatinspector--monolithicSparse-wz0i4kj1.raw 
/tmp/tmpw0q0ibvj/nova-unittest-formatinspector--monolithicSparse-qpo78jee.vmdk' 
returned non-zero exit status 127.


  
  ==
  ERROR: 
nova.tests.unit.image.test_format_inspector.TestFormatInspectors.test_vmdk_bad_descriptor_mem_limit_stream_optimized
  --
  pythonlogging:'': {{{
  2024-07-23 11:43:31,443 WARNING [oslo_policy.policy] JSON formatted 
policy_file support is deprecated since Victoria release. You need to use YAML 
format which will be default in future. You can use 
``oslopolicy-convert-json
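
  The failing tests shell out to qemu-img via subprocess, so a missing binary 
surfaces as exit status 127 (command not found). A minimal sketch of guarding 
such tests (not the actual nova fix; the class and method names are illustrative):

  import shutil
  import subprocess
  import unittest

  class TestVMDKInspector(unittest.TestCase):

      # skip rather than error out when the qemu-img binary is not installed
      @unittest.skipUnless(shutil.which("qemu-img"),
                           "qemu-img binary is required")
      def test_qemu_img_conversion_available(self):
          # the real tests invoke "qemu-img convert" to build sparse VMDK
          # images; without the binary that command exits with status 127
          out = subprocess.check_output(["qemu-img", "--version"])
          self.assertIn(b"qemu-img", out)

  if __name__ == "__main__":
      unittest.main()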

[Yahoo-eng-team] [Bug 2033401] Re: sanitize_hostname is not alligned with idna2 specification

2024-07-30 Thread sean mooney
Nova does not support internationalised hostnames, so it does not support
https://www.rfc-editor.org/rfc/rfc5891

the conversion of the display name to a hostname is best effort and
we make no guarantee of its validity for DNS.

the conversion utility is intended to produce a valid hostname but it
is not intended to be a domain name.

nova could be enhanced to provide that functionality but i would be
more inclined to remove the defaulting of the host name by converting
the display name and instead use the other fallback we already have,
which is to default to server- in a new API microversion.


** Changed in: nova
   Status: In Progress => Opinion

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2033401

Title:
  sanitize_hostname is not alligned with idna2 specification

Status in OpenStack Compute (nova):
  Opinion

Bug description:
  dnsmasq was switched to the IDN2 specification more than 4 years ago in the 
Debian package [0]
  According to the specification, a name with -- in the 3rd and 4th characters is 
not allowed. See RFC 5891 [1]
  As a result, hostnames such as (rf--xx) generate an error on the dnsmasq side 
and no longer work

  Aug 29 10:55:32 dnsmasq[243]: bad DHCP host name at line 2 of
  /var/lib/neutron/dhcp/6531ba54-0aa1-4b3b-b098-49bb0cfd586b/host

  cat /var/lib/neutron/dhcp/6531ba54-0aa1-4b3b-b098-49bb0cfd586b/host
  
fa:16:3e:d9:ba:17,amphora-ccee6c76-e565-496d-b841-f485a99dc865.openstack.internal.,10.10.10.142
  
fa:16:3e:c8:93:56,re--test-database-7ezitojxojun-server-01-lrdygbkrxkho.openstack.internal.,10.10.10.209
  fa:16:3e:29:dc:fc,host-10-10-10-45.openstack.internal.,10.10.10.45
  fa:16:3e:1a:be:3f,host-10-10-10-103.openstack.internal.,10.10.10.103
  fa:16:3e:bd:ab:2a,host-10-10-10-1.openstack.internal.,10.10.10.1
  fa:16:3e:df:b7:c1,host-10-10-10-118.openstack.internal.,10.10.10.118

  [0] 
https://github.com/imp/dnsmasq/commit/5a9133498562a0b69b287ad675ed3946803ea90c
  [1] https://www.rfc-editor.org/rfc/rfc5891#section-4.2.3.1
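
  A minimal sketch of the RFC 5891 rule in question (a hypothetical helper, not 
nova's sanitize_hostname): a label may only carry "--" in its 3rd and 4th 
characters when it is an A-label, i.e. when it starts with the ACE prefix 
"xn--"; names such as "rf--xx" violate this and are rejected by IDNA2008-aware 
dnsmasq:

  def is_valid_idna2008_label(label: str) -> bool:
      # RFC 5891 section 4.2.3.1: hyphens in the 3rd and 4th positions are
      # only allowed for A-labels, which begin with the ACE prefix "xn--"
      if label[2:4] == "--" and not label.lower().startswith("xn--"):
          return False
      return True

  print(is_valid_idna2008_label("rf--xx"))            # False
  print(is_valid_idna2008_label("xn--bcher-kva"))     # True (A-label for "bücher")
  print(is_valid_idna2008_label("host-10-10-10-45"))  # True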

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2033401/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2067757] Re: AMD server do not support nested virtualization

2024-08-14 Thread sean mooney
that is not the reason; rhel based distros disable nested virt on amd by
default if i recall correctly, and you have to explicitly enable it.

it is not supported on RHEL and is considered tech preview, as there are several 
known bugs.
intel is also not supported downstream for production workloads, however it is 
much more mature and i believe it is enabled by default.

nova is not filtering out svm.

setting cpu_mode=none is effectively the same as cpu_mode=host-model,

so either libvirt is disabling it or it is a kernel default issue.

in either case i don't think this is a valid nova bug.
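
a quick diagnostic sketch (outside nova; the paths are standard sysfs/procfs 
locations) for checking whether the kernel enables nested virtualization for 
kvm_amd and whether the host CPU advertises svm, which helps distinguish a 
kernel-default issue from libvirt filtering:

from pathlib import Path

def nested_enabled(module: str = "kvm_amd") -> bool:
    # the kernel exposes this module parameter as "1"/"Y" when nested virt is on
    param = Path(f"/sys/module/{module}/parameters/nested")
    return param.exists() and param.read_text().strip() in ("1", "Y", "y")

def host_advertises_svm() -> bool:
    # svm must appear in the host CPU flags for AMD nested virtualization
    for line in Path("/proc/cpuinfo").read_text().splitlines():
        if line.startswith("flags"):
            return "svm" in line.split()
    return False

if __name__ == "__main__":
    print("kvm_amd nested:", nested_enabled())
    print("host advertises svm:", host_advertises_svm())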

** Changed in: nova
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2067757

Title:
  AMD server do not support nested virtualization

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  From Linux kernel v4.19 onwards, the nested KVM parameter is enabled
  by default for Intel and AMD (though your Linux distribution might
  override this default). The official documentation is here:
  https://www.kernel.org/doc/html/v5.7/virt/kvm/running-nested-guests.html

  We are using OpenStack Zed on CentOS 9 and the VM is running on AMD
  compute nodes, and the kernel version is: 5.14.0-386.el9.x86_64.

  When we created an instance on an AMD server and set the "cpu_mode" to
  "none", we found that the "svm" feature is passed to the instance XML
  on libvirt, but when we execute "lscpu" inside the VM, we cannot see
  the "svm" feature, so we could not create an L2 instance inside the VM.

  However, when we set the "cpu_mode" to "host-passthrough" and hard
  reboot the VM, the "svm" is set correctly within the VM.

  For intel servers, we can create nested instances by default, and the
  "cpu_mode" is also set to "none", and everything works well.

  We guess it might be because of some CPU feature dependencies which cause
  this issue. Can you help us take a look? Thanks

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2067757/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2059800] Re: Image download immediately fails when glance returns 500

2024-09-03 Thread sean mooney
** Changed in: nova
   Status: In Progress => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2059800

Title:
  Image download immediately fails when glance returns 500

Status in OpenStack Compute (nova):
  Won't Fix

Bug description:
  Description
  ===
  nova-compute downloads a vm image from glance when launching an instance. It 
retries requests when it gets 503, but it does not when it gets 500.
  When glance uses the cinder backend and an image volume is still in use (for 
example because another client is downloading the same image), glance returns 
500 and this results in immediate instance creation failure.
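
  A minimal sketch of that behaviour (illustrative names only, not nova's actual 
download code): a transient 503 is retried, while a 500 aborts the download on 
the first attempt:

  import time

  class RetryableError(Exception):
      """Raised for responses treated as transient (e.g. 503)."""

  def download_with_retries(fetch, attempts=3, delay=1.0):
      # retry only when fetch() signals a transient failure
      for attempt in range(1, attempts + 1):
          try:
              return fetch()
          except RetryableError:
              if attempt == attempts:
                  raise
              time.sleep(delay)

  def fetch_image(status_code=500):
      if status_code == 503:
          raise RetryableError("glance temporarily unavailable")
      if status_code >= 500:
          # a 500 is not mapped to a retryable error, so it propagates at once
          raise RuntimeError("glance returned %d" % status_code)
      return b"image-bytes"

  try:
      download_with_retries(fetch_image)
  except RuntimeError as exc:
      print("download failed immediately:", exc)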

  Steps to reproduce
  ==
  * Deploy glance with cinder image store
  * Upload an image
  * Create an instance booted from the image, while downloading the image 
in the background

  Expected result
  ===
  Instance creation succeeds

  Actual result
  =
  Instance creation fails because of 500 error from glance

  Environment
  ===
  This has been seen in Puppet OpenStack integration job, which uses RDO master.

  Logs & Configs
  ==
  Example failure can be found in 
https://zuul.opendev.org/t/openstack/build/fc0e584a70f947d988ac057a8cc991c2

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2059800/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2080556] [NEW] old nova instance cant be started on post victoria deployments

2024-09-12 Thread sean mooney
Public bug reported:

Downstream we had an interesting bug report
https://bugzilla.redhat.com/show_bug.cgi?id=2311875

Instances created after liberty but before victoria
that request a numa topology but do not have CPU pinning
cannot be started on post-victoria nova.

as part of the 
https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/cpu-resources.html
spec we started tracking cpus as PCPU and VCPU resource classes, but since a 
given instance
would either have pinned cpus or floating cpus, no changes to the instance 
numa topology object
were required.

with the introduction of mixed cpus in a single instance

https://specs.openstack.org/openstack/nova-
specs/specs/victoria/implemented/use-pcpu-vcpu-in-one-instance.html

the instance numa topology object was extended with a new pcpuset field.

as part of that work the _migrate_legacy_object function was extended to 
default pcpuset to an empty set
https://github.com/openstack/nova/commit/867d4471013bf6a70cd3e9e809daf80ea358df92#diff-ed76deb872002cf64931c6d3f2d5967396240dddcb93da85f11886afc7dc4333R212
for numa topologies that predate ovo

and

a new _migrate_legacy_dedicated_instance_cpuset function was added to
migrate existing pinned instances and instances with ovo in the db.

what we missed in the review is that unpinned guests should have had the 
cell.pcpuset set to the empty set
here
https://github.com/openstack/nova/commit/867d4471013bf6a70cd3e9e809daf80ea358df92#diff-ed76deb872002cf64931c6d3f2d5967396240dddcb93da85f11886afc7dc4333R178

The new field is not nullable and is not present in the existing json-serialised 
object, so accessing cell.pcpuset on objects returned from the db will raise a 
NotImplementedError because it is unset if the VM was created between liberty 
and victoria.
this only applies to non-pinned vms with a numa topology i.e. 
hw:mem_page_size= or hw:numa_nodes=
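
a self-contained sketch of the failure mode (illustrative classes, not nova's 
real versioned objects): a numa cell serialised before victoria has no pcpuset 
key, so reading the new field raises unless the legacy-migration step also 
defaults it to an empty set for unpinned guests:

class InstanceNUMACell:
    """Stand-in for an OVO-style object backed by a DB-serialised dict."""

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        try:
            return self._data[name]
        except KeyError:
            # mirrors OVO behaviour when a non-nullable field was never set
            raise NotImplementedError("field %r is not set" % name)

def migrate_legacy_cell(data):
    # the step that was missed: unpinned cells also need pcpuset defaulted
    data.setdefault("pcpuset", set())
    return data

legacy = {"cpuset": {0, 1, 2, 3}}  # pre-victoria, unpinned guest with a numa topology

try:
    InstanceNUMACell(legacy).pcpuset
except NotImplementedError as exc:
    print("unmigrated cell:", exc)

print("after migration:", InstanceNUMACell(migrate_legacy_cell(dict(legacy))).pcpuset)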

** Affects: nova
 Importance: High
 Assignee: sean mooney (sean-k-mooney)
 Status: In Progress


** Tags: numa

** Changed in: nova
 Assignee: (unassigned) => sean mooney (sean-k-mooney)

** Changed in: nova
   Importance: Undecided => High

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2080556

Title:
  old nova instance cant be started on post victoria deployments

Status in OpenStack Compute (nova):
  In Progress

Bug description:
  Downstream we had an interesting bug report
  https://bugzilla.redhat.com/show_bug.cgi?id=2311875

  Instances created after liberty but before victoria
  that request a numa topology but do not have CPU pinning
  cannot be started on post-victoria nova.

  as part of the 
  https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/cpu-resources.html
  spec we started tracking cpus as PCPU and VCPU resource classes, but since a 
given instance
  would either have pinned cpus or floating cpus, no changes to the instance 
numa topology object
  were required.

  with the introduction of mixed cpus in a single instance

  https://specs.openstack.org/openstack/nova-
  specs/specs/victoria/implemented/use-pcpu-vcpu-in-one-instance.html

  the instance numa topology object was extended with a new pcpuset
  field.

  as part of that work the _migrate_legacy_object function was extended to 
default pcpuset to an empty set
  
https://github.com/openstack/nova/commit/867d4471013bf6a70cd3e9e809daf80ea358df92#diff-ed76deb872002cf64931c6d3f2d5967396240dddcb93da85f11886afc7dc4333R212
  for numa topologies that predate ovo

  and

  a new _migrate_legacy_dedicated_instance_cpuset function was added to
  migrate existing pinned instances and instances with ovo in the db.

  what we missed in the review is that unpinned guests should have had the 
cell.pcpuset set to the empty set
  here
  
https://github.com/openstack/nova/commit/867d4471013bf6a70cd3e9e809daf80ea358df92#diff-ed76deb872002cf64931c6d3f2d5967396240dddcb93da85f11886afc7dc4333R178

  The new field is not nullable and is not present in the existing json-serialised 
object, so accessing cell.pcpuset on objects returned from the db will raise 
a NotImplementedError because it is unset if the VM was created between liberty 
and victoria.
  this only applies to non-pinned vms with a numa topology i.e. 
  hw:mem_page_size= or hw:numa_nodes=

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2080556/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1969794] Re: backport of the fix for bug #1947370 make lock_path a requird config option when prvisouls it was optional

2022-04-21 Thread sean mooney
** Also affects: nova
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1969794

Title:
  backport of the fix for bug  #1947370 make lock_path a requird config
  option when prvisouls it was optional

Status in OpenStack Compute (nova):
  New
Status in os-brick:
  New

Bug description:
  https://review.opendev.org/q/topic:bug%252F1947370

  as part of fixing bug 1947370 (https://launchpad.net/bugs/1947370)
  https://review.opendev.org/c/openstack/os-brick/+/814139
  made the external lock_path config option required with no default provided

  this was then backported, breaking nova unit tests on stable branches and 
potentially
  any deployment that upgrades to a new version of os-brick without this defined.

  i don't believe that such a backport is in line with stable policy, and if it 
was to be backported
  a sane default like /tmp/os_brick_lock would be required to not break 
existing installs.

  this is currently breaking downstream unit tests for redhat osp 17 and
  it is also breaking the upstream stable wallaby unit tests for nova.

  it is unclear if this has directly broken any real world deployment
  but it has the potential to.

  as noted in this revert patch 
https://review.opendev.org/c/openstack/os-brick/+/838871
  it is trivial to reproduce this

  
  git clone https://opendev.org/openstack/nova nova-test
  cd nova-test
  git checkout --track origin/stable/wallaby
  tox -e py3

  ^ this should fail with the lock_path exception

  cd ..
  git clone https://opendev.org/openstack/os-brick os-brick-revert
  cd os-brick-revert
  git fetch https://review.opendev.org/openstack/os-brick 
refs/changes/71/838871/1 && git checkout FETCH_HEAD
  cd ../nova-test
  .tox/py3/bin/python3 -m pip install -e ../os-brick-revert
  tox -e py3

  that will no longer have the lock_path error

  .tox/py38/bin/python3 -m pip install os-brick\<4.3.3

  
  while I'm not sure the revert is the correct way to proceed, we will need to 
blacklist the broken os-brick release in the requirements repo and come up with 
a backportable fix for all affected branches.
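
  a hedged sketch of the workaround (this assumes oslo.concurrency's lockutils 
API; the lock name and path below are illustrative): configure lock_path 
explicitly so external file locks can be taken, rather than relying on a 
default the affected os-brick releases no longer provide:

  from oslo_concurrency import lockutils

  # equivalent to setting [oslo_concurrency]/lock_path in the service config
  lockutils.set_defaults(lock_path="/var/lib/nova/tmp")

  @lockutils.synchronized("connect_volume", external=True)
  def connect_volume():
      # without lock_path configured, acquiring an external lock here fails
      # on the affected os-brick releases
      pass

  connect_volume()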

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1969794/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp

