[Yahoo-eng-team] [Bug 2083858] [NEW] nova allows AZs to be renamed if instances are shelved
Public bug reported: Downstream we had a bug report where live migration was failing on the AZ filter after a customer renamed an AZ: https://bugzilla.redhat.com/show_bug.cgi?id=2303395 Nova does not support renaming AZs in general, or moving hosts with instances between AZs. Five years ago, as part of https://bugs.launchpad.net/nova/+bug/1378904, we made the API reject renames when instances were on hosts: https://github.com/openstack/nova/commit/8e19ef4173906da0b7c761da4de0728a2fd71e24 We have since closed another edge case with https://github.com/openstack/nova/commit/3c0eadae0b9ec48586087ea6c0c4e9176f0aa3bc In both cases we missed the fact that if an instance is pinned to an AZ and then shelved it won't be considered as "on a host", so the safety checks we added for updating the AZ metadata or adding/removing hosts from an AZ do not account for shelved instances. If a shelved instance is pinned to a host or AZ and you update the host membership, rename the AZ, or delete the AZ, it is still possible to unshelve the instance, but the request spec will refer to the now deleted/renamed AZ. It is only possible to unshelve today on master because we have removed the AZ filter, and when using placement the host aggregate AZ will not have changed its UUID even if the AZ name has changed. We have two potential issues that we should fix: when updating an AZ name, we should check whether it is referenced in any request spec for any non-deleted instance; when adding or removing a host from a host aggregate (with AZ metadata), we should check whether a request spec refers to the host and, if it does, whether the AZ in the request spec is compatible. This bug is primarily about the first case, as I was able to reproduce it with Horizon. The second case is speculation that I believe could happen, and we should consider it when fixing this: either prove it can happen and address it in this bug or a separate one, or note that it is blocked for some other reason or otherwise out of scope.
** Affects: nova Importance: Medium Status: Triaged
** Tags: availability-zones placement
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2083858
Title: nova allows AZs to be renamed if instances are shelved
Status in OpenStack Compute (nova): Triaged
Bug description: Downstream we had a bug report where live migration was failing on the AZ filter after a customer renamed an AZ: https://bugzilla.redhat.com/show_bug.cgi?id=2303395 Nova does not support renaming AZs in general, or moving hosts with instances between AZs. Five years ago, as part of https://bugs.launchpad.net/nova/+bug/1378904, we made the API reject renames when instances were on hosts: https://github.com/openstack/nova/commit/8e19ef4173906da0b7c761da4de0728a2fd71e24 We have since closed another edge case with https://github.com/openstack/nova/commit/3c0eadae0b9ec48586087ea6c0c4e9176f0aa3bc In both cases we missed the fact that if an instance is pinned to an AZ and then shelved it won't be considered as "on a host", so the safety checks we added for updating the AZ metadata or adding/removing hosts from an AZ do not account for shelved instances. If a shelved instance is pinned to a host or AZ and you update the host membership, rename the AZ, or delete the AZ, it is still possible to unshelve the instance, but the request spec will refer to the now deleted/renamed AZ. It is only possible to unshelve today on master because we have removed the AZ filter, and when using placement the host aggregate AZ will not have changed its UUID even if the AZ name has changed. We have two potential issues that we should fix: when updating an AZ name, we should check whether it is referenced in any request spec for any non-deleted instance; when adding or removing a host from a host aggregate (with AZ metadata), we should check whether a request spec refers to the host and, if it does, whether the AZ in the request spec is compatible. This bug is primarily about the first case, as I was able to reproduce it with Horizon. The second case is speculation that I believe could happen, and we should consider it when fixing this: either prove it can happen and address it in this bug or a separate one, or note that it is blocked for some other reason or otherwise out of scope.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2083858/+subscriptions
--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help : https://help.launchpad.net/ListHelp
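A minimal sketch of the first proposed check, assuming a hypothetical helper that can list the availability zone recorded in each non-deleted instance's request spec (the names and data shapes below are illustrative, not nova's actual objects):

```python
# Illustrative sketch only: reject an AZ rename/delete if any non-deleted
# instance (including shelved ones) still references the old AZ name in its
# request spec. The request_specs structure is a stand-in for whatever nova
# would actually query; it is not nova's real API.

class AZInUse(Exception):
    pass

def check_az_not_referenced(old_az_name, request_specs):
    """request_specs: iterable of dicts like
    {'instance_uuid': ..., 'availability_zone': ..., 'deleted': bool}."""
    offenders = [
        spec['instance_uuid'] for spec in request_specs
        if not spec.get('deleted') and spec.get('availability_zone') == old_az_name
    ]
    if offenders:
        raise AZInUse(
            f"AZ '{old_az_name}' is still referenced by instances: {offenders}")

# Example: a shelved instance pinned to 'az1' should block renaming 'az1'.
specs = [{'instance_uuid': 'abc', 'availability_zone': 'az1', 'deleted': False}]
try:
    check_az_not_referenced('az1', specs)
except AZInUse as exc:
    print(exc)
```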
[Yahoo-eng-team] [Bug 2085124] [NEW] HTTP exception thrown: Flavor has hw:virtio_packed_ring extra spec explicitly set to True, conflicting with image which has hw_virtio_packed_ring explicitly set to
Public bug reported: The flavor/image conflict check for the virtio packed ring format is not correctly converting the values to booleans when comparing them. As a result, the comparison is case-sensitive when it should not be.
** Affects: nova Importance: Low Status: Triaged
** Tags: api libvirt
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2085124
Title: HTTP exception thrown: Flavor has hw:virtio_packed_ring extra spec explicitly set to True, conflicting with image which has hw_virtio_packed_ring explicitly set to true.
Status in OpenStack Compute (nova): Triaged
Bug description: The flavor/image conflict check for the virtio packed ring format is not correctly converting the values to booleans when comparing them. As a result, the comparison is case-sensitive when it should not be.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2085124/+subscriptions
--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help : https://help.launchpad.net/ListHelp
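A minimal sketch of a case-insensitive comparison using oslo.utils (which nova already depends on); the function name and structure are illustrative, not nova's actual code:

```python
# Illustrative sketch: compare the flavor extra spec and image property as
# booleans so "True", "true" and "1" are treated the same.
from oslo_utils import strutils

def packed_ring_conflict(flavor_value, image_value):
    """Return True if the two settings disagree after boolean normalisation."""
    flavor_bool = strutils.bool_from_string(flavor_value, strict=True)
    image_bool = strutils.bool_from_string(image_value, strict=True)
    return flavor_bool != image_bool

# "True" (flavor extra spec) vs "true" (image property) is not a conflict.
assert packed_ring_conflict("True", "true") is False
```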
[Yahoo-eng-team] [Bug 2078669] Re: Specify --availability-zone=nova does not work since caracal
Since the introduction of AZs, nova has documented that pinning to the 'nova' AZ should never be done: https://docs.openstack.org/nova/latest/admin/availability-zones.html """ The use of the default availability zone name in requests can be very error-prone. Since the user can see the list of availability zones, they have no way to know whether the default availability zone name (currently nova) is provided because a host belongs to an aggregate whose AZ metadata key is set to nova, or because there is at least one host not belonging to any aggregate. Consequently, it is highly recommended for users to never ever ask for booting an instance by specifying an explicit AZ named nova and for operators to never set the AZ metadata for an aggregate to nova. This can result in some problems due to the fact that the instance AZ information is explicitly attached to nova which could break further move operations when either the host is moved to another aggregate or when the user would like to migrate the instance. """ While it is possible to do this, it is generally considered unsupported and incorrect by the nova core team. Horizon has historically been the leading culprit for people actually using 'nova' in the request, as it places it in the dropdown when creating a VM. This is a Horizon bug and has never been a valid approach: you do not need to specify an AZ when creating a VM in nova, and you should not in this case.
** Changed in: nova Status: In Progress => Opinion
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2078669
Title: Specify --availability-zone=nova does not work since caracal
Status in OpenStack Compute (nova): Opinion
Bug description: openstack server create with --availability-zone=nova can place an instance onto a compute host that belongs to a non-default AZ since the Caracal release. The issue was introduced by the removal of the nova AvailabilityZoneFilter: https://review.opendev.org/c/openstack/nova/+/886779. When using placement to filter AZs we add member_of=XXX, where XXX is the aggregate UUID that corresponds to the AZ. In the case of the default AZ (nova) there is no aggregate; as a result, we request computes without any aggregate/AZ filtering.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2078669/+subscriptions
--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help : https://help.launchpad.net/ListHelp
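A minimal sketch of the filtering behaviour described in the report, assuming a simple dict mapping AZ names to placement aggregate UUIDs (illustrative only; this is not nova's scheduler code):

```python
# Illustrative sketch: when an AZ maps to a placement aggregate, the scheduler
# can add a member_of=<aggregate uuid> query parameter; for the default AZ
# ("nova") there is no aggregate, so no member_of filter is added and every
# compute node is a candidate. The az_to_aggregate mapping is a stand-in.

def placement_query_params(requested_az, az_to_aggregate):
    params = {"resources": "VCPU:1,MEMORY_MB:512"}  # example resource request
    aggregate_uuid = az_to_aggregate.get(requested_az)
    if aggregate_uuid:
        params["member_of"] = aggregate_uuid
    return params

az_map = {"az1": "6f2c7b9a-0000-0000-0000-000000000000"}  # made-up UUID
print(placement_query_params("az1", az_map))   # member_of is set
print(placement_query_params("nova", az_map))  # no member_of -> no AZ filtering
```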
[Yahoo-eng-team] [Bug 2076089] Re: admin cannot force instance launch on disabled host
The disable feature on the compute service is intended to prevent any scheduling of new workload to a disabled host. This is intended to include new workloads and all move operations to a disabled host. The host is being rejected as intended, so setting this to Invalid, as the expectations of the reporter do not match the intended semantics of the API.
** Changed in: nova Status: New => Invalid
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2076089
Title: admin cannot force instance launch on disabled host
Status in OpenStack Compute (nova): Invalid
Bug description: Description === I have a set of disabled nova compute services, with the nova compute service up and running, and I would like to force instance creation, as admin, on a disabled compute node for testing purposes. I added the option --availability-zone nova:$HOST to the openstack server create command; however it fails with "no valid host found" even though it should have skipped placement filters.
Steps to reproduce ==
* openstack compute service list --service nova-compute
| ID | Binary | Host | Zone | Status | State | Updated At |
| cdfe3225-a705-4c30-9f1b-a34be15d89a0 | nova-compute | test1-cg0001 | nova | disabled | down | 2024-07-18T16:03:00.00 |
| f44ad40d-b161-48b0-914a-738638dc10ea | nova-compute | test1-c0001 | nova | enabled | up | 2024-08-05T09:57:08.00 |
| 3e15725b-6b9d-44e9-ae03-fe121d75017c | nova-compute | test1-c0003 | nova | disabled | up | 2024-08-05T09:57:05.00 |
* openstack server create --wait --flavor 016016 --boot-from-volume 20 --image "Debian 12 (Switch Cloud)" --network my_private_network --availability-zone nova:test1-c0003 strider-force-launch
Error creating server: strider-force-launch Error creating server
Expected result === The launch process should have skipped placement filters and the instance should have been launched on the requested hypervisor.
Actual result = * Failure reason is "No valid host found":
openstack server show strider-force-launch
| Field | Value |
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | None |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None |
| OS-EXT-SRV-ATTR:instance_name | instance-1256 |
| OS-EXT-STS:power_state | NOSTATE |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | error |
| OS-SRV-USG:launched_at | None |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses |
[Yahoo-eng-team] [Bug 1913016] Re: nova api os-resetState should not reset the state when VM is shelved_offloaded
** Changed in: nova Status: In Progress => Opinion
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1913016
Title: nova api os-resetState should not reset the state when VM is shelved_offloaded
Status in OpenStack Compute (nova): Opinion
Bug description: When the VM is in the SHELVED_OFFLOADED state the VM doesn't exist physically on any compute node, so resetting the state to active or error might cause DB inconsistency and also make unshelving difficult.
~~~
(overcloud) [stack@undercloud ~]$ nova list
| ID | Name | Status | Task State | Power State | Networks |
| f86f9503-02c3-4c11-bd61-bfd9b9b8ad21 | test2 | SHELVED_OFFLOADED | - | Shutdown | sriov-net1-197=10.74.167.185 |
(overcloud) [stack@undercloud ~]$ openstack server set --state active test2
(overcloud) [stack@undercloud ~]$ openstack server list
| ID | Name | Status | Networks | Image | Flavor |
| f86f9503-02c3-4c11-bd61-bfd9b9b8ad21 | test2 | ACTIVE | sriov-net1-197=10.74.167.185 | rhel7.7 | m1-medium |
(overcloud) [stack@undercloud ~]$ openstack server unshelve test2
Cannot 'unshelve' instance f86f9503-02c3-4c11-bd61-bfd9b9b8ad21 while it is in vm_state active (HTTP 409) (Request-ID: req-c992c5f5-63c9-4472-be75-9594bc682b37)
~~~
Not just unshelve: we cannot perform any VM operation, as the VM doesn't exist anywhere.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1913016/+subscriptions
--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 2086867] Re: [RFE] Allow SCP operations to use IP addresses during Nova migrations
With the merging of https://review.opendev.org/c/openstack/nova/+/909122 this has been resolved on master.
** Also affects: nova/2024.1 Importance: Undecided Status: New
** Also affects: nova/2024.2 Importance: Undecided Status: New
** Also affects: nova/bobcat Importance: Undecided Status: New
** Also affects: nova/2025.1 Importance: Undecided Status: New
** Changed in: nova/2025.1 Status: New => Fix Released
** Changed in: nova/2024.2 Status: New => In Progress
** Changed in: nova/2025.1 Importance: Undecided => Low
** Changed in: nova/2024.2 Importance: Undecided => Low
** Changed in: nova/2025.1 Assignee: (unassigned) => sean mooney (sean-k-mooney)
** Changed in: nova/2024.2 Assignee: (unassigned) => sean mooney (sean-k-mooney)
** Changed in: nova/2024.1 Importance: Undecided => Low
** Changed in: nova/2024.1 Status: New => Triaged
** Changed in: nova/bobcat Importance: Undecided => Low
** Changed in: nova/bobcat Status: New => Triaged
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2086867
Title: [RFE] Allow SCP operations to use IP addresses during Nova migrations
Status in OpenStack Compute (nova): Fix Released
Status in OpenStack Compute (nova) 2024.1 series: Triaged
Status in OpenStack Compute (nova) 2024.2 series: In Progress
Status in OpenStack Compute (nova) 2025.1 series: Fix Released
Status in OpenStack Compute (nova) bobcat series: Triaged
Bug description: When DNS resolution is unavailable in the environment, Nova compute operations that rely on SCP transfers between compute nodes fail because of failed hostname resolution. The proposed solution is to add a new configuration option, [libvirt]migrations_use_ip_to_scp, that allows destination compute nodes to use source compute IP addresses instead of hostnames for SCP operations. When enabled, Nova will look up the source compute's IP address from the database and use it for file transfers.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2086867/+subscriptions
--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help : https://help.launchpad.net/ListHelp
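A minimal sketch of the idea behind the option proposed in the report (the merged change may differ); use_ip_for_scp and lookup_ip_from_db are hypothetical stand-ins, not nova's implementation:

```python
# Illustrative sketch only: pick the SCP target for a migration file copy
# based on a config flag, avoiding DNS when the flag is set.

def scp_target(source_hostname, use_ip_for_scp, lookup_ip_from_db):
    """Return the address the destination node should SCP from."""
    if use_ip_for_scp:
        # Avoid DNS entirely: use the IP address recorded for the source host.
        return lookup_ip_from_db(source_hostname)
    return source_hostname

# Example with a fake lookup table standing in for the database.
ips = {"compute-1.example.org": "192.0.2.10"}
print(scp_target("compute-1.example.org", True, ips.get))   # -> 192.0.2.10
print(scp_target("compute-1.example.org", False, ips.get))  # -> hostname
```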
[Yahoo-eng-team] [Bug 1506127] [NEW] enable vhost-user support with neutron ovs agent
Public bug reported: In the Kilo cycle, vhost-user support was added to nova and supported out of tree via the networking-ovs-dpdk ML2 driver and L2 agent on StackForge. In Liberty, agent modifications were upstreamed to enable the standard neutron Open vSwitch agent to manage the netdev datapath. In Mitaka it is desirable to remove all dependence on the networking-ovs-dpdk repo and enable the standard OVS ML2 driver to support vhost-user on enabled vswitches. To enable vhost-user support, the following changes are proposed to the neutron Open vSwitch agent and ML2 mechanism driver. AGENT CHANGES: To determine if a vswitch supports vhost-user interfaces, two pieces of information are required: the bridge datapath_type and the list of supported interfaces from the ovsdb. The datapath_type field is required to ensure the node is configured to use the DPDK-enabled netdev datapath. The supported interface types field in the ovsdb contains a list of all supported interface types for all supported datapath_types. If the ovs-vswitchd process has been compiled with support for DPDK interfaces but not started with DPDK enabled, DPDK interfaces will be omitted from this list. The OVS neutron agent will be extended to query the supported interfaces parameter in the ovsdb and append it to the configuration section of the agent state report. The OVS neutron agent will also be extended to append the configured datapath_type to the configuration section of the agent state report. The OVS lib will be extended to retrieve the supported interfaces from the ovsdb. ML2 DRIVER CHANGES: The OVS ML2 driver will be extended to consult the agent configuration when selecting the vif type and vif binding details to install. If the datapath is netdev and the supported interface types contain vhost-user, it will be enabled; in all other cases it will fall back to the current behavior. This mechanism will allow easy extension of the OVS neutron agent to support other OVS interface types in the future if enabled in nova.
** Affects: neutron Importance: Undecided Assignee: sean mooney (sean-k-mooney) Status: New
** Tags: rfe
** Changed in: neutron Assignee: (unassigned) => sean mooney (sean-k-mooney)
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1506127
Title: enable vhost-user support with neutron ovs agent
Status in neutron: New
Bug description: In the Kilo cycle, vhost-user support was added to nova and supported out of tree via the networking-ovs-dpdk ML2 driver and L2 agent on StackForge. In Liberty, agent modifications were upstreamed to enable the standard neutron Open vSwitch agent to manage the netdev datapath. In Mitaka it is desirable to remove all dependence on the networking-ovs-dpdk repo and enable the standard OVS ML2 driver to support vhost-user on enabled vswitches. To enable vhost-user support, the following changes are proposed to the neutron Open vSwitch agent and ML2 mechanism driver. AGENT CHANGES: To determine if a vswitch supports vhost-user interfaces, two pieces of information are required: the bridge datapath_type and the list of supported interfaces from the ovsdb. The datapath_type field is required to ensure the node is configured to use the DPDK-enabled netdev datapath. The supported interface types field in the ovsdb contains a list of all supported interface types for all supported datapath_types. If the ovs-vswitchd process has been compiled with support for DPDK interfaces but not started with DPDK enabled, DPDK interfaces will be omitted from this list. The OVS neutron agent will be extended to query the supported interfaces parameter in the ovsdb and append it to the configuration section of the agent state report. The OVS neutron agent will also be extended to append the configured datapath_type to the configuration section of the agent state report. The OVS lib will be extended to retrieve the supported interfaces from the ovsdb. ML2 DRIVER CHANGES: The OVS ML2 driver will be extended to consult the agent configuration when selecting the vif type and vif binding details to install. If the datapath is netdev and the supported interface types contain vhost-user, it will be enabled; in all other cases it will fall back to the current behavior. This mechanism will allow easy extension of the OVS neutron agent to support other OVS interface types in the future if enabled in nova.
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1506127/+subscriptions
--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help : https://help.launchpad.net/ListHelp
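A minimal sketch of the proposed ML2 decision logic, driven by the agent's reported configuration as described above (the report keys and constant names are illustrative, not the actual neutron code):

```python
# Illustrative sketch: choose the VIF type from the agent state report.
# 'datapath_type' and 'supported_interface_types' stand in for whatever keys
# the agent configuration would actually carry.

VIF_TYPE_OVS = "ovs"
VIF_TYPE_VHOST_USER = "vhostuser"

def select_vif_type(agent_config):
    datapath = agent_config.get("datapath_type", "system")
    iface_types = agent_config.get("supported_interface_types", [])
    if datapath == "netdev" and "dpdkvhostuser" in iface_types:
        return VIF_TYPE_VHOST_USER
    # Fall back to the current behaviour for kernel OVS.
    return VIF_TYPE_OVS

print(select_vif_type({"datapath_type": "netdev",
                       "supported_interface_types": ["dpdkvhostuser", "internal"]}))
print(select_vif_type({"datapath_type": "system",
                       "supported_interface_types": ["internal", "patch"]}))
```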
[Yahoo-eng-team] [Bug 1509184] [NEW] Enable openflow based dvr routing for east/west traffic
vm -> destination mac update, TTL decremented -> dest vm (single OpenFlow action) - ICMP reply from dest vm -> destination mac update, TTL decremented -> source vm (single OpenFlow action) Other considerations: - north/south: as OVS cannot look up the destination MAC dynamically via ARP, it is not possible to optimise the north/south path as described above. - Open vSwitch support: this mechanism is compatible with both kernel and DPDK OVS. This mechanism requires the Nicira extensions for ARP rewrite. ARP rewrite can be skipped for broader compatibility if required, as it will fall back to the tap device and kernel. ICMP traffic for the router interface will be handled by the tap device, as OVS currently does not support setting the ICMP type code via the set_field or load OpenFlow actions. - performance: the performance of L3 routing is expected to approach L2 performance for east/west traffic. Performance is not expected to change for north/south.
** Affects: neutron Importance: Undecided Assignee: sean mooney (sean-k-mooney) Status: New
** Tags: rfe
** Changed in: neutron Assignee: (unassigned) => sean mooney (sean-k-mooney)
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1509184
Title: Enable openflow based dvr routing for east/west traffic
Status in neutron: New
Bug description: In the Juno cycle, DVR support was added to neutron to decentralise routing to the compute nodes. This RFE bug proposes the introduction of a new DVR mode (dvr_local_openflow) to optimise the datapath of east/west traffic.
---High level description---
The current implementation of DVR with OVS utilizes Linux network namespaces to instantiate L3 routers, the details of which are described here: http://docs.openstack.org/networking-guide/scenario_dvr_ovs.html Fundamentally, a neutron router comprises 3 elements:
- a router instance (network namespace)
- a router interface (tap device)
- a set of routing rules (kernel IP routes)
In the special case of routing east/west traffic, both the source and destination interfaces are known to neutron. Because of that fact, neutron contains all the information required to logically route traffic from its origin to its destination, enabling the path to be established proactively. This proposal suggests moving the instantiation of the DVR local router from the kernel IP stack to Open vSwitch (OVS) for east/west traffic. Open vSwitch provides a logical programmable interface (OpenFlow) to configure traffic forwarding and modification actions on arbitrary packet streams. When managed by the neutron Open vSwitch L2 agent, OVS operates as a simple MAC-learning switch with limited utilisation of its programmable dataplane. To utilise OVS to create an L3 router, the following mappings from the 3 fundamental elements can be made:
- a router instance (network namespace + an OVS bridge)
- a router interface (tap device + patch port pair)
- a set of routing rules (kernel IP routes + OpenFlow rules)
Background context - TL;DR: a basic explanation of OpenFlow/OVS bridges and patch ports; skip to the implementation section if familiar.
OVS implementation background: In Open vSwitch, at the control layer an OVS bridge is a unique logical domain of interfaces and flow rules. Similarly, at the control layer a patch port pair is a logical entity that interconnects two bridges (or logical domains). From a dataplane perspective, each OVS bridge is first created as a separate instance of a dataplane. If these separate bridges/dataplanes are interconnected by patch ports, OVS will collapse the independent dataplanes into a single OVS dataplane instance. As a direct result of this implementation, a logical topology of 1 bridge with two interfaces is realised at the dataplane level identically to 2 bridges each with 1 interface interconnected by patch ports. This translates to zero dataplane overhead for the creation of multiple bridges, allowing arbitrary numbers of router instances to be created.
OpenFlow capability background: The OpenFlow protocol provides many capabilities, which can generally be summarised as packet match criteria and actions to apply when the criteria are satisfied. In the case of L3 routing, the match criteria of relevance are the Ethernet type and the destination IP address; similarly, the OpenFlow actions required are mod_dest, set_field, move, dec_ttl, output and drop.
Logical packet flow for a ping between two VMs on the same host: in the L2 case, if a VM tries to ping another VM in the same subnet there are 4 stages. - first it will send a broadcast ARP packet to learn the MAC address from the destination IP of th
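A minimal sketch of what the east/west flow rules described above could look like, expressed as ovs-ofctl flow strings built in Python (all MAC/IP/port values are made up, and a real implementation would programme these via the agent rather than a shell call):

```python
# Illustrative sketch: for a known destination IP on an east/west path, a
# single flow can rewrite the destination MAC to the target VM, set the
# source MAC to the router interface, decrement the TTL and output to the
# VM's OpenFlow port.

ROUTER_MAC = "fa:16:3e:00:00:01"
DEST_VM_MAC = "fa:16:3e:00:00:02"
DEST_VM_IP = "10.0.1.4"
DEST_VM_OFPORT = 7

def east_west_flow():
    match = f"ip,nw_dst={DEST_VM_IP}"
    actions = (
        f"mod_dl_src:{ROUTER_MAC},"
        f"mod_dl_dst:{DEST_VM_MAC},"
        "dec_ttl,"
        f"output:{DEST_VM_OFPORT}"
    )
    return f"table=0,priority=100,{match},actions={actions}"

# e.g. ovs-ofctl add-flow br-int "<flow string>"
print(east_west_flow())
```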
[Yahoo-eng-team] [Bug 1815989] Re: OVS drops RARP packets by QEMU upon live-migration causes up to 40s ping pause in Rocky
** Also affects: nova/victoria Importance: Undecided Status: New
** Also affects: nova/train Importance: Undecided Status: New
** Also affects: nova/ussuri Importance: Undecided Status: New
** Also affects: nova/wallaby Importance: Undecided Status: New
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1815989
Title: OVS drops RARP packets by QEMU upon live-migration causes up to 40s ping pause in Rocky
Status in neutron: In Progress
Status in OpenStack Compute (nova): Fix Released
Status in OpenStack Compute (nova) train series: New
Status in OpenStack Compute (nova) ussuri series: New
Status in OpenStack Compute (nova) victoria series: New
Status in OpenStack Compute (nova) wallaby series: New
Status in os-vif: Invalid
Bug description: This issue is well known, and there were previous attempts to fix it, like this one: https://bugs.launchpad.net/neutron/+bug/1414559 This issue still exists in Rocky and gets worse. In Rocky, nova compute, nova libvirt and the neutron ovs agent all run inside containers. So far the only simple fix I have is to increase the number of RARP packets QEMU sends after live-migration from 5 to 10. To be complete, the nova change (not merged) proposed in the above-mentioned activity does not work. I am creating this ticket hoping to get up-to-date (for Rocky and onwards) expert advice on how to fix this in nova-neutron. For the record, below are the timestamps in my test between the neutron ovs agent "activating" the VM port and the RARP packets seen by tcpdump on the compute. 10 RARP packets are sent by (recompiled) QEMU, 7 are seen by tcpdump, and the 2nd-last packet barely made it through.
openvswitch-agent.log:
2019-02-14 19:00:13.568 73453 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-26129036-b514-4fa0-a39f-a6b21de17bb9 - - - - -] Port 57d0c265-d971-404d-922d-963c8263e6eb updated.
Details: {'profile': {}, 'network_qos_policy_id': None, 'qos_policy_id': None, 'allowed_address_pairs': [], 'admin_state_up': True, 'network_id': '1bf4b8e0-9299-485b-80b0-52e18e7b9b42', 'segmentation_id': 648, 'fixed_ips': [ {'subnet_id': 'b7c09e83-f16f-4d4e-a31a-e33a922c0bac', 'ip_address': '10.0.1.4'} ], 'device_owner': u'compute:nova', 'physical_network': u'physnet0', 'mac_address': 'fa:16:3e:de:af:47', 'device': u'57d0c265-d971-404d-922d-963c8263e6eb', 'port_security_enabled': True, 'port_id': '57d0c265-d971-404d-922d-963c8263e6eb', 'network_type': u'vlan', 'security_groups': [u'5f2175d7-c2c1-49fd-9d05-3a8de3846b9c']} 2019-02-14 19:00:13.568 73453 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-26129036-b514-4fa0-a39f-a6b21de17bb9 - - - - -] Assigning 4 as local vlan for net-id=1bf4b8e0-9299-485b-80b0-52e18e7b9b42 tcpdump for rarp packets: [root@overcloud-ovscompute-overcloud-0 nova]# tcpdump -i any rarp -nev tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes 19:00:10.788220 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:11.138216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:11.588216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:12.138217 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:12.788216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:13.538216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:14.388320 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1815989/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1747496] Re: MTUs are not set for VIFs if using kernel ovs + hybrid plug = false
** Changed in: nova Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1747496
Title: MTUs are not set for VIFs if using kernel ovs + hybrid plug = false
Status in OpenStack Compute (nova): Fix Released
Bug description: Description === Over the last few cycles, support for MTUs other than the default of 1500 has generally been improved in both nova and neutron. At the same time it was decided to remove the responsibility for VIF plugging from the virt drivers and centralise it in os-vif. Over the last few cycles os-vif has been enhanced to support setting the MTU on all codepaths, and this work was completed in Pike; however, there are still codepaths in the nova libvirt driver where os-vif is not used to plug the VIF and instead it is done by libvirt. When the VIF_TYPE is ovs and hybrid_plug=False, libvirt plugs the VM's VIFs itself and os-vif is only responsible for creating the bridge it will be plugged into. In this case, as the MTU is not set in the libvirt XML and since os-vif is not responsible for plugging the VIF, nothing sets the MTU on the tap device that is added to OVS. This scenario arises whenever libvirt is the nova virt driver and the no-op or openvswitch security group drivers are used. The end result is that in the VM the guest correctly receives the non-default (e.g. jumbo frame) MTU from the neutron DHCP server and configures the MTU in its kernel, but the MTU of the tap device added to the OVS bridge is left at the default of 1500, preventing jumbo frames from being used by the guest.
Steps to reproduce == Using a host with a non-default MTU, deploy devstack normally using libvirt + kvm/qemu and enable the openvswitch or no-op neutron security group driver:
[[post-config|/etc/neutron/plugins/ml2/ml2_conf.ini]] [securitygroup] firewall_driver = openvswitch
or
[[post-config|/etc/neutron/plugins/ml2/ml2_conf.ini]] [securitygroup] firewall_driver = noop
Spawn a single VM via nova, and retrieve the name of the interface from ovsdb or via virsh dumpxml. Then run ifconfig and check the MTU. Note: if the openvswitch driver is used you will need to allow icmp/ssh in the security groups to be able to validate network connectivity.
Expected result === The tap should have the same MTU as is set on the neutron network, and a ping of the max MTU, e.g. ping -s 9000 ... for a network MTU of 9000, should work.
Actual result = The tap MTU will be 1500; it is not possible to ping the VM with a packet larger than that.
Environment === 1. It was seen on Pike but this affects all versions of OpenStack. Before the introduction of os-vif we did not support neutron network MTUs, and after we started to use os-vif we enabled neutron MTU support only for the os-vif codepath, so this never worked. 2. Which hypervisor did you use? libvirt with kvm. This is not libvirt version specific, as we do not generate the libvirt XML to set the MTU: https://libvirt.org/formatdomain.html#mtu 3. Which storage type did you use? N/A, but I used ceph. 4. Which networking type did you use? neutron with kernel OVS and the noop or openvswitch security group driver. Note: this will not happen with the iptables driver, as that sets hybrid_plug=True so os-vif is used to plug the VIF and it sets the MTU correctly.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1747496/+subscriptions
--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help : https://help.launchpad.net/ListHelp
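libvirt's domain XML supports an explicit per-interface MTU element (see the formatdomain link referenced above); a minimal sketch of adding it to a generated interface definition, using plain ElementTree rather than nova's actual config generation, with a made-up XML fragment:

```python
# Illustrative sketch: inject <mtu size="..."/> into a libvirt <interface>
# element so the tap device gets the network's MTU even when libvirt, not
# os-vif, plugs the VIF. The fragment below is an example, not nova output.
import xml.etree.ElementTree as ET

interface_xml = """
<interface type='bridge'>
  <source bridge='br-int'/>
  <virtualport type='openvswitch'/>
  <model type='virtio'/>
</interface>
"""

def set_interface_mtu(xml_text, mtu):
    iface = ET.fromstring(xml_text)
    mtu_elem = ET.SubElement(iface, "mtu")
    mtu_elem.set("size", str(mtu))
    return ET.tostring(iface, encoding="unicode")

print(set_interface_mtu(interface_xml, 9000))
```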
[Yahoo-eng-team] [Bug 1815989] Re: OVS drops RARP packets by QEMU upon live-migration causes up to 40s ping pause in Rocky
** Changed in: nova/train Status: Fix Released => New
** Also affects: neutron/ussuri Importance: Undecided Assignee: Rodolfo Alonso (rodolfo-alonso-hernandez) Status: In Progress
** Also affects: neutron/wallaby Importance: Undecided Status: New
** Also affects: neutron/train Importance: Undecided Status: New
** Also affects: neutron/victoria Importance: Undecided Status: New
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1815989
Title: OVS drops RARP packets by QEMU upon live-migration causes up to 40s ping pause in Rocky
Status in neutron: In Progress
Status in neutron train series: New
Status in neutron ussuri series: In Progress
Status in neutron victoria series: New
Status in neutron wallaby series: New
Status in OpenStack Compute (nova): Fix Released
Status in OpenStack Compute (nova) train series: New
Status in OpenStack Compute (nova) ussuri series: New
Status in OpenStack Compute (nova) victoria series: New
Status in OpenStack Compute (nova) wallaby series: New
Status in os-vif: Invalid
Bug description: This issue is well known, and there were previous attempts to fix it, like this one: https://bugs.launchpad.net/neutron/+bug/1414559 This issue still exists in Rocky and gets worse. In Rocky, nova compute, nova libvirt and the neutron ovs agent all run inside containers. So far the only simple fix I have is to increase the number of RARP packets QEMU sends after live-migration from 5 to 10. To be complete, the nova change (not merged) proposed in the above-mentioned activity does not work. I am creating this ticket hoping to get up-to-date (for Rocky and onwards) expert advice on how to fix this in nova-neutron. For the record, below are the timestamps in my test between the neutron ovs agent "activating" the VM port and the RARP packets seen by tcpdump on the compute. 10 RARP packets are sent by (recompiled) QEMU, 7 are seen by tcpdump, and the 2nd-last packet barely made it through.
openvswitch-agent.log:
2019-02-14 19:00:13.568 73453 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-26129036-b514-4fa0-a39f-a6b21de17bb9 - - - - -] Port 57d0c265-d971-404d-922d-963c8263e6eb updated.
Details: {'profile': {}, 'network_qos_policy_id': None, 'qos_policy_id': None, 'allowed_address_pairs': [], 'admin_state_up': True, 'network_id': '1bf4b8e0-9299-485b-80b0-52e18e7b9b42', 'segmentation_id': 648, 'fixed_ips': [ {'subnet_id': 'b7c09e83-f16f-4d4e-a31a-e33a922c0bac', 'ip_address': '10.0.1.4'} ], 'device_owner': u'compute:nova', 'physical_network': u'physnet0', 'mac_address': 'fa:16:3e:de:af:47', 'device': u'57d0c265-d971-404d-922d-963c8263e6eb', 'port_security_enabled': True, 'port_id': '57d0c265-d971-404d-922d-963c8263e6eb', 'network_type': u'vlan', 'security_groups': [u'5f2175d7-c2c1-49fd-9d05-3a8de3846b9c']} 2019-02-14 19:00:13.568 73453 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-26129036-b514-4fa0-a39f-a6b21de17bb9 - - - - -] Assigning 4 as local vlan for net-id=1bf4b8e0-9299-485b-80b0-52e18e7b9b42 tcpdump for rarp packets: [root@overcloud-ovscompute-overcloud-0 nova]# tcpdump -i any rarp -nev tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes 19:00:10.788220 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:11.138216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:11.588216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:12.138217 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:12.788216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:13.538216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:14.388320 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1815989/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.n
[Yahoo-eng-team] [Bug 1930706] [NEW] nova allows suboptimal emulator thread pinning for realtime guests
Public bug reported: Today, whenever you use a realtime guest you are required to enable CPU pinning and other features, such as specifying a realtime core mask via hw:cpu_realtime_mask or hw_cpu_realtime_mask. In the Victoria release this requirement was relaxed somewhat with the introduction of the mixed CPU policy for guests that are assigned pinned and floating cores: https://github.com/openstack/nova/commit/9fc63c764429c10f9041e6b53659e0cbd595bf6b It is now possible to allocate all cores in an instance to realtime and omit the ``hw:cpu_realtime_mask`` extra spec. This requires specifying the ``hw:emulator_threads_policy`` extra spec. https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/releasenotes/notes/bug-1884231-16acf297d88b122e.yaml However, while that works well, it is also possible to set hw:cpu_realtime_mask but not specify hw:emulator_threads_policy, which leads to suboptimal XML generation for the libvirt driver. This is reported downstream as https://bugzilla.redhat.com/show_bug.cgi?id=1700390 for older releases that predate the changes referenced above, though on re-evaluation of this a possible improvement can be made, as detailed in https://bugzilla.redhat.com/show_bug.cgi?id=1700390#c11 Today, if we have a 2-core VM where guest CPU 0 is non-realtime and guest CPU 1 is realtime, e.g. hw:cpu_policy=dedicated hw:cpu_realtime=True hw:cpu_realtime_mask=^0, the XML would be generated as follows. This is because the default behavior when no emulator_threads_policy is specified is for the emulator thread to float over all the VM cores. But a slight modification to the XML could be made to have a more optimal default: in this case, using the cpu_realtime_mask we can instead restrict the emulator thread to float over the non-realtime cores with realtime priority. This will ensure that if QEMU needs to process a request for a device attach, for example, the emulator thread has higher priority than the guest vCPUs that deal with guest housekeeping tasks, but will not interrupt the realtime cores. This would give many of the benefits of emulator_threads_policy=share or emulator_threads_policy=isolate without increasing resource usage or requiring any config, flavor or image changes. This should also be a backportable solution to this problem. This is especially important given that realtime hosts are often deployed with the kernel isolcpus parameter, which means that the kernel will not load-balance the emulator thread across the range and will instead leave it on the core it was initially spawned on. Today you could get lucky and it could be spawned on core 0, in which case the new behavior would be the same, or it could get spawned on core 1. When the emulator thread is spawned on core 1, since it has less priority than the vCPU thread, it will only run if the guest vCPU idles, resulting in the inability of QEMU to process device attach and other QEMU monitor commands from libvirt or the user.
** Affects: nova Importance: Wishlist Status: Triaged
** Tags: libvirt numa
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1930706
Title: nova allows suboptimal emulator thread pinning for realtime guests
Status in OpenStack Compute (nova): Triaged
Bug description: Today, whenever you use a realtime guest you are required to enable CPU pinning and other features, such as specifying a realtime core mask via hw:cpu_realtime_mask or hw_cpu_realtime_mask. In the Victoria release this requirement was relaxed somewhat with the introduction of the mixed CPU policy for guests that are assigned pinned and floating cores: https://github.com/openstack/nova/commit/9fc63c764429c10f9041e6b53659e0cbd595bf6b It is now possible to allocate all cores in an instance to realtime and omit the ``hw:cpu_realtime_mask`` extra spec. This requires specifying the ``hw:emulator_threads_policy`` extra spec. https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/releasenotes/notes/bug-1884231-16acf297d88b122e.yaml However, while that works well, it is also possible to set hw:cpu_realtime_mask but not specify hw:emulator_threads_policy, which leads to suboptimal XML generation for the libvirt driver. This is reported downstream as https://bugzilla.redhat.com/show_bug.cgi?id=1700390 for older releases that predate the changes referenced above, though on re-evaluation of this a possible improvement can be made, as detailed in https://bugzilla.redhat.com/show_bug.cgi?id=1700390#c11 Today, if we have a 2-core VM where guest CPU 0 is non-realtime and guest CPU 1 is realtime, e.g. hw:cpu_policy=dedicated hw:cpu_realtime=True hw:cpu_realtime_mask=^0, the XML would be generated as follows. This is because the default behavior when no emulator_threads_policy is specified is for the emulator thread to float over all the VM cores.
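A minimal sketch of the core-set arithmetic described above, parsing a ^-style exclusion mask like hw:cpu_realtime_mask=^0 (standalone and illustrative; it does not use nova's actual mask-parsing helpers and only handles the simple "^N,^M" form):

```python
# Illustrative sketch: given the guest vCPU count and a "^N[,^M...]" realtime
# exclusion mask, compute which vCPUs are realtime and which non-realtime set
# the emulator thread could be restricted to float over by default.

def split_realtime_cores(vcpu_count, realtime_mask):
    excluded = {
        int(token.lstrip("^"))
        for token in realtime_mask.split(",")
        if token.startswith("^")
    }
    all_vcpus = set(range(vcpu_count))
    realtime = all_vcpus - excluded
    non_realtime = all_vcpus & excluded
    return realtime, non_realtime

# 2-core guest with hw:cpu_realtime_mask=^0 -> vCPU 1 is realtime, and the
# emulator thread would float over vCPU 0 only.
realtime, emulator_float_set = split_realtime_cores(2, "^0")
print(realtime, emulator_float_set)  # {1} {0}
```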
[Yahoo-eng-team] [Bug 1929446] Re: OVS polling loop created by ovsdbapp and os-vif starving n-cpu threads
setting to invalid for nova as the error is in the ovs python bindings. marked as triaged for os-vif to track the enhancements proposed in comment 3 above. ** Also affects: ovsdbapp Importance: Undecided Status: New ** Changed in: os-vif Status: New => Triaged ** Changed in: os-vif Importance: Undecided => Medium ** Changed in: nova Status: Triaged => Invalid ** Changed in: os-vif Assignee: (unassigned) => sean mooney (sean-k-mooney) -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1929446 Title: OVS polling loop created by ovsdbapp and os-vif starving n-cpu threads Status in OpenStack Compute (nova): Invalid Status in os-vif: Triaged Status in ovsdbapp: New Bug description: I've been seeing lots of failures caused by timeouts in test_volume_backed_live_migration during the live-migration and multinode grenade jobs, for example: https://zuul.opendev.org/t/openstack/build/bb6fd21b5d8c471a89f4f6598aa84e5d/logs During check_can_live_migrate_source I'm seeing the following gap in the logs that I can't explain: 12225 May 24 10:23:02.637600 ubuntu-focal-inap-mtl01-0024794054 nova-compute[107012]: DEBUG nova.virt.libvirt.driver [None req-b5288b85-d642-426f-a525-c64724fe4091 tempest-LiveMigrationTest-312230369 tempest-LiveMigrationTest-312230369-project-admin] [instance: 91a0e0ca-e6a8-43ab-8e68-a10a77ad615b] Check if temp file /opt/stack/data/nova/instances/tmp5lcmhuri exists to indicate shared storage is being used for migration. Exists? False {{(pid=107012) _check_shared_storage_test_file /opt/stack/nova/nova/virt/libvirt/driver.py:9367}} [..] 12282 May 24 10:24:22.385187 ubuntu-focal-inap-mtl01-0024794054 nova-compute[107012]: DEBUG nova.virt.libvirt.driver [None req-b5288b85-d642-426f-a525-c64724fe4091 tempest-LiveMigrationTest-312230369 tempest-LiveMigrationTest-312230369-project-admin] skipping disk /dev/sdb (vda) as it is a volume {{(pid=107012) _get_instance_disk_info_from_config /opt/stack/nova/nova/virt/libvirt/driver.py:10458}} ^ this leads to both the HTTP request to live migrate (that's still a synchronous call at this point [1]) *and* the RPC call from the dest to the source both timing out. [1] https://docs.openstack.org/nova/latest/reference/live- migration.html To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1929446/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1933097] [NEW] libvirt machine types are case sensitive but we do not validate them in nova
Public bug reported: As seen in http://paste.openstack.org/show/806818/, if we use machine type "Q35" instead of "q35" we fail to boot the VM. This is because the machine type names in libvirt are case sensitive. However, due to the way libvirt validates the XML, it returns a "No PCI buses available" error instead of an "incorrect machine type" error or similar that would be more intuitive.
2021-06-20 02:37:39.795 7 ERROR nova.virt.libvirt.guest [req-04cb6169-bee4-407d-aef1-2e22abfccf97 329bf2535969456cb83fbc8e338ecb4c 5f3ea501afce4858b43186166d4d7afb - default default] Error defining a guest with XML: [tag-stripped domain XML for instance-0006 / e2f47fae-7684-4f23-9f3e-39a6b133f929 omitted] : libvirt.libvirtError: XML error: No PCI buses available
Since the libvirt machine types are case sensitive, we cannot assume we can just lowercase the user's input, but we should still be able to normalise the machine types in the following way: on startup we call virsh capabilities to retrieve info from libvirt regarding the capabilities of the host; from that API we can retrieve the set of supported machine types. We can then construct a dictionary of lower-case machine type names to correct-case machine type names. When booting a VM we should lowercase the user input and look up the correct case from this dictionary. This will allow nova to continue to treat the input as case insensitive but still pass the correct value to libvirt.
** Affects: nova Importance: Low Status: Triaged
** Tags: libvirt
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1933097
Title: libvirt machine types are case sensitive but we do not validate them in nova
Status in OpenStack Compute (nova): Triaged
Bug description: As seen in http://paste.openstack.org/show/806818/, if we use machine type "Q35" instead of "q35" we fail to boot the VM. This is because the machine type names in libvirt are case sensitive. However, due to the way libvirt validates the XML, it returns a "No PCI buses available" error instead of an "incorrect machine type" error or similar that would be more intuitive.
2021-06-20 02:37:39.795 7 ERROR nova.virt.libvirt.guest [req-04cb6169-bee4-407d-aef1-2e22abfccf97 329bf2535969456cb83fbc8e338ecb4c 5f3ea501afce4858b43186166d4d7afb - default default] Error defining a guest with XML: [tag-stripped domain XML for instance-0006 / e2f47fae-7684-4f23-9f3e-39a6b133f929 omitted] : libvirt.libvirtError: XML error: No PCI buses available
Since the libvirt machine types are case sensitive, we cannot assume we can just lowercase the user's input, but we should still be able to normalise the machine types in the following way: on startup we call virsh capabilities to retrieve info from libvirt regarding the capabilities of the host; from that API we can retrieve the set of supported machine types. We can then construct a dictionary of lower-case machine type names to correct-case machine type names. When booting a VM we should lowercase the user input and look up the correct case from this dictionary. This will allow nova to continue to treat the input as case insensitive but still pass the correct value to libvirt.
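A minimal sketch of the normalisation described above, using a plain dict built from whatever machine-type list the host capabilities report (illustrative; not nova's actual capability-loading code, and the supported list is a stand-in):

```python
# Illustrative sketch: build a lowercase -> canonical-case lookup table from
# the machine types reported by the host, then normalise user input through it.

def build_machine_type_map(supported_machine_types):
    return {mt.lower(): mt for mt in supported_machine_types}

def normalise_machine_type(user_value, machine_type_map):
    try:
        return machine_type_map[user_value.lower()]
    except KeyError:
        raise ValueError(f"unsupported machine type: {user_value}") from None

supported = ["pc", "q35", "pc-q35-6.0"]  # stand-in for virsh capabilities data
mt_map = build_machine_type_map(supported)
print(normalise_machine_type("Q35", mt_map))  # -> "q35"
```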
[Yahoo-eng-team] [Bug 1933517] Re: [RFE][OVN] Create an intermediate OVS bridge between VM and integration bridge to improve the live-migration process
** Also affects: os-vif Importance: Undecided Status: New
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1933517
Title: [RFE][OVN] Create an intermediate OVS bridge between VM and integration bridge to improve the live-migration process
Status in neutron: New
Status in os-vif: New
Bug description: When live migrating network-sensitive VMs, the communication is broken. This is similar to [1], but in OVN the vif-plugged events are directly controlled by the Neutron server, not by the OVS/DHCP agents. The problem lies in when the destination chassis creates the needed OF rules for the destination VM port. Same as in OVS, the VM port is created when the instance is unpaused. At this moment the VM continues sending packets through the interface, but OVN hasn't finished the configuration.
Related BZs: - OSP16.1: https://bugzilla.redhat.com/show_bug.cgi?id=1903653 - OSP16.1: https://bugzilla.redhat.com/show_bug.cgi?id=1872937 - OSP16.1: https://bugzilla.redhat.com/show_bug.cgi?id=1966512
[1] https://bugs.launchpad.net/neutron/+bug/1901707
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1933517/+subscriptions
--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1860312] Re: compute service failed to delete
Actually, the operator would be deleting the compute service after removing the compute nodes. You should remove the compute service first, but we should fix this regardless. You should be able to recreate this bug by just creating a compute service and then deleting it.
** Changed in: nova Status: Expired => Triaged
** Changed in: nova Importance: Undecided => Medium
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1860312
Title: compute service failed to delete
Status in OpenStack Compute (nova): Triaged
Bug description: Description === I deployed OpenStack with openstack-helm on Kubernetes. When one of the nova-compute services (driver=ironic, replica of the deployment is 1) breaks down, it may be rescheduled to another node by Kubernetes. When I try to delete the old compute service (status down), it fails.
Steps to reproduce == Firstly, OpenStack was deployed in a Kubernetes cluster, and the replica count of nova-compute-ironic is 1. * I deleted the pod nova-compute-ironic-x * then waited for the new pod to start * then ran openstack compute service list; there will be two compute services for ironic, and the status of the old one will be down * then I tried to delete the old compute service
Expected result === The old compute service could be deleted successfully.
Actual result = The delete failed and returned an HTTP 500.
Environment === 1. Exact version of OpenStack you are running: 18.2.2, rocky 2. Which hypervisor did you use? Libvirt + KVM 3. Which storage type did you use? ceph 4. Which networking type did you use? Neutron with OpenVSwitch
Logs & Configs ==
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi [req-922cc601-9aa1-4c3d-ad9c-71f73a341c28 40e7b8c3d59943e08a52acd24fe30652 d13f1690c08d41ac854d720ea510a710 - default default] Unexpected exception in API method: ComputeHostNotFound: Compute host mgt-slave03 could not be found.
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi Traceback (most recent call last):
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/api/openstack/wsgi.py", line 801, in wrapped
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi return f(*args, **kwargs)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/api/openstack/compute/services.py", line 252, in delete
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi context, service.host)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 184, in wrapper
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi result = fn(cls, context, *args, **kwargs)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/objects/compute_node.py", line 443, in get_all_by_host
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi use_slave=use_slave)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 213, in wrapper
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi return f(*args, **kwargs)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/objects/compute_node.py", line 438, in _db_compute_node_get_all_by_host
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi return db.compute_node_get_all_by_host(context, host)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/db/api.py", line 291, in compute_node_get_all_by_host
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi return IMPL.compute_node_get_all_by_host(context, host)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 258, in wrapped
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi return f(context, *args, **kwargs)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 659, in compute_node_get_all_by_host
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi raise exception.ComputeHostNotFound(host=host)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi ComputeHostNotFound: Compute host mgt-slave03 could not be found.
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi
2020-01-20 06:44:53.480 1
[Yahoo-eng-team] [Bug 1934742] Re: nova may leak net interface in guest if port under attaching/deleting
** Also affects: neutron Importance: Undecided Status: New
--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1934742
Title: nova may leak net interface in guest if port under attaching/deleting
Status in neutron: New
Status in OpenStack Compute (nova): In Progress
Bug description: Description === It seems that nova may leak a network interface in the guest if a port deletion is run in the middle of a port attachment. In the compute manager, attach_interface runs the following tasks: - update port in neutron (binding) - ... - driver.attach_interface() - update net_info_cache - ... When a bound port is deleted, nova receives a "network-vif-deleted" event and processes it by running _process_instance_vif_deleted_event(), which calls driver.detach_interface(). If this event processing is done just after port binding and before driver.attach_interface() of an ongoing interface attachment of the same port, nova will attach the deleted orphan interface to the guest. Probably this event processing must be synchronized with the compute manager methods attach_interface/detach_interface.
Steps to reproduce == On master devstack:
$ openstack server create --flavor m1.small --image cirros-0.5.2-x86_64-disk --nic net-id=private myvm
$ openstack port create --network private myport
# For ease of reproduction, add a pause just before driver.attach_interface() in nova/compute/manager.py, def attach_interface(): try: time.sleep(8) self.driver.attach_interface(context, ...)
$ sudo service devstack@n-cpu restart
$ openstack server add port myvm myport &
$ sleep 4 ; openstack port delete myport
[1]+ Exit 1 openstack server add port myvm myport
Port id 3d47bceb-34ef-4002-8e33-30957127a87f could not be found. (HTTP 404) (Request-ID: req-6c056ad3-1e61-4102-9e5e-48cdd4dffc43)
$ nova interface-list alex
| Port State | Port ID | Net ID | IP addresses | MAC Addr | Tag |
| ACTIVE | 0fe9365b-5747-4532-be50-e6362b10b645 | d8f03257-d1e2-4488-bc42-0e189481a6c7 | 10.0.0.49,fde5:2b4:b028:0:f816:3eff:feb8:f14c | fa:16:3e:b8:f1:4c | - |
$ virsh domiflist instance-0001
Interface Type Source Model MAC
tap0fe9365b-57 bridge br-int virtio fa:16:3e:b8:f1:4c
tapdcbbae72-0b bridge br-int virtio fa:16:3e:95:91:25
Expected result === The interface should not be attached to the guest.
Actual result = A zombie interface is attached to the guest.
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1934742/+subscriptions
--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help : https://help.launchpad.net/ListHelp
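A minimal sketch of the synchronization idea suggested in the report, using a plain per-port lock (purely illustrative; nova's real code paths and method signatures are not reproduced here):

```python
# Illustrative sketch: serialise attach_interface and the handling of a
# network-vif-deleted event for the same port, so the event handler cannot
# run in the middle of an in-flight attach for that port.
import threading
from collections import defaultdict

_port_locks = defaultdict(threading.Lock)

def attach_interface(port_id, do_attach):
    with _port_locks[port_id]:
        # bind port, plug VIF, update the network info cache...
        do_attach(port_id)

def handle_vif_deleted_event(port_id, do_detach):
    with _port_locks[port_id]:
        # Runs either before or after a whole attach, never in the middle.
        do_detach(port_id)
```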
[Yahoo-eng-team] [Bug 1940555] Re: Compute Component: Error: (pymysql.err.ProgrammingError) (1146, "Table 'nova_api.cell_mappings' doesn't exist")
** Also affects: nova Importance: Undecided Status: New ** Changed in: nova Status: New => Triaged ** Changed in: nova Importance: Undecided => Critical ** Changed in: nova Assignee: (unassigned) => sean mooney (sean-k-mooney) -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1940555 Title: Compute Component: Error: (pymysql.err.ProgrammingError) (1146, "Table 'nova_api.cell_mappings' doesn't exist") Status in OpenStack Compute (nova): Triaged Status in tripleo: Triaged Bug description: https://logserver.rdoproject.org/openstack-component- compute/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci- centos-8-standalone-compute- master/7dac4e0/logs/undercloud/var/log/extra/podman/containers/nova_db_sync/stdout.log.txt.gz Is [api_database]/connection set in nova.conf? Is the cell0 database connection URL correct? Error: (pymysql.err.ProgrammingError) (1146, "Table 'nova_api.cell_mappings' doesn't exist") [SQL: SELECT cell_mappings.created_at AS cell_mappings_created_at, cell_mappings.updated_at AS cell_mappings_updated_at, cell_mappings.id AS cell_mappings_id, cell_mappings.uuid AS cell_mappings_uuid, cell_mappings.name AS cell_mappings_name, cell_mappings.transport_url AS cell_mappings_transport_url, cell_mappings.database_connection AS cell_mappings_database_connection, cell_mappings.disabled AS cell_mappings_disabled FROM cell_mappings WHERE cell_mappings.uuid = %(uuid_1)s LIMIT %(param_1)s] [parameters: {'uuid_1': '----', 'param_1': 1}] (Background on this error at: http://sqlalche.me/e/14/f405) https://logserver.rdoproject.org/openstack-component-compute/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-standalone-compute-master/7dac4e0/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz + echo 'Running command: '\''/usr/bin/bootstrap_host_exec nova_conductor su nova -s /bin/bash -c '\''/usr/bin/nova-manage db sync '\'''\''' + exec /usr/bin/bootstrap_host_exec nova_conductor su nova -s /bin/bash -c ''\''/usr/bin/nova-manage' db sync \' 2021-08-19 08:17:33.982762 | fa163e06-c6d2-5dfd-0459-197e | FATAL | Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_3 | standalone | error={"changed": false, "msg": "Failed containers: nova_api_db_sync, nova_api_map_cell0, nova_api_ensure_default_cell, nova_db_sync"} 2021-08-19 08:17:33.983320 | fa163e06-c6d2-5dfd-0459-197e | TIMING | tripleo_container_manage : Create containers managed by Podman for /var/lib/tripleo-config/container-startup-config/step_3 | standalone | 0:19:23.159835 | 41.20s To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1940555/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
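For context on what the error means: the nova_api schema has to exist before cell0 can be mapped and before "nova-manage db sync" can look up cell mappings. A minimal sketch of the general ordering is below; it is illustrative only (the connection URL is a placeholder) and is not the tripleo fix itself.

    # api database schema first, then cell mappings, then the cell database
    nova-manage api_db sync
    nova-manage cell_v2 map_cell0 --database_connection mysql+pymysql://nova:secret@db/nova_cell0
    nova-manage cell_v2 create_cell --name cell1 --verbose
    nova-manage db sync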
[Yahoo-eng-team] [Bug 1944111] Re: Missing __init__.py in nova/db/api
setting to critical since this blocks packaging of xena ** Also affects: nova/xena Importance: Critical Status: In Progress ** Also affects: nova/yoga Importance: Undecided Status: New ** Changed in: nova/yoga Status: New => In Progress ** Changed in: nova/yoga Importance: Undecided => Critical -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1944111 Title: Missing __init__.py in nova/db/api Status in OpenStack Compute (nova): In Progress Status in OpenStack Compute (nova) xena series: In Progress Status in OpenStack Compute (nova) yoga series: In Progress Bug description: Looks like nova/db/api is missing an __init__.py, which breaks *at least* my Debian packaging. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1944111/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1944083] Re: Nova assumptions about /32 routes to NS' break name resolution under DHCP
I'm not sure that nova is in control of this; this seems like an issue more likely with DHCP. I don't think nova actually sets /32 routes for the gateways itself.

** Also affects: neutron Importance: Undecided Status: New

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1944083

Title: Nova assumptions about /32 routes to NS' break name resolution under DHCP

Status in neutron: New
Status in OpenStack Compute (nova): New

Bug description: We run designate out of a private VLAN which is accessible via one of the two external networks in our wallaby cloud. In order to permit instances name resolution via those endpoints, we add a route to the subnet in that private VLAN via the 2nd router added to each network, the external network of which is our OutsidePrivate net (the External network which resides inside the DC, vs our OutsidePublic which is a VLAN to the actual WAN).

Unfortunately, despite setting up this 2nd router and explicit route, we see nova instances coming up with an explicit /32 route to each DNS server specified _via the .1 gateway_ in the network which is the router to OutsidePrivate, and despite an explicit route to the /24 (i know CIDR works in smallest subnet preference) which should be understood to encapsulate the 3 IPs of the NS' themselves and prevent the /32 routes from being created. Even setting explicit /32 routes to each NS via the 2nd gateway @ .2 doesn't work - the original /32's via the .1 are still present, and the only fix we've found is to force nodes to static addressing and routing via cloud-init. ICMP redirect from the primary gateway to the secondary is hit-or-miss, and not how this should work anyway.

I've not found anything in the docs about how these default routes via the primary gateway are set up, and have therefore found no way to disable them, so I'm filing this as a bug since it's a major impediment to anyone resolving names via any gateway but the one set as the default gateway for the network.

To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1944083/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
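To make the DHCP angle concrete: the classless static routes pushed to instances come from the subnet's host_routes attribute in neutron, which the DHCP agent hands out as option 121. A sketch of adjusting them (addresses and names are placeholders, not a confirmed fix for this report):

    # push an explicit route toward the DNS subnet via the second gateway
    openstack subnet set \
        --host-route destination=192.0.2.0/24,gateway=10.0.0.2 \
        my-subnet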
[Yahoo-eng-team] [Bug 1945646] Re: Nova fails to live migrate instance with upper-case port MAC
adding neutron as i think neutron shoudl also be normaliasing the mac adress that users provide and alwasy storign it in lower case. a mac is technically a number not a string we just use hex encoding for human readablity so the caseing does not matter but it would be nice to at least consider moving this normalisation to the neutron api/db to avoid this problem. ** Also affects: neutron Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1945646 Title: Nova fails to live migrate instance with upper-case port MAC Status in neutron: New Status in OpenStack Compute (nova): In Progress Bug description: Description === When neutron port has MAC address defined in upper case and libvirt stores MAC in XML in lower case, migration is failed with KeyError: ``` Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: 2021-09-30 10:31:38.028 3054313 ERROR nova.virt.libvirt.driver [req-911a4b70-5448-48a1-afa4-1bbd0b38737b - - - - -] [instance: 75f7 9d85-6505-486c-bc34-e78fd6350a77] Live Migration failure: '00:50:56:af:e1:73' Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: Traceback (most recent call last): Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: File "/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 461, in fire_timers Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: timer() Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: File "/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/eventlet/hubs/timer.py", line 59, in __call__ Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: cb(*args, **kw) Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: File "/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/eventlet/event.py", line 175, in _do_send Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: waiter.switch(result) Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: File "/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/eventlet/greenthread.py", line 221, in main Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: result = function(*args, **kwargs) Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: File "/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/nova/utils.py", line 661, in context_wrapper Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: return func(*args, **kwargs) Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: File "/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 9196, in _live_migration_operation Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: LOG.error("Live Migration failure: %s", e, instance=instance) Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: File "/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/oslo_utils/excutils.py", line 220, in __exit__ Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: self.force_reraise() Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: File "/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/oslo_utils/excutils.py", line 196, in force_reraise Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: six.reraise(self.type_, self.value, self.tb) Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: File "/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/six.py", line 703, in reraise Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: raise value Sep 30 10:31:38 
cc-compute08-dx1 nova-compute[3054313]: File "/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 9152, in _live_migration_operation Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: new_xml_str = libvirt_migrate.get_updated_guest_xml( Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: File "/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/nova/virt/libvirt/migration.py", line 65, in get_updated_guest_xml Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: xml_doc = _update_vif_xml(xml_doc, migrate_data, get_vif_config) Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: File "/openstack/venvs/nova-22.3.1/lib/python3.8/site-packages/nova/virt/libvirt/migration.py", line 355, in _update_vif_xml Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: migrate_vif = migrate_vif_by_mac[mac_addr] Sep 30 10:31:38 cc-compute08-dx1 nova-compute[3054313]: KeyError: '00:50:56:af:e1:73' ``` Environment === Ubuntu 20.04 Libvirt 6.0.0-0ubuntu8.14 Nova 22.2.3.dev2 (sha 4ce01d6c49f81b6b2438549b01a89ea1b5956320) Neutron with OpenVSwitch To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1945646/+subscriptions -- Mailing lis
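The failure above is purely a case-sensitivity mismatch between the MAC neutron returns and the MAC in the libvirt XML. A minimal sketch of the obvious normalisation (illustrative only, not the exact nova patch):

    # build the lookup keyed on lower-cased MACs and lower-case the MAC read
    # from the libvirt XML before the lookup, so '00:50:56:AF:E1:73' and
    # '00:50:56:af:e1:73' compare equal
    def build_vif_lookup(migrate_vifs):
        return {vif['address'].lower(): vif for vif in migrate_vifs}

    def find_migrate_vif(migrate_vifs, xml_mac_addr):
        return build_vif_lookup(migrate_vifs)[xml_mac_addr.lower()]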
[Yahoo-eng-team] [Bug 1943969] Re: Unable to use shared security groups for VM creation
This is an RFE not a bug. This should be addressed via a specless blueprint as it is a new capablity. ** Changed in: nova Status: In Progress => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1943969 Title: Unable to use shared security groups for VM creation Status in OpenStack Compute (nova): Invalid Bug description: Description === Nova does not support shared security groups for new virtual mashines. It happens because Nova filters security groups by tenant ID here https://github.com/openstack/nova/blob/master/nova/network/neutron.py#L813 Steps to reproduce == * create two projects A and B * in project A create security group in Neutron * share the security group to project B via RBAC (https://docs.openstack.org/neutron/latest/admin/config-rbac.html#sharing-a-security-group-with-specific-projects) * try to create VM with this security group in project B Expected result === The VM should be created if security group shared to this project. Actual result = The error in logs: Traceback (most recent call last): File "/nova-base-source/nova-base-archive-stable-rocky-m3/nova/compute/manager.py", line 2510, in _build_resources yield resources File "/nova-base-source/nova-base-archive-stable-rocky-m3/nova/compute/manager.py", line 2271, in _build_and_run_instance block_device_info=block_device_info) File "/nova-base-source/nova-base-archive-stable-rocky-m3/nova/virt/vmwareapi/driver.py", line 505, in spawn admin_password, network_info, block_device_info) File "/nova-base-source/nova-base-archive-stable-rocky-m3/nova/virt/vmwareapi/vmops.py", line 1175, in spawn vm_folder) File "/nova-base-source/nova-base-archive-stable-rocky-m3/nova/virt/vmwareapi/vmops.py", line 342, in build_virtual_machine vm_name=vm_name) File "/nova-base-source/nova-base-archive-stable-rocky-m3/nova/virt/vmwareapi/vmops.py", line 311, in _get_vm_config_spec network_info) File "/nova-base-source/nova-base-archive-stable-rocky-m3/nova/virt/vmwareapi/vif.py", line 187, in get_vif_info for vif in network_info: File "/nova-base-source/nova-base-archive-stable-rocky-m3/nova/network/model.py", line 585, in __iter__ return self._sync_wrapper(fn, *args, **kwargs) File "/nova-base-source/nova-base-archive-stable-rocky-m3/nova/network/model.py", line 576, in _sync_wrapper self.wait() File "/nova-base-source/nova-base-archive-stable-rocky-m3/nova/network/model.py", line 608, in wait self[:] = self._gt.wait() File "/var/lib/kolla/venv/lib/python2.7/site-packages/eventlet/greenthread.py", line 175, in wait return self._exit_event.wait() File "/var/lib/kolla/venv/lib/python2.7/site-packages/eventlet/event.py", line 125, in wait current.throw(*self._exc) File "/var/lib/kolla/venv/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main result = function(*args, **kwargs) File "/nova-base-source/nova-base-archive-stable-rocky-m3/nova/utils.py", line 828, in context_wrapper return func(*args, **kwargs) File "/nova-base-source/nova-base-archive-stable-rocky-m3/nova/compute/manager.py", line 1656, in _allocate_network_async six.reraise(*exc_info) File "/nova-base-source/nova-base-archive-stable-rocky-m3/nova/compute/manager.py", line 1639, in _allocate_network_async bind_host_id=bind_host_id) File "/nova-base-source/nova-base-archive-stable-rocky-m3/nova/network/neutronv2/api.py", line 1043, in allocate_for_instance instance, neutron, security_groups) File 
"/nova-base-source/nova-base-archive-stable-rocky-m3/nova/network/neutronv2/api.py", line 830, in _process_security_groups security_group_id=security_group) SecurityGroupNotFound: Security group 0c649378-1cf8-48e0-9eb4-b72772c35a62 not found. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1943969/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1954427] Re: nova-ceph-multistore job fails permanently with: Cannot uninstall 'logutils'
** Also affects: devstack-plugin-ceph Importance: Undecided Status: New ** Changed in: devstack Status: New => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1954427 Title: nova-ceph-multistore job fails permanently with: Cannot uninstall 'logutils' Status in devstack: Invalid Status in devstack-plugin-ceph: New Status in OpenStack Compute (nova): New Bug description: The last 4 run[2] failed with the same issue[1]: 2021-12-10 10:42:59.793429 | controller | Attempting uninstall: logutils 2021-12-10 10:42:59.793490 | controller | Found existing installation: logutils 0.3.3 2021-12-10 10:42:59.793500 | controller | ERROR: Cannot uninstall 'logutils'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall. 2021-12-10 10:43:00.083297 | controller | + inc/python:pip_install:1 : exit_trap [1] https://zuul.opendev.org/t/openstack/build/722c6caf8e454849b897a43bcf617dd2/log/job-output.txt#9419 [2] https://zuul.opendev.org/t/openstack/builds?job_name=nova-ceph-multistore&project=openstack/nova To manage notifications about this bug go to: https://bugs.launchpad.net/devstack/+bug/1954427/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1959682] Re: String concatenation TypeError in resize flavor helper
Setting this to Invalid since the bug is in tempest. It is currently blocking the nova-next job, so it is a gate blocker for nova until this is fixed in tempest. As there are already 2 patches up to fix this we expect it to be resolved soon, so we will just close the nova part for now.

** Changed in: nova Status: New => Invalid

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1959682

Title: String concatenation TypeError in resize flavor helper

Status in OpenStack Compute (nova): Invalid
Status in tempest: In Progress

Bug description: In cae966812, for certain resize tests, we started adding a numeric ID to the new flavor name to avoid collisions. This was incorrectly done as a string + int concatenation, which is raising a `TypeError: can only concatenate str (not "int") to str`. Example of this happening in the nova-next job: https://zuul.opendev.org/t/openstack/build/7f750faf22ec48219ddd072cfe6e02e1/logs

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1959682/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1960247] Re: server suspend action allows authorization by user_id while server resume action does not
Ack, I kind of agree with gmann here. gmann is correct that this does not align with the direction we are moving in with our new policy/RBAC work, and that our intent was to eventually remove it outside of keypairs. The spec linked above clearly states what our intentions were and the endpoints on which it could be used. As such I'm going to update this to invalid, but we can continue this conversation on the mailing list, IRC or in the nova team meeting.

** Changed in: nova Status: In Progress => Opinion

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1960247

Title: server suspend action allows authorization by user_id while server resume action does not

Status in OpenStack Compute (nova): Opinion

Bug description:

Description === Since the following change was merged, nova allows authorization by user_id for the server suspend action. https://review.opendev.org/c/openstack/nova/+/353344 However the same is not yet implemented in the resume action and this results in inconsistent policy rules for the corresponding two operations.

Steps to reproduce ==
* Define policy rules like the following example
  "os_compute_api:os-suspend-server:suspend": "rule:admin_api or user_id:%(user_id)s"
  "os_compute_api:os-suspend-server:resume": "rule:admin_api or user_id:%(user_id)s"
* Create a server by a non-admin user
* Suspend the server by the user
* Resume the server by the user

Expected result === Both suspend and resume are accepted

Actual result = Only suspend is accepted and resume fails with ERROR (Forbidden): Policy doesn't allow os_compute_api:os-suspend-server:suspend to be performed. (HTTP 403) (Request-ID: req-...)

Environment === This issue was initially reported as one found in a stable/xena deployment. http://lists.openstack.org/pipermail/openstack-discuss/2022-February/027078.html

Logs & Configs == N/A

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1960247/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1964149] [NEW] nova dns lookups can block the nova api process leading to 503 errors.
Public bug reported: we currently have 4 possibly related downstream bugs whereby DNS lookups can result in 503 errors as we do not monkey patch green DNS and that can result in blocking behavior. specifically we have seen callses to socket.getaddrinfo in py-amqp block the API when using ipv6. https://bugzilla.redhat.com/show_bug.cgi?id=2037690 https://bugzilla.redhat.com/show_bug.cgi?id=2050867 https://bugzilla.redhat.com/show_bug.cgi?id=2051631 https://bugzilla.redhat.com/show_bug.cgi?id=2056504 copying a summary of the rca from one of the bugs What happens: - A request comes in which requires rpc, so a new connection to rabbitmq is to be established - The hostname(s) from the transport_url setting are ultimately passed to py-amqp, which attempts to resolve the hostname to an ip address so it can set up the underlying socket and connect - py-amqp explicitly tries to resolve with AF_INET first and then only if that fails, then it tries with AF_INET6[1] - The customer environment is primarily IPv6. Attempting to resolve the hostname via AF_INET fails nss_hosts (the /etc/hosts file only have IPv6 addrs), and falls through to nss_dns - Something about the customer DNS infrastructure is slow, so it takes a long time (~10 seconds) for this IPv4-lookup to fail. - py-amqp finally tries with AF_INET6 and the hostname is resolved immediately via nss_hosts because the entry is in the /etc/hosts Critically, because nova explicitly disables greendns[2] with eventlet, the *entire* nova-api worker is blocked during the duration of the slow name resolution, because socket.getaddrinfo is a blocking call into glibc. [1] https://github.com/celery/py-amqp/blob/1f599c7213b097df07d0afd7868072ff9febf4da/amqp/transport.py#L155-L208 [2] https://github.com/openstack/nova/blob/master/nova/monkey_patch.py#L25-L36 nova currently disables greendns monkeypatch because of a very old bug on centos 6 on python 2.6 and the havana release of nova https://bugs.launchpad.net/nova/+bug/1164822 ipv6 support was added in v0.17 in the same release that added python 3 support back in 2015 https://github.com/eventlet/eventlet/issues/8#issuecomment-75490457 so we should not need to work around the lack of ipv6 support anymore. https://review.opendev.org/c/openstack/nova/+/830966 ** Affects: nova Importance: Medium Assignee: sean mooney (sean-k-mooney) Status: Triaged ** Tags: api yoga-rc-potential ** Changed in: nova Importance: Undecided => Medium ** Changed in: nova Status: New => Triaged ** Changed in: nova Assignee: (unassigned) => sean mooney (sean-k-mooney) ** Tags added: api yoga-rc-potential -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1964149 Title: nova dns lookups can block the nova api process leading to 503 errors. Status in OpenStack Compute (nova): Triaged Bug description: we currently have 4 possibly related downstream bugs whereby DNS lookups can result in 503 errors as we do not monkey patch green DNS and that can result in blocking behavior. specifically we have seen callses to socket.getaddrinfo in py-amqp block the API when using ipv6. 
https://bugzilla.redhat.com/show_bug.cgi?id=2037690 https://bugzilla.redhat.com/show_bug.cgi?id=2050867 https://bugzilla.redhat.com/show_bug.cgi?id=2051631 https://bugzilla.redhat.com/show_bug.cgi?id=2056504 copying a summary of the rca from one of the bugs What happens: - A request comes in which requires rpc, so a new connection to rabbitmq is to be established - The hostname(s) from the transport_url setting are ultimately passed to py-amqp, which attempts to resolve the hostname to an ip address so it can set up the underlying socket and connect - py-amqp explicitly tries to resolve with AF_INET first and then only if that fails, then it tries with AF_INET6[1] - The customer environment is primarily IPv6. Attempting to resolve the hostname via AF_INET fails nss_hosts (the /etc/hosts file only have IPv6 addrs), and falls through to nss_dns - Something about the customer DNS infrastructure is slow, so it takes a long time (~10 seconds) for this IPv4-lookup to fail. - py-amqp finally tries with AF_INET6 and the hostname is resolved immediately via nss_hosts because the entry is in the /etc/hosts Critically, because nova explicitly disables greendns[2] with eventlet, the *entire* nova-api worker is blocked during the duration of the slow name resolution, because socket.getaddrinfo is a blocking call into glibc. [1] https://github.com/celery/py-amqp/blob/1f599c7213b097df07d0afd7868072ff9febf4da/amqp/transport.py#L155-L208 [2] https://github.com/openstack/nova/blob/master/nova/monkey_patch.py#L25-L36 nova currently disables greendns monkeypatch because of a very old bug on centos 6 on pytho
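A minimal sketch of the direction the linked review takes, assuming the description above: stop disabling greendns before eventlet is imported so that socket.getaddrinfo becomes cooperative and a slow DNS lookup only blocks the greenthread making it, not the whole API worker. This is illustrative only, not the exact nova patch.

    import os

    # nova's monkey_patch module currently does roughly the equivalent of the
    # commented-out line below before eventlet is imported, which disables
    # greendns and turns getaddrinfo into a blocking glibc call:
    # os.environ['EVENTLET_NO_GREENDNS'] = 'yes'

    import eventlet
    # with greendns left enabled, DNS resolution yields to other greenthreads
    eventlet.monkey_patch()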
[Yahoo-eng-team] [Bug 1801919] Re: brctl is obsolete use ip
It's not released yet, but I'll be releasing os-vif today. It looks like the rule for setting Fix Released is wrong; it should only be set when we release that commit in a tagged release on PyPI / tarballs.openstack.org.

** Changed in: os-vif Status: Fix Released => Fix Committed

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1801919

Title: brctl is obsolete use ip

Status in devstack: In Progress
Status in neutron: Fix Released
Status in OpenStack Compute (nova): Confirmed
Status in os-vif: Fix Committed

Bug description: bridge-utils (brctl) is obsolete; no modern software should depend on it. Used in: neutron/agent/linux/bridge_lib.py http://man7.org/linux/man-pages/man8/brctl.8.html Please use `ip` for basic bridge operations, then we can drop one obsolete dependency.

To manage notifications about this bug go to: https://bugs.launchpad.net/devstack/+bug/1801919/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1822575] [NEW] lower-constraints are not used in gate job
Public bug reported: the lower-constraints tox env attempts to run nova's unit tests with the minimum supported software versions declared in nova's lower-constraints.txt. Due to the way the install command is specified in the default tox env

    install_command = pip install -c{env:UPPER_CONSTRAINTS_FILE:https://git.openstack.org/cgit/openstack/requirements/plain/upper-constraints.txt} {opts} {packages}

the upper-constraints.txt was also passed to pip. pip's constraint solver takes the first definition of a constraint and discards all redefinitions. Because upper-constraints.txt was included before lower-constraints.txt, the lower constraints were ignored. There are two patches proposed to fix this, https://review.openstack.org/#/c/622972 and https://review.openstack.org/#/c/645392; we should merge one of them.

** Affects: nova Importance: Undecided Status: New

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1822575

Title: lower-constraints are not used in gate job

Status in OpenStack Compute (nova): New

Bug description: the lower-constraints tox env attempts to run nova's unit tests with the minimum supported software versions declared in nova's lower-constraints.txt. Due to the way the install command is specified in the default tox env

    install_command = pip install -c{env:UPPER_CONSTRAINTS_FILE:https://git.openstack.org/cgit/openstack/requirements/plain/upper-constraints.txt} {opts} {packages}

the upper-constraints.txt was also passed to pip. pip's constraint solver takes the first definition of a constraint and discards all redefinitions. Because upper-constraints.txt was included before lower-constraints.txt, the lower constraints were ignored. There are two patches proposed to fix this, https://review.openstack.org/#/c/622972 and https://review.openstack.org/#/c/645392; we should merge one of them.

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1822575/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
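One way this can be avoided (illustrative only; the exact change is in the two reviews linked above) is to give the lower-constraints env its own install_command so that upper-constraints.txt is never passed to pip for that env:

    [testenv:lower-constraints]
    install_command = pip install {opts} {packages}
    deps =
      -c{toxinidir}/lower-constraints.txt
      -r{toxinidir}/test-requirements.txt
      -r{toxinidir}/requirements.txt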
[Yahoo-eng-team] [Bug 1821938] Re: No nova hypervisor can be enabled on workers with QAT devices
** Also affects: nova Importance: Undecided Status: New ** Changed in: nova Importance: Undecided => High ** Changed in: nova Assignee: (unassigned) => sean mooney (sean-k-mooney) ** Changed in: nova Status: New => In Progress ** Tags added: stein-rc-potential -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1821938 Title: No nova hypervisor can be enabled on workers with QAT devices Status in OpenStack Compute (nova): In Progress Status in StarlingX: Triaged Bug description: Brief Description - Unable to enable a host as nova hypervisor due to pci device cannot be found if the host has QAT devices (C62x or DH895XCC) configured. Severity Major Steps to Reproduce -- - Install and configure a system where worker nodes have QAT devices configured. e.g., [wrsroot@controller-0 ~(keystone_admin)]$ system host-device-list compute-0 +--+--+--+---+---+---+-++---+-+ | name | address | class id | vendor id | device id | class name | vendor name | device name | numa_node | enabled | +--+--+--+---+---+---+-++---+-+ | pci__09_00_0 | :09:00.0 | 0b4000 | 8086 | 0435 | Co-processor | Intel Corporation | DH895XCC Series QAT | 0 | True | | pci__0c_00_0 | :0c:00.0 | 03 | 102b | 0522 | VGA compatible controller | Matrox Electronics Systems Ltd. | MGA G200e [Pilot] ServerEngines (SEP1) | 0 | True | +--+--+--+---+---+---+-++---+-+ compute-0:~$ lspci | grep QAT 09:00.0 Co-processor: Intel Corporation DH895XCC Series QAT 09:01.0 Co-processor: Intel Corporation DH895XCC Series QAT Virtual Function 09:01.1 Co-processor: Intel Corporation DH895XCC Series QAT Virtual Function ... - check nova hypervisor-list Expected Behavior -- - Nova hypervisors exist on system Actual Behavior [wrsroot@controller-0 ~(keystone_admin)]$ nova hypervisor-list ++-+---++ | ID | Hypervisor hostname | State | Status | ++-+---++ ++-+---++ Reproducibility --- Reproducible System Configuration Any system type with QAT devices configured on worker node Branch/Pull Time/Commit --- master as of 2019-03-18 Last Pass -- on f/stein branch in early feb Timestamp/Logs -- # nova-compute pods are spewing errors so they can't register themselves properly as hypervisors: 2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager [req-4f652d4c-da7e-4516-9baa-915265c3fdda - - - - -] Error updating resources for node compute-0.: PciDeviceNotFoundById: PCI device :09:02.3 not found 2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager Traceback (most recent call last): 2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 7956, in _update_available_resource_for_node 2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager startup=startup) 2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 727, in update_available_resource 2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager resources = self.driver.get_available_resource(nodename) 2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File "/var/lib/openstack/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7098, in get_available_resource 2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager self._get_pci_passthrough_devices() 2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File 
"/var/lib/openstack/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6102, in _get_pci_passthrough_devices 2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager pci_info.append(self._get_pcidev_info(name)) 2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager File "/var/lib/openstack/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6062, in _get_pcidev_info 2019-03-25 18:46:49,899.899 62394 ERROR nova.compute.manager
[Yahoo-eng-team] [Bug 1829161] [NEW] Could not install packages due to an EnvironmentError: HTTPSConnectionPool(host='git.openstack.org', port=443)
Public bug reported: The tempest jobs have started to periodically fail with

    Could not install packages due to an EnvironmentError: HTTPSConnectionPool(host='git.openstack.org', port=443)

starting on May 6th. http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22Could%20not%20install%20packages%20due%20to%20an%20EnvironmentError:%20HTTPSConnectionPool(host%3D'git.openstack.org',%20port%3D443)%5C%22 Based on the logstash results this has been hit ~330 times in the last 7 days. It appears to trigger more frequently on the grenade jobs but also affects others. This looks like an infra issue, likely related to the redirects not working in all cases. This is a tracking bug until the issue is resolved.

** Affects: nova Importance: Critical Status: Triaged ** Tags: gate-failure

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1829161

Title: Could not install packages due to an EnvironmentError: HTTPSConnectionPool(host='git.openstack.org', port=443)

Status in OpenStack Compute (nova): Triaged

Bug description: The tempest jobs have started to periodically fail with

    Could not install packages due to an EnvironmentError: HTTPSConnectionPool(host='git.openstack.org', port=443)

starting on May 6th. http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22Could%20not%20install%20packages%20due%20to%20an%20EnvironmentError:%20HTTPSConnectionPool(host%3D'git.openstack.org',%20port%3D443)%5C%22 Based on the logstash results this has been hit ~330 times in the last 7 days. It appears to trigger more frequently on the grenade jobs but also affects others. This looks like an infra issue, likely related to the redirects not working in all cases. This is a tracking bug until the issue is resolved.

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1829161/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1821089] Re: assign PCI slot for VM's NIC persistently
Stable device naming within the guest is OS dependent and strictly out of scope of nova to fix. nova does not chose the address at which device are attached and the nova api doe not guarentee stable nic ordering. the vm pci adress is determined by libvirt. the device role tagging feature was developed for this usecase specifically so that vms could determin the mapping between device that are exposed to the guest and the openstack resouce the correspond to in a hyperviors and os independent way. https://specs.openstack.org/openstack/nova-specs/specs/mitaka/approved/virt-device-role-tagging.html ** Changed in: nova Status: New => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1821089 Title: assign PCI slot for VM's NIC persistently Status in OpenStack Compute (nova): Invalid Bug description: Nova doesn't care about PCI slot number where virtual NIC is attached. As a result guests (recent Ubuntu for example) in which NIC name depends on PCI slot number rename interfaces in circumstances described below: 1. Launch VM using Ubuntu cloud image with 1 interface. Name of the interface will be like "ens3" $ lspci 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02) 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] 00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01) 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03) 00:02.0 VGA compatible controller: Cirrus Logic GD 5446 00:03.0 Ethernet controller: Red Hat, Inc Virtio network device 00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device 00:05.0 SCSI storage controller: Red Hat, Inc Virtio block device 00:06.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon 2. Attach more interfaces (nova interface-attach). Attached interfaces will get names like "ens7" $ lspci 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02) 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] 00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01) 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03) 00:02.0 VGA compatible controller: Cirrus Logic GD 5446 00:03.0 Ethernet controller: Red Hat, Inc Virtio network device 00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device 00:05.0 SCSI storage controller: Red Hat, Inc Virtio block device 00:06.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon 00:07.0 Ethernet controller: Red Hat, Inc Virtio network device 3. Do "nova reboot --hard" for this VM (this action regenerates XML in Libvirt). Interfaces "ens7" will be renamed to "ens4" since Libvirt XML for this VM will be recreated. 
lspci 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02) 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] 00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01) 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03) 00:02.0 VGA compatible controller: Cirrus Logic GD 5446 00:03.0 Ethernet controller: Red Hat, Inc Virtio network device 00:04.0 Ethernet controller: Red Hat, Inc Virtio network device 00:05.0 SCSI storage controller: Red Hat, Inc Virtio block device 00:06.0 SCSI storage controller: Red Hat, Inc Virtio block device 00:07.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon 4. Compare names of interfaces after step 2 and step 3. Same happens after interfaces detached: For example if VM has ens3, ens4, ens5 then detach ens4 then ens5 will be renamed to renamed on hard reboot. Ideally I would expect from Nova to assign PCI slot number to attached devices and keep this assignment in XML in /var/lib/nova/instances//libvirt.xml OpenStack version: Newton (newer versions also affected) hypervisor: Libvirt+KVM networking type: Neutron with OpenVSwitch To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1821089/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
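To make the device role tagging suggestion above concrete: a guest can map its NICs to OpenStack ports without relying on PCI slot ordering by reading the tagged device list from the metadata service or config drive. The JSON shape shown in the comments below is approximate and for illustration only; the exact schema is in the metadata service documentation.

    # query the metadata service from inside the guest
    curl -s http://169.254.169.254/openstack/latest/meta_data.json | python3 -m json.tool
    # the "devices" list then associates each tagged device with its bus
    # address and MAC, roughly like:
    #   "devices": [
    #     {"type": "nic", "bus": "pci", "address": "0000:00:04.0",
    #      "mac": "fa:16:3e:b8:f1:4c", "tags": ["data-plane"]}
    #   ]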
[Yahoo-eng-team] [Bug 1831723] [NEW] The flavor hide_hypervisor_id value can be overridden by the image img_hide_hypervisor_id
Public bug reported: During the implementation of enabling hypervisor hiding for Windows guests it became apparent that a latent bug exists that allows non-privileged users to override the policy set by the admin in the flavor by uploading a custom image.

By convention, back in the Havana/Icehouse days we used to allow the flavor to take precedence over the image if there was a conflict, and log a warning. Sometime around Liberty/Mitaka we decided that was a bad user experience for end users, as they did not receive what they asked for, and started to convert all conflicts into a hard error. The only case where we intentionally allow the image to take precedence over the flavor is hw:mem_page_size, where it is allowed if and only if the admin has explicitly set hw:mem_page_size to 'large' or 'any'. In other words, unless the admin has opted in to allowing the image to take precedence, by not setting a value in the flavor or by setting it to a value that allows the image to refine the choice, we do not support the image overriding the flavor.

The current code does exactly that by the use of a logical or:

    flavor_hide_kvm = strutils.bool_from_string(
        flavor.get('extra_specs', {}).get('hide_hypervisor_id'))
    if (virt_type in ("qemu", "kvm") and
            (image_meta.properties.get('img_hide_hypervisor_id') or
             flavor_hide_kvm)):

and the new code

    hide_hypervisor_id = (strutils.bool_from_string(
        flavor.extra_specs.get('hide_hypervisor_id')) or
        image_meta.properties.get('img_hide_hypervisor_id'))

exhibits the same behavior. In both cases, if img_hide_hypervisor_id=true and hide_hypervisor_id=false, hypervisor hiding will be enabled. In this specific case the side effects of this are safe, but it may not be in all cases of this pattern.

** Affects: nova Importance: Undecided Status: New ** Tags: libvirt

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1831723

Title: The flavor hide_hypervisor_id value can be overridden by the image img_hide_hypervisor_id

Status in OpenStack Compute (nova): New

Bug description: During the implementation of enabling hypervisor hiding for Windows guests it became apparent that a latent bug exists that allows non-privileged users to override the policy set by the admin in the flavor by uploading a custom image.

By convention, back in the Havana/Icehouse days we used to allow the flavor to take precedence over the image if there was a conflict, and log a warning. Sometime around Liberty/Mitaka we decided that was a bad user experience for end users, as they did not receive what they asked for, and started to convert all conflicts into a hard error. The only case where we intentionally allow the image to take precedence over the flavor is hw:mem_page_size, where it is allowed if and only if the admin has explicitly set hw:mem_page_size to 'large' or 'any'. In other words, unless the admin has opted in to allowing the image to take precedence, by not setting a value in the flavor or by setting it to a value that allows the image to refine the choice, we do not support the image overriding the flavor.
The current code does exactly that by the use of a logical or:

    flavor_hide_kvm = strutils.bool_from_string(
        flavor.get('extra_specs', {}).get('hide_hypervisor_id'))
    if (virt_type in ("qemu", "kvm") and
            (image_meta.properties.get('img_hide_hypervisor_id') or
             flavor_hide_kvm)):

and the new code

    hide_hypervisor_id = (strutils.bool_from_string(
        flavor.extra_specs.get('hide_hypervisor_id')) or
        image_meta.properties.get('img_hide_hypervisor_id'))

exhibits the same behavior. In both cases, if img_hide_hypervisor_id=true and hide_hypervisor_id=false, hypervisor hiding will be enabled. In this specific case the side effects of this are safe, but it may not be in all cases of this pattern.

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1831723/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
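A minimal sketch of the precedence the report argues for (illustrative only, not nova's actual fix): the image may only refine the behaviour when the admin has not set the extra spec at all, and an explicit flavor value always wins. The function name and argument shapes are placeholders.

    from oslo_utils import strutils

    def should_hide_hypervisor_id(flavor_extra_specs, image_props):
        flavor_value = flavor_extra_specs.get('hide_hypervisor_id')
        if flavor_value is not None:
            # the admin made an explicit choice; the image cannot override it
            return strutils.bool_from_string(flavor_value)
        # no admin opinion, so the image is allowed to opt in
        return strutils.bool_from_string(
            image_props.get('img_hide_hypervisor_id', False))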
[Yahoo-eng-team] [Bug 1831886] [NEW] default pcie hotplug behavior changes when using q35
Public bug reported: The q35 machine type supports native PCIe hotplug instead of the legacy ACPI-based PCI hotplug. This has several advantages and one major disadvantage: with the new PCIe approach you need to pre-allocate the PCIe slots so that they will be available for hotplug if needed.

To support this a new num_pcie_ports config option was added to the libvirt section of nova.conf https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.num_pcie_ports The default value of this config is 0, which means we use libvirt's default. libvirt's default is to allocate 1 free PCIe port, so by default you cannot attach more than 1 device without hard rebooting the VM. Previously, when using the pc machine type with the i440fx chipset, it was possible to attach multiple interfaces or volumes.

As a result the end-user behavior has changed, as observed by the failure in tempest.api.compute.servers.test_attach_interfaces.AttachInterfacesTestJSON.test_create_list_show_delete_interfaces_by_network_port with the default setting and q35 enabled, as reported in this downstream bug https://bugzilla.redhat.com/show_bug.cgi?id=1716356

To fix this the suggestion is to set the max value and default value of the num_pcie_ports config option to 32. Based on some minimal local testing, the memory usage of this change is ~0.4 MB per port, or ~12.5 MB per VM in additional qemu overhead. This is based on testing done with libvirt directly with memory preallocation enabled: for a 2G guest with the pc machine type and i440fx chipset a total memory of 2036 MB was observed; q35 with 4 ports (the default value that will be calculated by libvirt for the default devices) increased this to 2056 MB, and q35 with 32 ports to 2066 MB. As such this is a minimal overhead increase which can still be controlled by setting the config to a lower value explicitly.

** Affects: nova Importance: Low Assignee: Kashyap Chamarthy (kashyapc) Status: In Progress ** Tags: libvirt

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1831886

Title: default pcie hotplug behavior changes when using q35

Status in OpenStack Compute (nova): In Progress

Bug description: The q35 machine type supports native PCIe hotplug instead of the legacy ACPI-based PCI hotplug. This has several advantages and one major disadvantage: with the new PCIe approach you need to pre-allocate the PCIe slots so that they will be available for hotplug if needed.

To support this a new num_pcie_ports config option was added to the libvirt section of nova.conf https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.num_pcie_ports The default value of this config is 0, which means we use libvirt's default. libvirt's default is to allocate 1 free PCIe port, so by default you cannot attach more than 1 device without hard rebooting the VM. Previously, when using the pc machine type with the i440fx chipset, it was possible to attach multiple interfaces or volumes.
As a result the end-user behavior has changed, as observed by the failure in tempest.api.compute.servers.test_attach_interfaces.AttachInterfacesTestJSON.test_create_list_show_delete_interfaces_by_network_port with the default setting and q35 enabled, as reported in this downstream bug https://bugzilla.redhat.com/show_bug.cgi?id=1716356

To fix this the suggestion is to set the max value and default value of the num_pcie_ports config option to 32. Based on some minimal local testing, the memory usage of this change is ~0.4 MB per port, or ~12.5 MB per VM in additional qemu overhead. This is based on testing done with libvirt directly with memory preallocation enabled: for a 2G guest with the pc machine type and i440fx chipset a total memory of 2036 MB was observed; q35 with 4 ports (the default value that will be calculated by libvirt for the default devices) increased this to 2056 MB, and q35 with 32 ports to 2066 MB. As such this is a minimal overhead increase which can still be controlled by setting the config to a lower value explicitly.

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1831886/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
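In the meantime, operators who deploy q35 guests can raise the value themselves. A minimal nova.conf sketch (the value 16 is illustrative, chosen only to show the shape of the setting):

    [libvirt]
    # pre-allocate extra PCIe root ports so devices can be hotplugged into
    # q35 guests without a hard reboot
    num_pcie_ports = 16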
[Yahoo-eng-team] [Bug 1832169] Re: device_type of PCI alias config could be mismatched
The device_type is optional, but if set it will be checked: https://github.com/openstack/nova/blob/51e3787bf89f19af8a9d37288a63731563c92fca/nova/pci/request.py#L136-L138

type-PCI is not intended for use with devices that are capable of SR-IOV and exists primarily for use with PCI devices that are not NICs. type-PCI is reserved for devices that will be passed through via the PCI alias in the flavor and that should not be requestable by a neutron-based SR-IOV port. It is generally used for GPUs, crypto cards like Intel QAT devices, or NICs that are not managed by neutron and do not support SR-IOV. type-PF is used for devices that will be requested using a neutron port with vnic_type=direct-physical, and type-VF is used for devices that will be requested using a neutron port with vnic_type=direct. type-PF and type-VF may also be used for non-NIC devices, but in that case the physical_network tag must not be set in the PCI whitelist.

When we process a neutron port we translate from the port vnic type to the correct device_type here: https://github.com/openstack/nova/blob/212607dc6feaf311ba92295fd07363b3ee9ae010/nova/network/neutronv2/api.py#L2046-L2060

When enumerating the devices in the libvirt virt driver here https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/driver.py#L6076-L6113 we interpret the device capabilities to determine which type to use when reporting the device. Depending on the NIC, firmware and driver options, the presence of the virtual_function capability in the PCI capabilities reported by libvirt can change. That is to say, on older generation Intel Niantic NICs such as the Intel 82599 series, the presence of the virtual_function capability was conditional on whether Data Center Bridging was enabled in the firmware. In Data Center Bridging mode SR-IOV was disabled to allow VMDQ to be used, so even with the same vendor and product ID the device type can change.

When a device supports SR-IOV and is listed as a PF there are also additional checks that the scheduler and PCI resource tracker must perform to determine that a PF is available for assignment to a VM. The most important being that the PCI resource tracker must first confirm that the PF either has no VFs or that all of its VFs are free. For type-PCI we do not have to do that check as we know it does not support SR-IOV and therefore will not have VFs that could be in use.

** Changed in: nova Importance: Undecided => Wishlist ** Changed in: nova Status: New => Opinion

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1832169

Title: device_type of PCI alias config could be mismatched

Status in OpenStack Compute (nova): Opinion

Bug description: Currently, to use the PCI passthrough functionality the admin should specify an alias for the PCI devices, and the format is like below:

    alias = { "vendor_id":"8086", "product_id":"1528", "device_type":"type-PCI", "name":"nic" }

What I think is confusing about this configuration is that there is just one "device_type" for the device. I assume that device_type is not needed for the device to be identified, since libvirt determines the device_type for a given device. IOW, I suspect it never happens like below:

    alias = { "vendor_id":"8086", "product_id":"1528", "device_type":"type-PCI", "name":"nic" }
    alias = { "vendor_id":"8086", "product_id":"1528", "device_type":"type-PF", "name":"nic" }

I strongly believe the PCI device having the 8086:1528 ID already has a unique device_type set, though I'm not 100% sure.
So my point is that it's better to delete the device_type attribute from the config so that the admin does not need to care about the device type. I think it's a big barrier to using the PCI passthrough functionality for anyone who is not familiar with the concept. Thanks.

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1832169/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
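For illustration of the distinction drawn in the comment above, a sketch of the two consumption paths (vendor/product IDs, device names and physnet names are examples only):

    [pci]
    # a non-SR-IOV accelerator, consumed via a flavor alias (type-PCI)
    alias = { "vendor_id": "8086", "product_id": "0443", "device_type": "type-PCI", "name": "qat" }
    # an SR-IOV capable NIC managed by neutron: whitelist it with its
    # physical_network and consume it through a port with
    # vnic_type=direct (VF) or direct-physical (PF), not through an alias
    passthrough_whitelist = { "devname": "ens2f0", "physical_network": "physnet1" }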
[Yahoo-eng-team] [Bug 1824048] Re: SRIOV pci_numa_policy dosen't working when create instance with 'cpu_policy' and 'num_nodes'
*** This bug is a duplicate of bug 1805891 *** https://bugs.launchpad.net/bugs/1805891 ** This bug has been marked a duplicate of bug 1805891 pci numa polices are not followed -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1824048 Title: SRIOV pci_numa_policy dosen't working when create instance with 'cpu_policy' and 'num_nodes' Status in OpenStack Compute (nova): New Bug description: Description === When we create a sriov instance, which flavor has 'cpu_policy', 'numa_nodes' property, it's also has 'pci_numa_policy=preferred' property that indicate we are able to allocate pci devices and vcpu in different numa node. However, in some cases, it didn't work, because the fake pci request, which produced by nova flavor and [pci] alias in nova.conf hasn't write related information (such as pci_numa_policy, alias_name and some spec info) into real pci_requests (which contain port_id). So, in nova/pci/stats.py function 'def _filter_pools_for_numa_cells', there will filter all pci devices. Environment === Openstack Queen compute node information: Two numa nodes(node-0 node-1), SRIOV-PCI devices associated with NUMA node-1, but cpus of node-1 have run out. Steps to reproduce == nova.conf [pci] alias = {"name": "QuickAssist","product_id": "10ed","vendor_id": "8086","device_type": "type-VF","numa_policy": "preferred"} nova flavor ++-+ | Property | Value | ++-+ | OS-FLV-DISABLED:disabled | False | | OS-FLV-EXT-DATA:ephemeral | 0 | | disk | 20 | | extra_specs| {"hw:pci_numa_policy": "preferred", "hw:cpu_policy": "dedicated", "hw:numa_nodes": "1", "hw:cpu_cores": "4", "pci_passthrough:alias": "QuickAssist:1"} | | id | 430e1afd-a72b-41c6-b9b2-ea9b6aa9f037 | | name | multiqueue | | os-flavor-access:is_public | True | | ram| 2048 | | rxtx_factor| 1.0 | | swap | | | vcpus | 4 | ++-+ neutron port: one or some 'direct' ports; Expected result === The instance coul
[Yahoo-eng-team] [Bug 1805891] Re: pci numa polices are not followed
as this feature never worked on rocky and queens i am marking it as wont fix as it would be effectivly a feature backport based on matt's comment here https://review.opendev.org/#/c/641653/1//COMMIT_MSG@13 ** Also affects: nova/rocky Importance: Undecided Status: New ** Also affects: nova/queens Importance: Undecided Status: New ** Also affects: nova/stein Importance: Undecided Status: New ** Changed in: nova/stein Status: New => Fix Released ** Changed in: nova/rocky Status: New => Won't Fix ** Changed in: nova/queens Status: New => Won't Fix -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1805891 Title: pci numa polices are not followed Status in OpenStack Compute (nova): Fix Released Status in OpenStack Compute (nova) queens series: Won't Fix Status in OpenStack Compute (nova) rocky series: Won't Fix Status in OpenStack Compute (nova) stein series: Fix Released Bug description: Description === https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/share-pci-between-numa-nodes.html introduced the concept of numa affinity policies for pci passthough devices. upon testing it was observed that the prefer policy is broken. for contested there is a sperate bug to track the lack of support for neutron sriov interfaces. https://bugs.launchpad.net/nova/+bug/1795920 so the scope of this bug is limited pci numa policies for passtrhough devices using a flavor alias. background -- by default in nova pci devices are numa affinitesed using the legacy policy. but you can override this behavior via the alias. when set to prefer nova should fall back to no numa affintiy bwteen the guest and the pci devce if a device on a local numa node is not availeble. the policies are discibed below. legacy This is the default value and it describes the current nova behavior. Usually we have information about association of PCI devices with NUMA nodes. However, some PCI devices do not provide such information. The legacy value will mean that nova will boot instances with PCI device if either: The PCI device is associated with at least one NUMA nodes on which the instance will be booted There is no information about PCI-NUMA affinity available preferred This value will mean that nova-scheduler will choose a compute host with minimal consideration for the NUMA affinity of PCI devices. nova-compute will attempt a best effort selection of PCI devices based on NUMA affinity, however, if this is not possible then nova-compute will fall back to scheduling on a NUMA node that is not associated with the PCI device. Note that even though the NUMATopologyFilter will not consider NUMA affinity, the weigher proposed in the Reserve NUMA Nodes with PCI Devices Attached spec [2] can be used to maximize the chance that a chosen host will have NUMA-affinitized PCI devices. Steps to reproduce == the test case was relitively simple - deploy a singel node devstack install on a host with 2 numa nodes. - enable the pci and numa topology fileters - whitelist a pci device attach to numa_node 0 e.g. passthrough_whitelist = { "address": ":01:00.1" } - adust the vcpu_pin_set to only list the cpus on numa_node 1 e.g. 
vcpu_pin_set=8-15 - create an alias in the [pci] section of nova.conf alias = { "vendor_id":"8086", "product_id":"10c9", "device_type":"type-PF", "name":"nic-pf", "numa_policy": "preferred"} - restart the nova services sudo systemctl restart devstack@n-* - update a flavor with the alias and a numa topology of 1 openstack flavor set --property pci_passthrough:alias='nic-pf:1' 42 openstack flavor set --property hw:numa_nodes=1 42
| Field                      | Value |
| OS-FLV-DISABLED:disabled   | False |
| OS-FLV-EXT-DATA:ephemeral  | 0 |
| access_project_ids         | None |
| disk                       | 0 |
| id                         | 42 |
| name                       | m1.nano |
| os-flavor-access:is_public | True |
| properties                 | hw:numa_nodes='1', pci_passthrough:alias='nic-pf:1' |
| ram                        | 64
[Yahoo-eng-team] [Bug 1802973] Re: Failed to create VM with no IP assigned to SR-IOV port
Nova does not currently have support for neutron ports without an IP. When support was added for the neutron port ip_allocation policies, only support for immediate and deferred was implemented. I believe work is planned to add support for address-less ports in Train, but I am closing this as invalid as it has never been supported. (A short workaround example follows the quoted report below.) ** Tags added: neutr ** Tags removed: neutr ** Tags added: libvirt neutron ** Changed in: nova Importance: Undecided => Wishlist ** Changed in: nova Status: New => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1802973 Title: Failed to create VM with no IP assigned to SR-IOV port Status in OpenStack Compute (nova): Invalid Bug description: Description === Failed to create an instance because of a port attached with no IP address assigned. The port is created over a flat network mapped to a SR-IOV interface. Steps to reproduce == A chronological list of steps which will bring off the issue you noticed: 1. Create network: openstack network create --provider-physical-network physnet1 --provider-network-type flat sriov-net 2. Create port: openstack port create --network sriov-net --vnic-type macvtap (direct) --no-fixed-ip sriov-port 3. Create server: openstack server create --image cloud.img --flavor your_flavor --key-name ssh-key --port sriov-port vm_name Expected result === The instance should start with a layer 2 interface configured over a sr-iov virtual function. Actual result = Port 6dca94cb-1ed5-4131-bc69-4736db5f9f18 requires a FixedIP in order to be used. (HTTP 400) Environment === 1. Openstack Rocky managed via Juju and deployed using MAAS. 2. Which hypervisor did you use? Libvirt + KVM 2. Which storage type did you use? Ceph 3. Which networking type did you use? Neutron To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1802973/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
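As a minimal workaround sketch, assuming the same flat network from the report above (the port name here is just an example): since nova only accepts ports whose ip_allocation is immediate or deferred, give the port a fixed IP instead of passing --no-fixed-ip:

    openstack port create --network sriov-net --vnic-type direct sriov-port-with-ip
    openstack server create --image cloud.img --flavor your_flavor --key-name ssh-key --port sriov-port-with-ip vm_name

The guest can simply ignore the assigned address if only layer-2 connectivity is wanted.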
[Yahoo-eng-team] [Bug 1768917] Re: PCI-Passthrough documentation is incorrect while trying to pass through a NIC
you are conflating two different things: aliases are not used for neutron based sriov networking, and nics that are passed through via a flavor alias are not managed by neutron or the sriov nic agent. the documentation in https://docs.openstack.org/nova/pike/admin/pci-passthrough.html describes how to do generic pci passthrough of a host pci device, not neutron sriov direct-physical passthrough. to give a PF to a vm that is managed by neutron you create a port with vnic_type=direct-physical. in that scenario, when whitelisting the nic you also need to add the physical_network in the whitelist (an example of this is included after the quoted report below). the flavor and alias based approach described in https://docs.openstack.org/nova/pike/admin/pci-passthrough.html is intended for passing through devices like gpus or accelerator cards such as intel qat devices. the docs use the vendor and product ids of an intel niantic nic simply because that is what we tested this functionality with when it was implemented, but we could have used a QAT device in the example, which does not work with neutron sriov.
[pci]
alias = '{
  "name": "QuickAssist",
  "product_id": "0443",
  "vendor_id": "8086",
  "device_type": "type-PCI",
  "numa_policy": "legacy"
}'
** Changed in: nova Importance: Undecided => Low ** Changed in: nova Status: New => Invalid ** Tags added: docs -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1768917 Title: PCI-Passthrough documentation is incorrect while trying to pass through a NIC Status in OpenStack Compute (nova): Invalid Bug description: As per the documentation shown below https://docs.openstack.org/nova/pike/admin/pci-passthrough.html In order to achieve PCI passthrough of a network device, it states that we should create a 'flavor' based on the alias and then associate a flavor to the server create function. Steps to follow: Create an Alias: [pci] alias = { "vendor_id":"8086", "product_id":"154d", "device_type":"type-PF", "name":"a1" } Create a Flavor: [pci] alias = { "vendor_id":"8086", "product_id":"154d", "device_type":"type-PF", "name":"a1" } Add a whitelist: [pci] passthrough_whitelist = { "address": ":41:00.0" } Create a Server with the Flavor: # openstack server create --flavor m1.large --image cirros-0.3.5-x86_64-uec --wait test-pci With the above command, the VM creation errors out and we see a PortBindingFailure. The reason for the PortBindingFailure is the 'vif_type' is always set to 'BINDING_FAILED'. The reason being, the flavor does not mention 'vnic_type'='direct-physical'; without this information the sriov mechanism driver is not able to bind the port. Not sure if there is any way to specify the info in the flavor. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1768917/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
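As a hedged illustration of the neutron-managed path described in the comment above (the physical network and network/port names here are examples, not taken from the report):

    # nova.conf on the compute node: whitelist the PF and tag it with its physical network
    [pci]
    passthrough_whitelist = { "vendor_id": "8086", "product_id": "154d", "physical_network": "physnet2" }

    # create a direct-physical port on a network mapped to that physical network and boot with it
    openstack port create --network sriov-net --vnic-type direct-physical pf-port
    openstack server create --flavor m1.large --image cirros-0.3.5-x86_64-uec --port pf-port test-pf

No flavor alias is involved in this flow; nova generates the PCI request from the port itself.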
[Yahoo-eng-team] [Bug 1768919] Re: PCI-Passthrough fails when we have Flavor configured and provide a port with vnic_type=direct-physical
*** This bug is a duplicate of bug 1768917 *** https://bugs.launchpad.net/bugs/1768917 i have closed this as a duplicate because, as i explained in the other bug, you misunderstood how to use this feature. based on the information you provided on the other bug i am assuming you have only one nic available on the host and you are requesting it twice: once via the alias and again via the neutron port. that is incorrect; you need 1 device for each request. to use neutron PF passthrough (vnic_type=direct-physical) you should not also specify a flavor alias unless you are using that to request a different device. nova will convert a port with vnic_type=direct-physical into a pci request internally. ** This bug has been marked a duplicate of bug 1768917 PCI-Passthrough documentation is incorrect while trying to pass through a NIC -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1768919 Title: PCI-Passthrough fails when we have Flavor configured and provide a port with vnic_type=direct-physical Status in OpenStack Compute (nova): New Bug description: PCI-Passthrough of a NIC device to the VM fails, when we have both the Flavor configured with Alias and also provide a network port with 'vnic_type=direct-physical'. The comment shown in the source code shown below, https://github.com/openstack/nova/blob/644ac5ec37903b0a08891cc403c8b3b63fc2a91c/nova/compute/api.py#L812 # PCI requests come from two sources: instance flavor and # requested_networks. The first call in below returns an # InstancePCIRequests object which is a list of InstancePCIRequest # objects. The second call in below creates an InstancePCIRequest # object for each SR-IOV port, and append it to the list in the # InstancePCIRequests object In this case there would be two PCI-requests for the same device and _test_pci fails when the compute tries to check for the Claims. 
088d81f6653242318245b137b1ef91c7] _test_pci /opt/stack/venv/nova-20180424T164716Z/lib/python2.7/site-packages/nova/compute/claims.py:201 2018-04-30 22:17:06.058 13396 DEBUG nova.compute.claims [req-c7689c16-227a-462e-aad5-4c462036051c df7bd0a08ee64da981574d7a7d76970a 088d81f6653242318245b137b1ef91c7] pci requests: [InstancePCIRequest(alias_name='intel10fb',count=1,is_new=False,request_id=None,spec=[{dev_type='type-PF',product_id='10fb',vendor_id='8086'}]), InstancePCIRequest(alias_name=None,count=1,is_new=False,request_id=13befe5f-478f-4f4c-aa72-78cce84d942d,spec=[{dev_type='type-PF',physical_network='physnet2'}])] _test_pci /opt/stack/venv/nova-20180424T164716Z/lib/python2.7/site-packages/nova/compute/claims.py:202 2018-04-30 22:17:06.059 13396 DEBUG nova.compute.claims [req-c7689c16-227a-462e-aad5-4c462036051c df7bd0a08ee64da981574d7a7d76970a 088d81f6653242318245b137b1ef91c7] PCI request stats failed _test_pci /opt/stack/venv/nova-20180424T164716Z/lib/python2.7/site-packages/nova/compute/claims.py:206 2018-04-30 22:17:06.059 13396 DEBUG oslo_concurrency.lockutils [req-c7689c16-227a-462e-aad5-4c462036051c df7bd0a08ee64da981574d7a7d76970a 088d81f6653242318245b137b1ef91c7] Lock "compute_resources" released by "nova.compute.resource_tracker.instance_claim" :: held 0.059s inner /opt/stack/venv/nova-20180424T164716Z/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:282 2018-04-30 22:17:06.060 13396 DEBUG nova.compute.manager [req-c7689c16-227a-462e-aad5-4c462036051c df7bd0a08ee64da981574d7a7d76970a 088d81f6653242318245b137b1ef91c7] [instance: 39ad3a47-66dc-4114-9653-fee5ee0c87dc] Insufficient compute resources: Claim pci failed.. Not sure why the Claim pci failed for the same device entry twice. Probably if the device id is the same on both Flavor and network, then it should only compose one entry since they both are identical. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1768919/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1817683] Re: Key pair not imported when passing cloud-init script on initiation
i belive you are correct that this behavior is caused by the fact you are creating a usever via cloud init. if you think this is really a nova bug feel free to set the status back to New for the bug to be retriaged. ** Changed in: nova Status: New => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1817683 Title: Key pair not imported when passing cloud-init script on initiation Status in OpenStack Compute (nova): Invalid Bug description: Description === The public SSH key is not imported when an instance is created with a key pair (key pair tab) + cloud-init script (configuration tab) - Reproduced in dashboard (Horizon) - Reproduced with python (nova.server.create()) Steps to reproduce == - Create an instance in the GUI - with a key pair Key pair is inserted [ 22.212331] cloud-init[993]: Cloud-init v. 18.4-0ubuntu1~18.04.1 running 'modules:config' at Tue, 26 Feb 2019 09:44:27 +. Up 21.13 seconds. [[0;32m OK [0m] Started Apply the settings specified in cloud-config. Starting Execute cloud user/final scripts... ci-info: +Authorized keys from /home/ubuntu/.ssh/authorized_keys for user ubuntu++ ci-info: +-+-+-+-+ ci-info: | Keytype |Fingerprint (md5)| Options | Comment | ci-info: +-+-+-+-+ ci-info: | ssh-rsa | 36:b4:ea:45:0a:77:c4:87:c9:71:d5:78:6e:a5:ee:ba |- |-| ci-info: +-+-+-+-+ => login to VM with key pair -> Login successful - Create a second instance - with a key pair - pass a cloud-init script in the user configuration #cloud-config chpasswd: expire: false list: | root:toor jelle:jelle users: - name: jelle lock-passwd: false sudo: ['ALL=(ALL) NOPASSWD:ALL'] groups: sudo shell: /bin/bash ==> Public key from the key-pair is not imported [ 21.472835] cloud-init[937]: Cloud-init v. 18.4-0ubuntu1~18.04.1 running 'modules:config' at Tue, 26 Feb 2019 09:36:21 +. Up 20.47 seconds. [[0;32m OK [0m] Started Apply the settings specified in cloud-config. Starting Execute cloud user/final scripts... ci-info: no authorized ssh keys fingerprints found for user jelle. <14>Feb 26 09:36:23 ec2: <14>Feb 26 09:36:23 ec2: # <14>Feb 26 09:36:23 ec2: -BEGIN SSH HOST KEY FINGERPRINTS- <14>Feb 26 09:36:23 ec2: 1024 SHA256:mfFrY4zKFLuJPRF6Pw6z8suzBzA7jx21sife3MwEee4 root@test (DSA) <14>Feb 26 09:36:23 ec2: 256 SHA256:JzA4J0A6oN5c1vTiGpTPBgqisb1IlxXBumlnk/Jg1Po root@test (ECDSA) <14>Feb 26 09:36:23 ec2: 256 SHA256:j/mU93YAfgHxdrXJD0QT6SMFFoOzRvtES/YZ+9ZBNaM root@test (ED25519) <14>Feb 26 09:36:23 ec2: 2048 SHA256:Hy1gMvK/7hSoyIacAgx+C/jEHkbCi5yS9YbiYfcTVGo root@test (RSA) <14>Feb 26 09:36:23 ec2: -END SSH HOST KEY FINGERPRINTS- <14>Feb 26 09:36:23 ec2: # -BEGIN SSH HOST KEY KEYS- ecdsa-sha2-nistp256 E2VjZHNhLXNoYTItbmlzdHAyNTYIbmlzdHAyNTYAAABBBGBMYWNnP97Znq6Al0LHqzUu8tOa3/T4fuh+PLAIW26b2361MarI/1HxxseRmCUgb45Gw5zXu7CfLhAlHaThirk= root@test ssh-ed25519 C3NzaC1lZDI1NTE5IJ54epYzeKPsUs8UXyac+nTPQGpNY2CQWwBQL4aEPZD6 root@test ssh-rsa B3NzaC1yc2EDAQABAAABAQCwtmWLjZrRB4BVxcWAZt8/uWkkQhMCkrdNQTS40ZGTGto46MyBmyA+4RJxnZ8MV9I/8lpBt1EY5ERdf/5gDwN51wzq57LVuTz46mhYU3i85YECaE98VXG9I52OC0/UzgvlEbwEbVPlMh+ZVkNSkZu4Mcuvi0hvzU7+Z5p8CvWEMhIvtWAKbf/ujK0WzeYRwsqQfGm5hUH6TJSjFRCC/T1DosnM+hgDlNkiYGjlUE9LvSPRTX1rMfakUbWzK/EJWuGuYO21P/oORNDeJxWPZS/Y8cW+VCQbXCuXqXFst347Tvnl/kmZULjRJjB05eAV6Ejto2tRbCku49POA26/GzMj root@test -END SSH HOST KEY KEYS- [ 22.295189] cloud-init[995]: Cloud-init v. 18.4-0ubuntu1~18.04.1 running 'modules:final' at Tue, 26 Feb 2019 09:36:23 +. Up 22.06 seconds. 
[ 22.299328] cloud-init[995]: ci-info: no authorized ssh keys fingerprints found for user jelle. [ 22.301658] cloud-init[995]: Cloud-init v. 18.4-0ubuntu1~18.04.1 finished at Tue, 26 Feb 2019 09:36:23 +. Datasource DataSourceOpenStackLocal [net,ver=2]. Up 22.27 seconds => Login with keypair -> login fails Environment === ubuntu@juju-5dc387-0-lxd-6:~$ nova-manage --version 15.1.5 ubuntu@juju-5dc387-0-lxd-6:~$ dpkg -l | grep nova ii nova-api-os-compute 2:15.1.5-0ubuntu1~cloud0 all OpenStack Compute - OpenStack Compute API frontend ii nova-common
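For completeness, a minimal cloud-config sketch that keeps the nova key pair usable while still creating the extra user; the 'default' entry (which preserves the distro default user that normally receives the nova key pair) and the explicit ssh_authorized_keys line are assumptions on my part, not taken from the report:

    #cloud-config
    users:
      - default                       # keep the default user that receives the nova key pair
      - name: jelle
        lock_passwd: false
        sudo: ['ALL=(ALL) NOPASSWD:ALL']
        groups: sudo
        shell: /bin/bash
        ssh_authorized_keys:
          - <paste the key pair's public key here if this user should also accept it>

Defining a users: list without 'default' replaces the default user, which is why the injected key is never installed in the second instance.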
[Yahoo-eng-team] [Bug 1815762] Re: you can end up in a state where qvo* interfaces aren't owned by ovs which results in a dangling connection
this might be something that could be added to the existing neutron-ovs-cleanup script that is generated by this entry point https://github.com/openstack/neutron/blob/master/setup.cfg#L49 and implemented here https://github.com/openstack/neutron/blob/master/neutron/cmd/ovs_cleanup.py but this should not live in nova. ** Also affects: neutron Importance: Undecided Status: New ** Changed in: nova Status: New => Won't Fix -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1815762 Title: you can end up in a state where qvo* interfaces aren't owned by ovs which results in a dangling connection Status in neutron: New Status in OpenStack Compute (nova): Won't Fix Bug description: While upgrading to rocky, we ended up with a broken openvswitch infrastructure and moved back to the old openvswitch. We ended up with new machines working while old machines didn't, and it took a while to realize that we had qvo* interfaces that not only weren't plugged but also weren't owned by ovs-system - basically the virtual equivalent of forgetting to plug in the cable ;) This was quickly addressed by running this bash-ism on all nodes: for x in `ip a |grep qvo |grep @qvb |grep -v ovs-system | awk '{ print $2 }'` ; do y=${x%%"@"*} && ip link delete $y ; done ; docker restart nova_compute However, nova could pretty easily sanity check this =) To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1815762/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
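Building on the suggestion above that this belongs in neutron-ovs-cleanup rather than nova, a rough detection-only sketch (it assumes the hybrid-plug qvo*/br-int naming used in the report and does not delete anything):

    # list qvo* interfaces that ovs no longer knows about
    for dev in $(ip -o link show | awk -F': ' '{print $2}' | cut -d@ -f1 | grep '^qvo'); do
        ovs-vsctl list-ports br-int | grep -qx "$dev" || echo "dangling: $dev"
    done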
[Yahoo-eng-team] [Bug 1813446] Re: impl_rabbit timedout
making as invalid as there appears to be sever different issues from you database to the kernel locking up which seam unrelated to nova. ** Changed in: nova Status: New => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1813446 Title: impl_rabbit timedout Status in OpenStack Compute (nova): Invalid Bug description: I have problems running instance after creation it says hosts not found it was because the mysql had gone so I applied this fix I found on the internet mysql --max_allowed_packet=25G !/usr/bin/env python2.7 import time import mysql.connector now I have this error , traced it: Error: Unable to create the server. Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. (HTTP 500) (Request-ID: req-165fa813-f601-4aed-b584-bf847ae764b7) also it shows in nova-api.log 2019-01-26 22:24:54.143 113216 ERROR nova.api.openstack.wsgi 2019-01-26 22:24:54.450 113216 INFO nova.api.openstack.wsgi [req-165fa813-f601-4aed-b584-bf847ae764b7 c48c372dabe14b24aeec0408d345f30d d159ec3920b94490a9a85ed183482acc - default default] HTTP exception thrown: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. [root@amer swift(keystone_admin)]# I have a VM of 8G 8 cores and 1 TB HDD pinging is successful from my host to the outside world: [root@amer swift(keystone_admin)]# ping google.com PING google.com (172.217.19.46) 56(84) bytes of data. 64 bytes from ham02s11-in-f46.1e100.net (172.217.19.46): icmp_seq=1 ttl=53 time=83.6 ms 64 bytes from ham02s11-in-f46.1e100.net (172.217.19.46): icmp_seq=2 ttl=53 time=87.8 ms I can ping myself also: [root@amer swift(keystone_admin)]# ping amer.example.com PING amer.example.com (192.168.43.110) 56(84) bytes of data. 64 bytes from amer.example.com (192.168.43.110): icmp_seq=1 ttl=64 time=0.044 ms 64 bytes from amer.example.com (192.168.43.110): icmp_seq=2 ttl=64 time=0.053 ms 64 bytes from amer.example.com (192.168.43.110): icmp_seq=3 ttl=64 time=0.046 ms 64 bytes from amer.example.com (192.168.43.110): icmp_seq=4 ttl=64 time=0.082 ms openstack compute service list gives: [root@amer swift(keystone_admin)]# openstack compute service list ++--+--+--+-+---++ | ID | Binary | Host | Zone | Status | State | Updated At | ++--+--+--+-+---++ | 4 | nova-conductor | amer.example.com | internal | enabled | up| 2019-01-27T03:43:17.00 | | 5 | nova-scheduler | amer.example.com | internal | enabled | up| 2019-01-27T03:43:24.00 | | 7 | nova-consoleauth | amer.example.com | internal | enabled | up| 2019-01-27T03:43:15.00 | | 8 | nova-compute | amer.example.com | nova | enabled | up| 2019-01-27T03:43:18.00 | ++--+--+--+-+---++ [root@amer swift(keystone_admin)]# To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1813446/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1802218] Re: An instance created by openstack rocky can't be remotely connected by xshell and putty, But SSH tools for Linux systems do. [server's host key did not match the sign
marking as invalid as this is likely an issue wiht the ssh client you are using ro the ssh server in the guest. the fact it works on linux but not windows/android suggest to me it might be related to the authentication methods or encryption algothims and key types supproted in the client/server and is likely not related to openstack/nova ** Changed in: nova Status: New => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1802218 Title: An instance created by openstack rocky can't be remotely connected by xshell and putty, But SSH tools for Linux systems do. [server's host key did not match the signature supplied] Status in OpenStack Compute (nova): Invalid Bug description: The mirror images adopted include: 1. Use xshell, putty, JuiceSSH cannot connect [http://cloud.centos.org/centos/7/images/CentOS-7-x86_64-GenericCloud-1805.qcow2] 2. Cannot connect with xshell, putty and JuiceSSH [centos7.4 made by myself, no problem with Ocata version built before] 3. With xshell, putty and JuiceSSH, you can connect [http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img] After the failure, attempts were made to recreate the SSH keys in the system, but to no avail. The order is as follows: [root@host-192-168-1-10 ~]# rm -f /etc/ssh/ssh_host_* [root@host-192-168-1-10 ~]# systemctl restart sshd.service It environment [root@all-in-one-202 ~]# rpm -qa | grep rocky centos-release-openstack-rocky-1-1.el7.centos.noarch [root@all-in-one-202 ~]# [root@all-in-one-202 ~]# rpm -qa | grep nova openstack-nova-conductor-18.0.2-1.el7.noarch openstack-nova-console-18.0.2-1.el7.noarch openstack-nova-api-18.0.2-1.el7.noarch python2-novaclient-11.0.0-1.el7.noarch openstack-nova-common-18.0.2-1.el7.noarch openstack-nova-placement-api-18.0.2-1.el7.noarch openstack-nova-compute-18.0.2-1.el7.noarch openstack-nova-novncproxy-18.0.2-1.el7.noarch openstack-nova-scheduler-18.0.2-1.el7.noarch python-nova-18.0.2-1.el7.noarch [root@all-in-one-202 ~]# [root@all-in-one-202 ~]# rpm -qa | egrep -i "libvirt|kvm" libvirt-daemon-driver-network-3.9.0-14.el7_5.8.x86_64 libvirt-daemon-driver-storage-scsi-3.9.0-14.el7_5.8.x86_64 libvirt-libs-3.9.0-14.el7_5.8.x86_64 libvirt-daemon-driver-storage-disk-3.9.0-14.el7_5.8.x86_64 libvirt-daemon-driver-nwfilter-3.9.0-14.el7_5.8.x86_64 libvirt-daemon-driver-storage-logical-3.9.0-14.el7_5.8.x86_64 libvirt-daemon-driver-nodedev-3.9.0-14.el7_5.8.x86_64 qemu-kvm-common-ev-2.10.0-21.el7_5.7.1.x86_64 libvirt-daemon-driver-storage-core-3.9.0-14.el7_5.8.x86_64 libvirt-daemon-driver-storage-rbd-3.9.0-14.el7_5.8.x86_64 libvirt-daemon-driver-storage-mpath-3.9.0-14.el7_5.8.x86_64 libvirt-daemon-driver-secret-3.9.0-14.el7_5.8.x86_64 qemu-kvm-ev-2.10.0-21.el7_5.7.1.x86_64 libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.8.x86_64 libvirt-daemon-driver-interface-3.9.0-14.el7_5.8.x86_64 libvirt-daemon-kvm-3.9.0-14.el7_5.8.x86_64 libvirt-daemon-driver-qemu-3.9.0-14.el7_5.8.x86_64 libvirt-daemon-driver-storage-3.9.0-14.el7_5.8.x86_64 libvirt-client-3.9.0-14.el7_5.8.x86_64 libvirt-daemon-3.9.0-14.el7_5.8.x86_64 libvirt-daemon-driver-storage-iscsi-3.9.0-14.el7_5.8.x86_64 libvirt-python-3.9.0-1.el7.x86_64 [root@all-in-one-202 ~]# [root@all-in-one-202 ~]# rpm -qa | grep neutron python2-neutron-lib-1.18.0-1.el7.noarch python2-neutronclient-6.9.1-1.el7.noarch openstack-neutron-common-13.0.1-2.el7.noarch openstack-neutron-ml2-13.0.1-2.el7.noarch openstack-neutron-13.0.1-2.el7.noarch 
python-neutron-13.0.1-2.el7.noarch openstack-neutron-linuxbridge-13.0.1-2.el7.noarch To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1802218/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
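To narrow down the client/server algorithm or key-type mismatch suggested in the triage comment above, a couple of standard OpenSSH checks can be run from a Linux client that does connect successfully (illustrative only; the username and address are placeholders):

    ssh -vvv centos@<instance-ip> 2>&1 | grep -i 'host key'   # shows the host key algorithms offered and negotiated
    ssh -Q key                                                # key types the local OpenSSH client supports

Comparing that output with what Xshell/PuTTY support usually shows whether the guest sshd only offers key types the Windows clients lack.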
[Yahoo-eng-team] [Bug 1815989] Re: OVS drops RARP packets by QEMU upon live-migration causes up to 40s ping pause in Rocky
** Changed in: os-vif Status: New => Invalid ** Changed in: nova Status: New => In Progress ** Changed in: nova Assignee: (unassigned) => sean mooney (sean-k-mooney) ** Changed in: nova Importance: Undecided => Medium -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1815989 Title: OVS drops RARP packets by QEMU upon live-migration causes up to 40s ping pause in Rocky Status in neutron: In Progress Status in OpenStack Compute (nova): In Progress Status in os-vif: Invalid Bug description: This issue is well known, and there were previous attempts to fix it, like this one https://bugs.launchpad.net/neutron/+bug/1414559 This issue still exists in Rocky and gets worse. In Rocky, nova compute, nova libvirt and neutron ovs agent all run inside containers. So far the only simply fix I have is to increase the number of RARP packets QEMU sends after live-migration from 5 to 10. To be complete, the nova change (not merged) proposed in the above mentioned activity does not work. I am creating this ticket hoping to get an up-to-date (for Rockey and onwards) expert advise on how to fix in nova-neutron. For the record, below are the time stamps in my test between neutron ovs agent "activating" the VM port and rarp packets seen by tcpdump on the compute. 10 RARP packets are sent by (recompiled) QEMU, 7 are seen by tcpdump, the 2nd last packet barely made through. openvswitch-agent.log: 2019-02-14 19:00:13.568 73453 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-26129036-b514-4fa0-a39f-a6b21de17bb9 - - - - -] Port 57d0c265-d971-404d-922d-963c8263e6eb updated. Details: {'profile': {}, 'network_qos_policy_id': None, 'qos_policy_id': None, 'allowed_address_pairs': [], 'admin_state_up': True, 'network_id': '1bf4b8e0-9299-485b-80b0-52e18e7b9b42', 'segmentation_id': 648, 'fixed_ips': [ {'subnet_id': 'b7c09e83-f16f-4d4e-a31a-e33a922c0bac', 'ip_address': '10.0.1.4'} ], 'device_owner': u'compute:nova', 'physical_network': u'physnet0', 'mac_address': 'fa:16:3e:de:af:47', 'device': u'57d0c265-d971-404d-922d-963c8263e6eb', 'port_security_enabled': True, 'port_id': '57d0c265-d971-404d-922d-963c8263e6eb', 'network_type': u'vlan', 'security_groups': [u'5f2175d7-c2c1-49fd-9d05-3a8de3846b9c']} 2019-02-14 19:00:13.568 73453 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-26129036-b514-4fa0-a39f-a6b21de17bb9 - - - - -] Assigning 4 as local vlan for net-id=1bf4b8e0-9299-485b-80b0-52e18e7b9b42 tcpdump for rarp packets: [root@overcloud-ovscompute-overcloud-0 nova]# tcpdump -i any rarp -nev tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes 19:00:10.788220 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:11.138216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:11.588216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:12.138217 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:12.788216 B fa:16:3e:de:af:47 ethertype Reverse ARP 
(0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:13.538216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 19:00:14.388320 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46 To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1815989/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1774252] Re: Resize confirm fails if nova-compute is restarted after resize
*** This bug is a duplicate of bug 1774249 *** https://bugs.launchpad.net/bugs/1774249 ** This bug has been marked a duplicate of bug 1774249 update_available_resource will raise DiskNotFound after resize but before confirm -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1774252 Title: Resize confirm fails if nova-compute is restarted after resize Status in OpenStack Compute (nova): New Bug description: Originally reported in RH bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1584315 Reproduced on OSP12 (Pike). After resizing an instance but before confirm, update_available_resource will fail on the source compute due to bug 1774249. If nova compute is restarted at this point before the resize is confirmed, the update_available_resource period task will never have succeeded, and therefore ResourceTracker's compute_nodes dict will not be populated at all. When confirm calls _delete_allocation_after_move() it will fail with ComputeHostNotFound because there is no entry for the current node in ResourceTracker. The error looks like: 2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [req-4f7d5d63-fc05-46ed-b505-41050d889752 09abbd4893bb45eea8fb1d5e40635339 d4483d13a6ef41b2ae575ddbd0c59141 - default default] [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] Setting instance vm_state to ERROR: ComputeHostNotFound: Compute host compute-1.localdomain could not be found. 2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] Traceback (most recent call last): 2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7445, in _error_out_instance_on_exception 2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] yield 2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3757, in _confirm_resize 2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] migration.source_node) 2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3790, in _delete_allocation_after_move 2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] cn_uuid = rt.get_node_uuid(nodename) 2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 155, in get_node_uuid 2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] raise exception.ComputeHostNotFound(host=nodename) 2018-05-30 13:42:19.239 1 ERROR nova.compute.manager [instance: 1374133a-2c08-4a8f-94f6-729d4e58d7e0] ComputeHostNotFound: Compute host compute-1.localdomain could not be found. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1774252/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1835822] [NEW] vms loose acess to config drive with CONF.force_config_drive=True after hard reboot
Public bug reported: The fix to bug https://bugs.launchpad.net/nova/+bug/1827492 https://review.opendev.org/#/c/659703/8 changed the behavior of nova.virt.configdrive.required_by to depend on instance.launched_at https://review.opendev.org/#/c/659703/8/nova/virt/configdrive.py@196 but did not reorder https://github.com/openstack/nova/blob/86524773b8cd3a52c98409c7ca183b4e1873e2b8/nova/compute/manager.py#L1757-L1758 as a result, when nova.compute.manager._update_instance_after_spawn is called, instance.launched_at is always set before we call nova.virt.configdrive.update_instance, and as a result instance.config_drive will always be set to false if not set via the api. this results in a vm that is spawned on a host with force_config_drive=True initially spawning with a config drive but losing it after a hard reboot. for any deployment that uses the config drive for vendor data or device role tagging because they do not deploy the metadata service, this is a regression as they cannot fall back to the metadata service. this also might cause issues for deployments that support the deprecated file injection api that is part of the v2.1 api, as the files are only stored in the config drive and are not part of the metadata endpoint. note: i have not checked if we auto-set instance.config_drive when you use file injection or not, so it may be unaffected; the breakage of the other supported use cases is enough to justify this bug. the fix is simple: just swap the order of https://github.com/openstack/nova/blob/86524773b8cd3a52c98409c7ca183b4e1873e2b8/nova/compute/manager.py#L1757-L1758 and then instances will have their instance.config_drive value set correctly when they first boot, and it will be sticky for the lifetime of the instance. ** Affects: nova Importance: Medium Assignee: sean mooney (sean-k-mooney) Status: Confirmed ** Tags: libvirt -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1835822 Title: vms loose acess to config drive with CONF.force_config_drive=True after hard reboot Status in OpenStack Compute (nova): Confirmed Bug description: The fix to bug https://bugs.launchpad.net/nova/+bug/1827492 https://review.opendev.org/#/c/659703/8 changed the behavior of nova.virt.configdrive.required_by to depend on instance.launched_at https://review.opendev.org/#/c/659703/8/nova/virt/configdrive.py@196 but did not reorder https://github.com/openstack/nova/blob/86524773b8cd3a52c98409c7ca183b4e1873e2b8/nova/compute/manager.py#L1757-L1758 as a result, when nova.compute.manager._update_instance_after_spawn is called, instance.launched_at is always set before we call nova.virt.configdrive.update_instance, and as a result instance.config_drive will always be set to false if not set via the api. this results in a vm that is spawned on a host with force_config_drive=True initially spawning with a config drive but losing it after a hard reboot. for any deployment that uses the config drive for vendor data or device role tagging because they do not deploy the metadata service, this is a regression as they cannot fall back to the metadata service. 
this also might cause issues for deployments that support the deprecated file injection api that is part of the v2.1 api, as the files are only stored in the config drive and are not part of the metadata endpoint. note: i have not checked if we auto-set instance.config_drive when you use file injection or not, so it may be unaffected; the breakage of the other supported use cases is enough to justify this bug. the fix is simple: just swap the order of https://github.com/openstack/nova/blob/86524773b8cd3a52c98409c7ca183b4e1873e2b8/nova/compute/manager.py#L1757-L1758 and then instances will have their instance.config_drive value set correctly when they first boot, and it will be sticky for the lifetime of the instance. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1835822/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
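A minimal sketch of the reordering proposed above, inside nova.compute.manager._update_instance_after_spawn; the method and call names come from the links in the report, while the timeutils call is an assumption about how launched_at is populated, so treat this as an illustration rather than the actual patch:

    # proposed order: decide config_drive before launched_at is set
    configdrive.update_instance(instance)        # launched_at is still unset here, so required_by()
                                                 # can still force the config drive on for new instances
    instance.launched_at = timeutils.utcnow()    # assumption: launched_at is stamped with "now" afterwards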
[Yahoo-eng-team] [Bug 1836105] Re: Instance does not start - Error during following call to agent: ovs-vsctl
can you check that ovs-dpdk is actually working on the host? if you do "ps aux | grep ovs" do you see ovs-vswitchd or ovsdb-server running? if so, please run "ovs-vsctl show" and "ovs-ofctl dump-flows br-int" to confirm your ovs is actually functional via the command line. i'm not familiar with the ubuntu charms for ovs, but it's possible that they configured ovs to listen on tcp only; if that is the case then you either need to configure it to work with the clis too, or configure os-vif and neutron to use tcp. i don't think os-vif supports tcp in queens however. this looks like a charms issue with how the charm deployed ovs-dpdk, not a nova bug, so we probably should re-target this bug. (the checks and config knobs are summarised after the quoted log below.) ** Changed in: nova Importance: Undecided => Low ** Changed in: nova Status: New => Incomplete ** Also affects: charm-neutron-openvswitch Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1836105 Title: Instance does not start - Error during following call to agent: ovs-vsctl Status in OpenStack neutron-openvswitch charm: New Status in OpenStack Compute (nova): Incomplete Bug description: This is Openstack Queens on Bionic. The main difference from templates is no neutron-gateway (provider network only) and use of DPDK. There are other issues under investigation about dpdk and checksumming but they don't seem related to this at first look. https://bugs.launchpad.net/ubuntu/+source/dpdk/+bug/1833713
- Instances cannot be started once they are shutdown
- It's happening to every instance after the problem first appeared
- It's happening on different hosts
- Any try to start will timeout with errors in nova log (below)
- New instances can be created and they boot ok
- Nothing new appears in openvswitch logs with normal debugging level
- Nothing new appears on libvirt logs for the instance (last status is from last boot)
2019-07-10 13:40:42.013 19975 ERROR oslo_messaging.rpc.server InternalError: Failure running os_vif plugin plug method: Failed to plug VIF VIFVHostUser(active=True,address=fa:16:3e:8e:8f:9b,has_traffic_filtering=False,id=ab6225f4-1cd8-43c7-8777-52c99ae80f67,mode='server',network=Network (d8249c3d-03d9-44ac-8eae-fa967993c73d),path='/run/libvirt-vhost-user/vhuab6225f4-1c',plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=True,vif_name='vhuab6225f4-1c'). 
Got error: Error during following call to agent: ['ovs-vsctl', '-- timeout=120', '--', '--if-exists', 'del-port', u'vhuab6225f4-1c', '--', 'add-port', u'br-int', u'vhuab6225f4-1c', '--', 'set', 'Interface', u'vhuab6225f4-1c', u'external-ids:iface- id=ab6225f4-1cd8-43c7-8777-52c99ae80f67', 'external-ids:iface- status=active', u'external-ids:attached-mac=fa:16:3e:8e:8f:9b', u 'external-ids:vm-uuid=5e46868f-8a52-4d70-b08a-9a320dc9821b', 'type=dpdkvhostuserclient', u'options:vhost-server-path=/run/libvirt- vhost-user/vhuab6225f4-1c'] 2019-07-10 13:43:05.511 19975 ERROR os_vif AgentError: Error during following call to agent: ['ovs-vsctl', '--timeout=120', '--', '--if- exists', 'del-port', u'vhuab6225f4-1c', '--', 'add-port', u'br-int', u'vhuab6225f4-1c', '--', 'set', 'Interface', u'vhuab6225f4-1c', u 'external-ids:iface-id=ab6225f4-1cd8-43c7-8777-52c99ae80f67', 'external-ids:iface-status=active', u'external-ids:attached- mac=fa:16:3e:8e:8f:9b', u'external-ids:vm-uuid=5e46868f-8a52-4d70 -b08a-9a320dc9821b', 'type=dpdkvhostuserclient', u'options:vhost- server-path=/run/libvirt-vhost-user/vhuab6225f4-1c'] Complete logs will follow. To manage notifications about this bug go to: https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1836105/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
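Recap of the checks suggested in the comment above, plus the config knobs for a tcp-only ovsdb; note the ovsdb_connection option names are from newer releases and, as stated above, may not be available on Queens os-vif, so treat them as an assumption:

    ps aux | grep -E 'ovs-vswitchd|ovsdb-server'    # is ovs-dpdk actually running?
    ovs-vsctl show
    ovs-ofctl dump-flows br-int

    # if ovsdb only listens on tcp:
    # neutron openvswitch agent:   [ovs]        ovsdb_connection = tcp:127.0.0.1:6640
    # os-vif on the compute node:  [os_vif_ovs] ovsdb_connection = tcp:127.0.0.1:6640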
[Yahoo-eng-team] [Bug 1837252] Re: IFLA_BR_AGEING_TIME of 0 causes flooding across bridges
triaging as high as the flooding could lead to network disruption to guests on multiple hosts. i have root caused this: it is a result of combining the code into a single shared codepath between the ovs and linux bridge plugins. for ovs hybrid plug we set the ageing to 0 to prevent packet loss during live migration https://github.com/openstack/os-vif/commit/fa4ff64b86e6e1b6399f7250eadbee9775c22d32#diff-f55bc78ffb4c1bbf81b88bf68673 however this is not valid for linux bridge in general. https://github.com/openstack/os-vif/commit/1f6fed6a69e9fd386e421f3cacae97c11cdd7c75#diff-010d1833da7ca175fffc8c41a38497c2 which replaced the use of brctl in the linux bridge driver, reused the common code i introduced in https://github.com/openstack/os-vif/commit/5027ce833c6fccaa80b5ddc8544d262c0bf99dbd#diff-cec1a2ac6413663c344b607129c39fab and as a result it picked up the ovs ageing code, which was not intentional. i'll fix this shortly and backport it. (an interim workaround is sketched after the quoted report below.) ** Changed in: os-vif Importance: Undecided => High ** Changed in: os-vif Status: New => Confirmed ** Changed in: os-vif Assignee: (unassigned) => sean mooney (sean-k-mooney) ** Changed in: nova Status: New => Invalid ** Changed in: neutron Status: Incomplete => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1837252 Title: IFLA_BR_AGEING_TIME of 0 causes flooding across bridges Status in neutron: Invalid Status in OpenStack Compute (nova): Invalid Status in os-vif: Confirmed Bug description: Release: OpenStack Stein Driver: LinuxBridge Using Stein w/ the LinuxBridge mech driver/agent, we have found that traffic is being flooded across bridges. Using tcpdump inside an instance, you can see unicast traffic for other instances. We have confirmed the macs table shows the aging timer set to 0 for permanent entries, and the bridge is NOT learning new MACs: root@lab-compute01:~# brctl showmacs brqd0084ac0-f7 port no  mac addr  is local? 
ageing timer 5 24:be:05:a3:1f:e1 yes0.00 5 24:be:05:a3:1f:e1 yes0.00 1 fe:16:3e:02:62:18 yes0.00 1 fe:16:3e:02:62:18 yes0.00 7 fe:16:3e:07:65:47 yes0.00 7 fe:16:3e:07:65:47 yes0.00 4 fe:16:3e:1d:d6:33 yes0.00 4 fe:16:3e:1d:d6:33 yes0.00 9 fe:16:3e:2b:2f:f0 yes0.00 9 fe:16:3e:2b:2f:f0 yes0.00 8 fe:16:3e:3c:42:64 yes0.00 8 fe:16:3e:3c:42:64 yes0.00 10 fe:16:3e:5c:a6:6c yes0.00 10 fe:16:3e:5c:a6:6c yes0.00 2 fe:16:3e:86:9c:dd yes0.00 2 fe:16:3e:86:9c:dd yes0.00 6 fe:16:3e:91:9b:45 yes0.00 6 fe:16:3e:91:9b:45 yes0.00 11 fe:16:3e:b3:30:00 yes0.00 11 fe:16:3e:b3:30:00 yes0.00 3 fe:16:3e:dc:c3:3e yes0.00 3 fe:16:3e:dc:c3:3e yes0.00 root@lab-compute01:~# bridge fdb show | grep brqd0084ac0-f7 01:00:5e:00:00:01 dev brqd0084ac0-f7 self permanent fe:16:3e:02:62:18 dev tap74af38f9-2e master brqd0084ac0-f7 permanent fe:16:3e:02:62:18 dev tap74af38f9-2e vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:86:9c:dd dev tapb00b3c18-b3 master brqd0084ac0-f7 permanent fe:16:3e:86:9c:dd dev tapb00b3c18-b3 vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:dc:c3:3e dev tap7284d235-2b master brqd0084ac0-f7 permanent fe:16:3e:dc:c3:3e dev tap7284d235-2b vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:1d:d6:33 dev tapbeb9441a-99 vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:1d:d6:33 dev tapbeb9441a-99 master brqd0084ac0-f7 permanent 24:be:05:a3:1f:e1 dev eno1.102 vlan 1 master brqd0084ac0-f7 permanent 24:be:05:a3:1f:e1 dev eno1.102 master brqd0084ac0-f7 permanent fe:16:3e:91:9b:45 dev tapc8ad2cec-90 master brqd0084ac0-f7 permanent fe:16:3e:91:9b:45 dev tapc8ad2cec-90 vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:07:65:47 dev tap86e2c412-24 master brqd0084ac0-f7 permanent fe:16:3e:07:65:47 dev tap86e2c412-24 vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:3c:42:64 dev tap37bcb70e-9e master brqd0084ac0-f7 permanent fe:16:3e:3c:42:64 dev tap37bcb70e-9e vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:2b:2f:f0 dev tap40f6be7c-2d vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:2b:2f:f0 dev tap40f6be7c-2d master brqd0084ac0-f7 permanent fe:16:3e:b3:30:00 dev tap6548bacb-c0 vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:b3:30:00 dev tap6548bacb-c0 master brqd0084ac0-f7 permanent fe:16:3e:5c:a6:6c dev
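As an interim workaround until the os-vif fix mentioned above lands, the ageing time can be restored by hand on an affected bridge; the bridge name is taken from the report and 300 seconds is the usual kernel default:

    brctl setageing brqd0084ac0-f7 300

This re-enables MAC learning on the bridge immediately; newly plugged ports will need the same treatment until the fixed os-vif is deployed.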
[Yahoo-eng-team] [Bug 1837252] Re: IFLA_BR_AGEING_TIME of 0 causes flooding across bridges
** Also affects: os-vif/stein Importance: Undecided Status: New ** Also affects: os-vif/trunk Importance: High Assignee: sean mooney (sean-k-mooney) Status: In Progress ** Changed in: os-vif/stein Status: New => Confirmed ** Changed in: os-vif/stein Assignee: (unassigned) => sean mooney (sean-k-mooney) ** Changed in: os-vif/stein Importance: Undecided => High -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1837252 Title: IFLA_BR_AGEING_TIME of 0 causes flooding across bridges Status in neutron: Invalid Status in OpenStack Compute (nova): Invalid Status in os-vif: In Progress Status in os-vif stein series: Confirmed Status in os-vif trunk series: In Progress Status in OpenStack Security Advisory: Incomplete Bug description: Release: OpenStack Stein Driver: LinuxBridge Using Stein w/ the LinuxBridge mech driver/agent, we have found that traffic is being flooded across bridges. Using tcpdump inside an instance, you can see unicast traffic for other instances. We have confirmed the macs table shows the aging timer set to 0 for permanent entries, and the bridge is NOT learning new MACs: root@lab-compute01:~# brctl showmacs brqd0084ac0-f7 port no mac addris local? ageing timer 5 24:be:05:a3:1f:e1 yes0.00 5 24:be:05:a3:1f:e1 yes0.00 1 fe:16:3e:02:62:18 yes0.00 1 fe:16:3e:02:62:18 yes0.00 7 fe:16:3e:07:65:47 yes0.00 7 fe:16:3e:07:65:47 yes0.00 4 fe:16:3e:1d:d6:33 yes0.00 4 fe:16:3e:1d:d6:33 yes0.00 9 fe:16:3e:2b:2f:f0 yes0.00 9 fe:16:3e:2b:2f:f0 yes0.00 8 fe:16:3e:3c:42:64 yes0.00 8 fe:16:3e:3c:42:64 yes0.00 10 fe:16:3e:5c:a6:6c yes0.00 10 fe:16:3e:5c:a6:6c yes0.00 2 fe:16:3e:86:9c:dd yes0.00 2 fe:16:3e:86:9c:dd yes0.00 6 fe:16:3e:91:9b:45 yes0.00 6 fe:16:3e:91:9b:45 yes0.00 11 fe:16:3e:b3:30:00 yes0.00 11 fe:16:3e:b3:30:00 yes0.00 3 fe:16:3e:dc:c3:3e yes0.00 3 fe:16:3e:dc:c3:3e yes0.00 root@lab-compute01:~# bridge fdb show | grep brqd0084ac0-f7 01:00:5e:00:00:01 dev brqd0084ac0-f7 self permanent fe:16:3e:02:62:18 dev tap74af38f9-2e master brqd0084ac0-f7 permanent fe:16:3e:02:62:18 dev tap74af38f9-2e vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:86:9c:dd dev tapb00b3c18-b3 master brqd0084ac0-f7 permanent fe:16:3e:86:9c:dd dev tapb00b3c18-b3 vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:dc:c3:3e dev tap7284d235-2b master brqd0084ac0-f7 permanent fe:16:3e:dc:c3:3e dev tap7284d235-2b vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:1d:d6:33 dev tapbeb9441a-99 vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:1d:d6:33 dev tapbeb9441a-99 master brqd0084ac0-f7 permanent 24:be:05:a3:1f:e1 dev eno1.102 vlan 1 master brqd0084ac0-f7 permanent 24:be:05:a3:1f:e1 dev eno1.102 master brqd0084ac0-f7 permanent fe:16:3e:91:9b:45 dev tapc8ad2cec-90 master brqd0084ac0-f7 permanent fe:16:3e:91:9b:45 dev tapc8ad2cec-90 vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:07:65:47 dev tap86e2c412-24 master brqd0084ac0-f7 permanent fe:16:3e:07:65:47 dev tap86e2c412-24 vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:3c:42:64 dev tap37bcb70e-9e master brqd0084ac0-f7 permanent fe:16:3e:3c:42:64 dev tap37bcb70e-9e vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:2b:2f:f0 dev tap40f6be7c-2d vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:2b:2f:f0 dev tap40f6be7c-2d master brqd0084ac0-f7 permanent fe:16:3e:b3:30:00 dev tap6548bacb-c0 vlan 1 master brqd0084ac0-f7 permanent fe:16:3e:b3:30:00 dev tap6548bacb-c0 master brqd0084ac0-f7 permanent fe:16:3e:5c:a6:6c dev tap61107236-1e vlan 1 master brqd0084ac0-f7 
permanent fe:16:3e:5c:a6:6c dev tap61107236-1e master brqd0084ac0-f7 permanent The ageing time for the bridge is set to 0: root@lab-compute01:~# brctl showstp brqd0084ac0-f7 brqd0084ac0-f7 bridge id8000.24be05a31fe1 designated root 8000.24be05a31fe1 root port 0path cost 0 max age20.00 bridge max age20.00 hello time 2.00 bridge hello time 2.00 forward delay 0.00 bridge forward delay 0.00 ageing time 0.00 hello timer
[Yahoo-eng-team] [Bug 1825584] Re: eventlet monkey-patching breaks AMQP heartbeat on uWSGI
** Also affects: nova/train Importance: Undecided Status: New ** Also affects: nova/stein Importance: Undecided Status: New ** Also affects: nova/ussuri Importance: Undecided Status: New ** Changed in: nova/ussuri Status: New => Fix Released ** Changed in: nova/train Status: New => In Progress ** Changed in: nova/train Assignee: (unassigned) => sean mooney (sean-k-mooney) ** Changed in: nova/ussuri Assignee: (unassigned) => sean mooney (sean-k-mooney) ** Changed in: nova/stein Assignee: (unassigned) => sean mooney (sean-k-mooney) ** Changed in: nova/stein Status: New => In Progress ** Changed in: nova Importance: Undecided => Low ** Changed in: nova/stein Importance: Undecided => Low ** Changed in: nova/train Importance: Undecided => Low ** Changed in: nova/ussuri Importance: Undecided => Low -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1825584 Title: eventlet monkey-patching breaks AMQP heartbeat on uWSGI Status in OpenStack Compute (nova): Fix Released Status in OpenStack Compute (nova) stein series: In Progress Status in OpenStack Compute (nova) train series: In Progress Status in OpenStack Compute (nova) ussuri series: Fix Released Bug description: Stein nova-api running under uWSGI presents an AMQP issue. The first API call that requires RPC creates an AMQP connection and successfully completes. Normally regular heartbeats would be sent from this point on, to maintain the connection. This is not happening. After a few minutes, the AMQP server (rabbitmq, in my case) notices that there have been no heartbeats, and drops the connection. A later nova API call that requires RPC tries to use the old connection, and throws a "connection reset by peer" exception and the API call fails. A mailing-list response suggests that this is affecting mod_wsgi also: http://lists.openstack.org/pipermail/openstack- discuss/2019-April/005310.html I've discovered that this problem seems to be caused by eventlet monkey-patching, which was introduced in: https://github.com/openstack/nova/commit/23ba1c690652832c655d57476630f02c268c87ae It was later rearranged in: https://github.com/openstack/nova/commit/3c5e2b0e9fac985294a949852bb8c83d4ed77e04 but this problem remains. If I comment out the import of nova.monkey_patch in nova/api/openstack/__init__.py the problem goes away. Seems that eventlet monkey-patching and uWSGI are not getting along for some reason... To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1825584/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1879878] Re: VM become Error after confirming resize with Error info CPUUnpinningInvalid on source node
http://paste.openstack.org/show/795679/ i was able to reproduce this once on master but not reliably yet, so i'm moving this to confirmed. we also have a downstream report of this on train https://bugzilla.redhat.com/show_bug.cgi?id=1850400 so i'll add that to the affected versions. i am setting the importance to medium as this seems to be quite hard to trigger: all but 1 of the 10-12 attempts i made failed to reproduce it, so i think this will be hit rarely. when this happens the vm is left in a running state on the target host. stopping the vm and starting it restores it to an active state. ** Bug watch added: Red Hat Bugzilla #1850400 https://bugzilla.redhat.com/show_bug.cgi?id=1850400 ** Changed in: nova Importance: Undecided => Medium ** Changed in: nova Status: Incomplete => Confirmed ** Also affects: nova/train Importance: Undecided Status: New ** Also affects: nova/ussuri Importance: Undecided Status: New ** Changed in: nova/ussuri Status: New => Confirmed ** Changed in: nova/train Status: New => Confirmed -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1879878 Title: VM become Error after confirming resize with Error info CPUUnpinningInvalid on source node Status in OpenStack Compute (nova): Confirmed Status in OpenStack Compute (nova) train series: Confirmed Status in OpenStack Compute (nova) ussuri series: Confirmed Bug description: Description === In my environment, it will take some time to clean up the VM on the source node while confirming a resize. During the confirm resize process, the periodic task update_available_resource may update resource usage at the same time. It may cause an ERROR like: CPUUnpinningInvalid: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set [] during the confirm resize process. Steps to reproduce == * Set /etc/nova/nova.conf "update_resources_interval" to a small value, let's say 30 seconds, on compute nodes. This step will increase the probability of error. * Create a "dedicated" VM, the flavor can be:
| Property                   | Value |
| OS-FLV-DISABLED:disabled   | False |
| OS-FLV-EXT-DATA:ephemeral  | 0 |
| disk                       | 80 |
| extra_specs                | {"hw:cpu_policy": "dedicated"} |
| id                         | 2be0f830-c215-4018-a96a-bee3e60b5eb1 |
| name                       | 4vcpu.4mem.80ssd.0eph.numa |
| os-flavor-access:is_public | True |
| ram                        | 4096 |
| rxtx_factor                | 1.0 |
| swap                       | |
| vcpus                      | 4 |
* Resize the VM with a new flavor to another node. * Confirm the resize. Make sure it will take some time to undefine the vm on the source node; 30 seconds will lead to inevitable results. * Then you will see the ERROR notice on the dashboard, and the VM becomes ERROR. Expected result === VM resized successfully, vm state is active Actual result = * VM becomes ERROR * On the dashboard you can see this notice: Please try again later [Error: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []]. Environment === 1. Exact version of OpenStack you are running. Newton version with patch https://review.opendev.org/#/c/641806/21 I am sure it will happen on other newer versions with https://review.opendev.org/#/c/641806/21 such as Train and Ussuri 2. Which hypervisor did you use? Libvirt + KVM 3. Which storage type did you use? local disk 4. Which networking type did you use? 
Neutron with OpenVSwitch Logs & Configs == ERROR log on source node 2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [req-364606bb-9fa6-41db-a20e-6df9ff779934 b0887a73f3c1441686bf78944ee284d0 95262f1f45f14170b91cd8054bb36512 - - -] [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] Setting instance vm_state to ERROR 2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] Traceback (most recent call last): 2020-05-15 10:11:12.324 425843 ERROR nova.compute.manager [instance: 993138d6-4b80-4b19-81c1-a16dbc6e196c] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6661, in _error_out_instance_on_
[Yahoo-eng-team] [Bug 1887377] [NEW] nova does not loadbalance asignmnet of resources on a host based on avaiablity of pci device, hugepages or pcpus.
Public bug reported: Nova has supported hugepages, cpu pinning and pci numa affinity for a very long time. since its introduction the advice has always been to create a flavor that mimics your typical hardware topology, i.e. if all your compute hosts have 2 numa nodes then you should create flavors that request 2 numa nodes. for a long time operators have ignored this advice and continued to create single numa node flavors, citing that after 5+ years of hardware vendors working with VNF vendors to make their products numa aware, vnfs often still do not optimize properly for a multi numa environment. as a result many operators still deploy single numa vms, although that is becoming less common over time. when you deploy a vm with a single numa node today we more or less iterate over the host numa nodes in order and assign the vm to the first numa node where it fits. on a host without any pci devices whitelisted for openstack management this behaviour results in numa nodes being filled linearly from numa 0 to numa n. that means if a host had 100G of hugepages on both numa node 0 and 1 and you scheduled 101 1G single numa vms to the host, 100 vms would spawn on numa 0 and 1 vm would spawn on numa node 1. that means that the first 100 vms would all contend for cpu resources on the first numa node while the last vm had all of the second numa node to its own use. the correct behavior would be for nova to round-robin assign the vms, attempting to keep the resource availability balanced. this will maximise performance for individual vms while pessimising the scheduling of large vms on a host. to this end a new numa balancing config option (unset, pack or spread) should be added and we should sort numa nodes in descending (spread) or ascending (pack) order based on pMEM, pCPUs, mempages and pci devices in that sequence. in a future release when numa is in placement this sorting will need to be done in a weigher that sorts the allocation candidates based on the same pack/spread criteria. i am filing this as a bug not a feature as this will have a significant impact for existing deployments that either expected https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/reserve-numa-with-pci.html to implement this logic already or who do not follow our existing guidance on creating flavors that align to the host topology. ** Affects: nova Importance: Undecided Assignee: sean mooney (sean-k-mooney) Status: New ** Tags: numa -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1887377 Title: nova does not loadbalance asignmnet of resources on a host based on avaiablity of pci device, hugepages or pcpus. Status in OpenStack Compute (nova): New Bug description: Nova has supported hugepages, cpu pinning and pci numa affinity for a very long time. since its introduction the advice has always been to create a flavor that mimics your typical hardware topology, i.e. if all your compute hosts have 2 numa nodes then you should create flavors that request 2 numa nodes. for a long time operators have ignored this advice and continued to create single numa node flavors, citing that after 5+ years of hardware vendors working with VNF vendors to make their products numa aware, vnfs often still do not optimize properly for a multi numa environment. as a result many operators still deploy single numa vms, although that is becoming less common over time. 
When you deploy a VM with a single NUMA node today we more or less iterate over the host NUMA nodes in order and assign the VM to the first NUMA node where it fits. On a host without any PCI devices whitelisted for OpenStack management this behaviour results in NUMA nodes being filled linearly from NUMA 0 to NUMA n. That means if a host had 100G of hugepages on both NUMA nodes 0 and 1 and you scheduled 101 1G single NUMA VMs to the host, 100 VMs would spawn on NUMA node 0 and 1 VM would spawn on NUMA node 1. The first 100 VMs would all contend for CPU resources on the first NUMA node while the last VM had the whole second NUMA node to itself. The correct behaviour would be for nova to assign the VMs round-robin, attempting to keep resource availability balanced. This will maximise performance for individual VMs while pessimising the scheduling of large VMs on a host. To this end a new NUMA balancing config option (unset, pack or spread) should be added and we should sort NUMA nodes in descending (spread) or ascending (pack) order based on pMEM, pCPUs, mempages and PCI devices, in that sequence. In a future release, when NUMA is modelled in placement, this sorting will need to be done in a weigher that sorts the allocation candidates based on the same pack/spread criteria. I am filing this as a bug rather than a feature as this will have a significant impact for existing deployments that either expected https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/reserve-numa-with-pci.html to implement this logic already or do not follow our existing guidance on creating flavors that align to the host topology.
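To make the proposed pack/spread option concrete, here is a minimal sketch; it is not nova code, and the cell fields, option values and sort keys are assumptions made for the illustration:

    import collections

    # Simplified stand-in for a host NUMA cell; the field names are invented
    # for this example, not nova's real NUMACell object.
    Cell = collections.namedtuple('Cell', ['id', 'free_pcpus', 'free_pages', 'free_pci_devs'])

    def order_cells(cells, policy=None):
        # policy models the proposed config option: None (unset), 'pack' or 'spread'.
        if policy not in ('pack', 'spread'):
            return list(cells)  # today's behaviour: host order, so node 0 fills first
        # 'spread' tries the emptiest node first (descending free resources),
        # 'pack' tries the fullest node that can still fit the VM (ascending).
        return sorted(
            cells,
            key=lambda c: (c.free_pcpus, c.free_pages, c.free_pci_devs),
            reverse=(policy == 'spread'))

    cells = [Cell(0, 2, 10, 0), Cell(1, 30, 90, 2)]
    print([c.id for c in order_cells(cells, 'spread')])  # -> [1, 0]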
[Yahoo-eng-team] [Bug 1885558] Re: sriov: instance with macvtap vnic_type live migration failed
** Changed in: nova Importance: Undecided => Medium ** Also affects: nova/train Importance: Undecided Status: New ** Also affects: nova/ussuri Importance: Undecided Status: New ** Changed in: nova/train Importance: Undecided => Medium ** Changed in: nova/train Status: New => Triaged ** Changed in: nova/ussuri Importance: Undecided => Medium ** Changed in: nova/ussuri Status: New => Triaged ** Tags added: live-migration pci -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1885558

Title: sriov: instance with macvtap vnic_type live migration failed

Status in OpenStack Compute (nova): In Progress Status in OpenStack Compute (nova) train series: Triaged Status in OpenStack Compute (nova) ussuri series: Triaged

Bug description: Live migration of an instance with a macvtap vnic_type port failed. My env configuration follows the document: https://docs.openstack.org/neutron/latest/admin/config-sriov.html

# VFs on source compute
84:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
84:10.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
84:10.3 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
84:10.5 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
84:10.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
84:11.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
84:11.3 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
84:11.5 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
84:11.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)

# VFs on dest compute
81:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
81:10.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
81:10.3 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
81:10.5 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
81:10.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
81:11.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
81:11.3 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
81:11.5 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
81:11.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)

# port create CLI
openstack port create --network $net_id --vnic-type macvtap macvtap01

# boot instance with macvtap port
nova boot --flavor 2C4G --nic port-id=$(neutron port-show macvtap01 -f value -c id) --block-device source=image,id=${image},dest=volume,size=100,shutdown=preserve,bootindex=0 --availability-zone nova:ctrl01.srvio.dev vm01_ctrl_macvtap

# live migration failed
nova live-migration vm01_ctrl_macvtap

Source compute node log: /var/log/nova-compute.log 2020-06-28 03:51:52.446 806489 DEBUG nova.virt.libvirt.vif [-] vif_type=hw_veb
instance=Instance(access_ip_v4=None,access_ip_v6=None,architecture=None,auto_disk_config=False,availability_zone='nova',cell_name=None,cleaned=False,config_drive='True',created_at=2020-06-23T03:35:55Z,default_ephemeral_device=None,default_swap_device=None,deleted=False,deleted_at=None,device_metadata=,disable_terminate=False,display_description=None,display_name='vm01_ctrl_macvtap',ec2_ids=,ephemeral_gb=0,ephemeral_key_uuid=None,fault=,flavor=Flavor(1),host='ctrl01.srvio.dev',hostname='vm01-ctrl-macvtap',id=91,image_ref='',info_cache=InstanceInfoCache,instance_type_id=1,kernel_id='',key_data=None,key_name=None,keypairs=,launch_index=0,launched_at=2020-06-23T03:36:15Z,launched_on='ctrl01.srvio.dev',locked=False,locked_by=None,memory_mb=4096,metadata={},migration_context=,new_flavor=None,node='ctrl01.srvio.dev',numa_topology=None,old_flavor=None,os_type=None,pci_devices=,pci_requests=InstancePCIRequests,power_state=1,progress=0,project_id='b36d8472f55e4fe88f8af98fe2c0ad8c',ramdisk_id='',reservation_id='r-j7a6v3fv',root_device_name='/dev/vda',root_gb=50,security_groups=SecurityGroupList,services=,shutdown_terminate=False,system_metadata={boot_roles='reader,member,admin',image_base_image_ref='',image_container_format='bare',image_disk_format='raw',image_hw_qemu_guest_agent='yes',image_min_disk='50',image_min_ram='0',owner_project_name='admin',owner_user_name='admin'},tags=,task_state='migrating',terminat
[Yahoo-eng-team] [Bug 1889633] Re: Pinned instance with thread policy can consume VCPU
This has a significant upgrade impact so I think it is important to fix and backport. I have reproduced this locally too so moving to triaged. ** Changed in: nova Importance: Undecided => High ** Changed in: nova Status: New => Triaged ** Also affects: nova/ussuri Importance: Undecided Status: New ** Also affects: nova/train Importance: Undecided Status: New ** Changed in: nova/train Importance: Undecided => High ** Changed in: nova/train Status: New => Triaged ** Changed in: nova/ussuri Importance: Undecided => High ** Changed in: nova/ussuri Status: New => Triaged -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1889633

Title: Pinned instance with thread policy can consume VCPU

Status in OpenStack Compute (nova): Triaged Status in OpenStack Compute (nova) train series: Triaged Status in OpenStack Compute (nova) ussuri series: Triaged

Bug description: In Train, we introduced the concept of the 'PCPU' resource type to track pinned instance CPU usage. The '[compute] cpu_dedicated_set' is used to indicate which host cores should be used by pinned instances and, once this config option was set, nova would start reporting 'PCPU' resource types in addition to (or entirely instead of, if 'cpu_shared_set' was unset) 'VCPU'. Requests for pinned instances (via the 'hw:cpu_policy=dedicated' flavor extra spec or equivalent image metadata property) would result in a query for 'PCPU' inventory rather than 'VCPU', as previously done.

We anticipated some upgrade issues with this change, whereby there could be a period during an upgrade in which some hosts would have the new configuration, meaning they'd be reporting PCPU, but the remainder would still be on legacy config and therefore would continue reporting just VCPU. An instance could be reasonably expected to land on any host, but since only the hosts with the new configuration were reporting 'PCPU' inventory and the 'hw:cpu_policy=dedicated' extra spec was resulting in a request for 'PCPU', the hosts with legacy configuration would never be consumed. We worked around this issue by adding support for a fallback placement query, enabled by default, which would make a second request using 'VCPU' inventory instead of 'PCPU'. The idea behind this was that the hosts with 'PCPU' inventory would be preferred, meaning we'd only try the 'VCPU' allocation if the preferred path failed.

Crucially, we anticipated that if a host with new style configuration was picked up by this second 'VCPU' query, an instance would never actually be able to build there. This is because the new-style configuration would be reflected in the 'numa_topology' blob of the 'ComputeNode' object, specifically via the 'cpuset' (for cores allocated to 'VCPU') and 'pcpuset' (for cores allocated to 'PCPU') fields. With new-style configuration, both of these are set to unique values. If the scheduler had determined that there wasn't enough 'PCPU' inventory available for the instance, that would implicitly mean there weren't enough of the cores listed in the 'pcpuset' field still available.

Turns out there's a gap in this thinking: thread policies. The 'isolate' CPU thread policy previously meant "give me a host with no hyperthreads, else a host with hyperthreads but mark the thread siblings of the cores used by the instance as reserved". This didn't translate to a new 'PCPU' world where we needed to know how many cores we were consuming up front before landing on the host.
To work around this, we removed support for the latter case and instead relied on a trait, 'HW_CPU_HYPERTHREADING', to indicate whether a host had hyperthread support or not. Using the 'isolate' policy meant that trait could not be defined on the host, or the trait was "forbidden". The gap comes via a combination of this trait request and the fallback query. If we request the isolate thread policy, hosts with new-style configuration and sufficient PCPU inventory would nonetheless be rejected if they reported the 'HW_CPU_HYPERTHREADING' trait. However, these could get picked up in the fallback query and the instance would not fail to build on the host because of lack of 'PCPU' inventory. This means we end up with a pinned instance on a host using new-style configuration that is consuming 'VCPU' inventory. Boo.

# Steps to reproduce
1. Using a host with hyperthreading support enabled, configure both '[compute] cpu_dedicated_set' and '[compute] cpu_shared_set'
2. Boot an instance with the 'hw:cpu_thread_policy=isolate' extra spec.

# Expected result
Instance should not boot since the host has hyperthreads.

# Actual result
Instance boots.

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1889633/+
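For concreteness, the reproduction steps above correspond to roughly the following setup; the core ranges and flavor name are placeholders chosen for this illustration, not values from the report:

    # nova.conf on a hyperthreading-enabled compute node
    [compute]
    cpu_dedicated_set = 2-7
    cpu_shared_set = 0-1

    # flavor requesting pinned CPUs with the isolate thread policy
    openstack flavor set pinned.isolate \
        --property hw:cpu_policy=dedicated \
        --property hw:cpu_thread_policy=isolate

With this flavor the first placement query asks for PCPU inventory and treats the hyperthreading trait as forbidden, so the host should be rejected; the bug described above is that the VCPU fallback query can still land the pinned instance there.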
[Yahoo-eng-team] [Bug 1883671] Re: [SRIOV] When a VF is bound to a VM, Nova can't retrieve the PCI info
Reading the NIC feature flags was introduced in Pike https://github.com/openstack/nova/commit/e6829f872aca03af6181557260637c8b601e476a but this only seems to happen on modern versions of libvirt, so setting as Won't Fix. It can be backported if someone hits the issue and cares to do so. ** Also affects: nova/pike Importance: Undecided Status: New ** Also affects: nova/queens Importance: Undecided Status: New ** Also affects: nova/train Importance: Undecided Status: New ** Also affects: nova/stein Importance: Undecided Status: New ** Also affects: nova/ussuri Importance: Undecided Status: New ** Also affects: nova/rocky Importance: Undecided Status: New ** Changed in: nova/pike Status: New => Won't Fix ** Changed in: nova/queens Status: New => Won't Fix ** Changed in: nova/rocky Status: New => Won't Fix ** Changed in: nova/stein Status: New => Triaged ** Changed in: nova/stein Status: Triaged => Won't Fix -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1883671

Title: [SRIOV] When a VF is bound to a VM, Nova can't retrieve the PCI info

Status in OpenStack Compute (nova): Fix Released Status in OpenStack Compute (nova) pike series: Won't Fix Status in OpenStack Compute (nova) queens series: Won't Fix Status in OpenStack Compute (nova) rocky series: Won't Fix Status in OpenStack Compute (nova) stein series: Won't Fix Status in OpenStack Compute (nova) train series: Triaged Status in OpenStack Compute (nova) ussuri series: Triaged

Bug description: Nova periodically updates the available resources per hypervisor [1]. That implies the reporting of the PCI devices [2]->[3]. In [4], a new feature was introduced to read from libvirt the NIC capabilities (gso, tso, tx, etc.). But when the NIC interface is bound to the VM and the MAC address is not the one assigned by the driver (Nova changes the MAC address according to the info provided by Neutron), libvirt fails reading the non-existing device: http://paste.openstack.org/show/794799/. This command should be avoided or at least, if the execution fails, the exception could be hidden.

[1] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L9642
[2] https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L6980
[3] https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L6898
[4] Ia5b6abbbf4e5f762e0df04167c32c6135781d305

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1883671/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
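A minimal sketch of the mitigation suggested in the description (hide the libvirt error rather than letting the periodic resource update fail); the helper name is hypothetical and this is not nova's actual code:

    import libvirt

    def get_vf_net_capabilities(conn, nodedev_name):
        # Hypothetical helper: return the node device capability XML for a
        # VF's netdev, or None when libvirt can no longer see the device
        # (e.g. the VF is bound to a guest and its MAC was changed).
        try:
            dev = conn.nodeDeviceLookupByName(nodedev_name)
            return dev.XMLDesc(0)
        except libvirt.libvirtError:
            # Treat a lookup failure as "no extra NIC feature flags" instead
            # of propagating the error from the periodic update.
            return None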
[Yahoo-eng-team] [Bug 1888395] Re: shared live migration of a vm with a vif is broken in train
** Also affects: networking-opencontrail Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1888395 Title: shared live migration of a vm with a vif is broken in train Status in networking-opencontrail: New Status in OpenStack Compute (nova): Incomplete Bug description: it was working in queens but fails in train. nova compute at the target aborts with the exception: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming res = self.dispatcher.dispatch(message) File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 274, in dispatch return self._do_dispatch(endpoint, method, ctxt, args) File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch result = func(ctxt, **new_args) File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 79, in wrapped function_name, call_dict, binary, tb) File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__ self.force_reraise() File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise six.reraise(self.type_, self.value, self.tb) File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 69, in wrapped return f(self, context, *args, **kw) File "/usr/lib/python2.7/site-packages/nova/compute/utils.py", line 1372, in decorated_function return function(self, context, *args, **kwargs) File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 219, in decorated_function kwargs['instance'], e, sys.exc_info()) File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__self.force_reraise() File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise six.reraise(self.type_, self.value, self.tb) File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 207, in decorated_function return function(self, context, *args, **kwargs) File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7007, in pre_live_migration bdm.save() File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__ self.force_reraise() File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise six.reraise(self.type_, self.value, self.tb) File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6972, in pre_live_migration migrate_data) File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 9190, in pre_live_migration instance, network_info, migrate_data) File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 9071, in _pre_live_migration_plug_vifs vif_plug_nw_info.append(migrate_vif.get_dest_vif()) File "/usr/lib/python2.7/site-packages/nova/objects/migrate_data.py", line 90, in get_dest_vif vif['type'] = self.vif_type File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 67, in getter self.obj_load_attr(name) File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 603, in obj_load_attr _("Cannot load '%s' in the base class") % attrname) NotImplementedError: Cannot load 'vif_type' in the base class steps to reproduce: - train centos 7 based deployment: 1 controller, 2 computes, libvirt + qemu-kvm, ceph shared storage, neutron with contrail vrouter virtual network; - create and start a vm; 
- live migrate it between computes. expected result: vm migrates successfully. rpm -qa | grep nova: python2-novaclient-15.1.1-1.el7.noarch openstack-nova-common-20.3.0-1.el7.noarch python2-nova-20.3.0-1.el7.noarch openstack-nova-compute-20.3.0-1.el7.noarch To manage notifications about this bug go to: https://bugs.launchpad.net/networking-opencontrail/+bug/1888395/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1888395] Re: shared live migration of a vm with a vif is broken in train
Moving this to triaged and setting this to High. The regression was introduced in Train by https://opendev.org/openstack/nova/commit/fd8fdc934530fb49497bc6deaa72adfa51c8783a specifically https://github.com/openstack/nova/blob/b8ca3ce31ca15ddaa18512271c2de76835f908bb/nova/compute/manager.py#L7654-L7656 Adding migrate_data.vifs = migrate_data_obj.VIFMigrateData.create_skeleton_migrate_vifs(instance.get_network_info()) unconditionally activates the code path that requires multiple port bindings, because when support for multiple port bindings was added in Rocky it used migrate_data.vifs as a sentinel for the new workflow, i.e. if it is populated the new migration workflow should be used. migrate_data.vifs = migrate_data_obj.VIFMigrateData.create_skeleton_migrate_vifs(instance.get_network_info()) should be if self.network_api.supports_port_binding_extension(ctxt): migrate_data.vifs = migrate_data_obj.VIFMigrateData.create_skeleton_migrate_vifs(instance.get_network_info()) This bug prevents live migration with any neutron backend that does not support multiple port bindings from Train on, so I am setting this to High.

** Changed in: nova Importance: Undecided => High ** Changed in: nova Status: Incomplete => Triaged ** Also affects: nova/train Importance: Undecided Status: New ** Also affects: nova/ussuri Importance: Undecided Status: New ** Changed in: nova/train Status: New => Triaged ** Changed in: nova/train Importance: Undecided => High ** Changed in: nova/ussuri Status: New => Triaged ** Changed in: nova/ussuri Importance: Undecided => High -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1888395

Title: shared live migration of a vm with a vif is broken in train

Status in networking-opencontrail: New Status in OpenStack Compute (nova): Triaged Status in OpenStack Compute (nova) train series: Triaged Status in OpenStack Compute (nova) ussuri series: Triaged

Bug description: it was working in queens but fails in train.
nova compute at the target aborts with the exception: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming res = self.dispatcher.dispatch(message) File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 274, in dispatch return self._do_dispatch(endpoint, method, ctxt, args) File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch result = func(ctxt, **new_args) File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 79, in wrapped function_name, call_dict, binary, tb) File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__ self.force_reraise() File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise six.reraise(self.type_, self.value, self.tb) File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 69, in wrapped return f(self, context, *args, **kw) File "/usr/lib/python2.7/site-packages/nova/compute/utils.py", line 1372, in decorated_function return function(self, context, *args, **kwargs) File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 219, in decorated_function kwargs['instance'], e, sys.exc_info()) File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__self.force_reraise() File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise six.reraise(self.type_, self.value, self.tb) File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 207, in decorated_function return function(self, context, *args, **kwargs) File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7007, in pre_live_migration bdm.save() File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__ self.force_reraise() File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise six.reraise(self.type_, self.value, self.tb) File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6972, in pre_live_migration migrate_data) File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 9190, in pre_live_migration instance, network_info, migrate_data) File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 9071, in _pre_live_migration_plug_vifs vif_plug_nw_info.append(migrate_vif.get_dest_vif()) File "/usr/lib/python2.7/site-packages/nova/objects/migrate_data.py", line 90, in get_dest_vif vif['type']
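A cleaned-up sketch of the change proposed in the triage comment for this bug; it annotates the intent but is not the merged patch:

    # Only populate migrate_data.vifs, which doubles as the sentinel that
    # selects the multiple-port-bindings live-migration workflow, when
    # neutron actually supports the port binding extension.
    if self.network_api.supports_port_binding_extension(ctxt):
        migrate_data.vifs = (
            migrate_data_obj.VIFMigrateData.create_skeleton_migrate_vifs(
                instance.get_network_info()))
    # Otherwise leave migrate_data.vifs unset so the legacy single-binding
    # workflow is used and backends without multiple port bindings keep
    # working.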
[Yahoo-eng-team] [Bug 1893121] [NEW] nova does not balance VMs across NUMA nodes or prefer the NUMA node with a PCI device when one is requested
Public bug reported: The current implementation of NUMA has evolved over the years to support PCI affinity policies and NUMA affinity for other devices like PMEM. When NUMA was first introduced the recommendation was to match the virtual NUMA topology of a guest to the NUMA topology of the host for best performance. In such a configuration the guest CPUs and memory are evenly distributed across the host NUMA nodes, meaning that the memory controllers and physical CPUs are consumed evenly, i.e. the VMs do not all use the cores from one host NUMA node. If you create a VM with only hw:numa_nodes set and no other NUMA requests, however, due to how we currently iterate over host NUMA cells in a deterministic order, all VMs will be placed on NUMA node 0. If other VMs also request NUMA resources like pinned CPUs (hw:cpu_policy=dedicated) or an explicit page size (hw:mem_page_size) then the consumption of those resources will eventually cause those VMs to load balance onto the other NUMA nodes. As a result the current behaviour is to fill the first NUMA node before ever using resources from the rest for NUMA VMs using CPU pinning or hugepages, but NUMA VMs that only request hw:numa_nodes won't be load balanced. In both cases this is suboptimal as it results in lower utilisation of the host hardware: the second and subsequent NUMA nodes will not be used until the first NUMA node is full when using pinning and hugepages, and will never be used for NUMA instances that don't request other NUMA resources.

In a similar vein https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/reserve-numa-with-pci.html partly implemented a preferential sorting of hosts with PCI devices. If the VM did not request a PCI device we weigh hosts with PCI devices lower vs hosts without them: https://github.com/openstack/nova/blob/20459e3e88cb8382d450c7fdb042e2016d5560c5/nova/virt/hardware.py#L2268-L2275 A full implementation would also, on the selected host, prefer putting VMs that request a PCI device on the NUMA nodes that have one. As a result, if a host has 2 NUMA nodes and the VM requests a PCI device and 1 NUMA node, and the VM will fit on the first NUMA node (node 0) and has the preferred policy for PCI affinity, we won't check or use the second NUMA node (node 1).

The fix for this is trivial: add an else clause.

    # If PCI device(s) are not required, prefer host cells that don't have
    # devices attached. Presence of a given numa_node in a PCI pool is
    # indicative of a PCI device being associated with that node
    if not pci_requests and pci_stats:
        # TODO(stephenfin): pci_stats can't be None here but mypy can't figure
        # that out for some reason
        host_cells = sorted(host_cells, key=lambda cell: cell.id in [
            pool['numa_node'] for pool in pci_stats.pools])  # type: ignore

becomes

    # If PCI device(s) are not required, prefer host cells that don't have
    # devices attached. Presence of a given numa_node in a PCI pool is
    # indicative of a PCI device being associated with that node
    if not pci_requests and pci_stats:
        # TODO(stephenfin): pci_stats can't be None here but mypy can't figure
        # that out for some reason
        host_cells = sorted(host_cells, key=lambda cell: cell.id in [
            pool['numa_node'] for pool in pci_stats.pools])  # type: ignore
    else:
        host_cells = sorted(host_cells, key=lambda cell: cell.id in [
            pool['numa_node'] for pool in pci_stats.pools], reverse=True)  # type: ignore

or more compactly

    # If PCI device(s) are not required, prefer host cells that don't have
    # devices attached. Presence of a given numa_node in a PCI pool is
    # indicative of a PCI device being associated with that node
    reverse = bool(pci_requests and pci_stats)
    # TODO(stephenfin): pci_stats can't be None here but mypy can't figure
    # that out for some reason
    host_cells = sorted(host_cells, key=lambda cell: cell.id in [
        pool['numa_node'] for pool in pci_stats.pools], reverse=reverse)  # type: ignore

Since Python's sort is stable, complex sorts can be achieved by multiple stable sorts (https://docs.python.org/3/howto/sorting.html#sort-stability-and-complex-sorts), so we can also address the NUMA balancing issue by first sorting by instances per NUMA node, then by free memory per NUMA node, then by CPUs per NUMA node and finally by PCI devices per NUMA node. This will allow nova to evenly distribute VMs optimally per NUMA node and also fully support the preference aspect of the preferred SR-IOV NUMA affinity policy, which currently only selects a host that is capable of providing NUMA affinity but does not actually prefer the NUMA node when we boot the VM. This bug applies to all currently supported releases of nova.

** Affects: nova Importance: Undecided Assignee: sean mooney (sean-k-mooney) Status: Confirmed ** T
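To illustrate the chained stable sorts mentioned above, a small sketch; the attribute names are invented for the example, and with stable sorts the highest-priority key is applied last so it dominates the final order:

    def spread_order(host_cells):
        # host_cells: objects with num_instances, free_memory, free_pcpus and
        # num_pci_devices attributes (illustrative names only).
        cells = list(host_cells)
        cells.sort(key=lambda c: c.num_pci_devices, reverse=True)  # lowest priority
        cells.sort(key=lambda c: c.free_pcpus, reverse=True)
        cells.sort(key=lambda c: c.free_memory, reverse=True)
        cells.sort(key=lambda c: c.num_instances)                  # highest priority
        return cells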
[Yahoo-eng-team] [Bug 1893148] [NEW] libvirt.libvirtError: Domain not found: no domain with matching uuid
Public bug reported: Seen in upstream CI in the grenade multi-node job as part of a live migration. In this case the error happens on the destination host.

Traceback (most recent call last):
  File "/opt/stack/old/nova/nova/virt/libvirt/host.py", line 605, in _get_domain return conn.lookupByUUIDString(instance.uuid)
  File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 190, in doit result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 148, in proxy_call rv = execute(f, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 129, in execute six.reraise(c, e, tb)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise raise value
  File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 83, in tworker rv = meth(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/libvirt.py", line 4151, in lookupByUUIDString if ret is None:raise libvirtError('virDomainLookupByUUIDString() failed', conn=self)
libvirt.libvirtError: Domain not found: no domain with matching uuid '386113bf-cca1-438a-9ab5-4714c147bbfc'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/stack/old/nova/nova/compute/manager.py", line 8005, in _do_pre_live_migration_from_source instance, block_device_info=block_device_info)
  File "/opt/stack/old/nova/nova/virt/libvirt/driver.py", line 9934, in get_instance_disk_info self._get_instance_disk_info(instance, block_device_info))
  File "/opt/stack/old/nova/nova/virt/libvirt/driver.py", line 9915, in _get_instance_disk_info guest = self._host.get_guest(instance)
  File "/opt/stack/old/nova/nova/virt/libvirt/host.py", line 589, in get_guest return libvirt_guest.Guest(self._get_domain(instance))
  File "/opt/stack/old/nova/nova/virt/libvirt/host.py", line 609, in _get_domain raise exception.InstanceNotFound(instance_id=instance.uuid)
nova.exception.InstanceNotFound: Instance 386113bf-cca1-438a-9ab5-4714c147bbfc could not be found.

This seems similar to https://bugs.launchpad.net/nova/+bug/1662626 but it's not really clear why this fails; see https://zuul.opendev.org/t/openstack/build/cfda29fa579544e481c803c4c5de51fb/log/logs/subnode-2/screen-n-cpu.txt#9697-9729 of https://zuul.opendev.org/t/openstack/build/cfda29fa579544e481c803c4c5de51fb/ for full logs.

** Affects: nova Importance: Medium Status: Triaged ** Tags: libvirt live-migration -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1893148

Title: libvirt.libvirtError: Domain not found: no domain with matching uuid

Status in OpenStack Compute (nova): Triaged

Bug description: Seen in upstream CI in the grenade multi-node job as part of a live migration. In this case the error happens on the destination host.
Traceback (most recent call last): File "/opt/stack/old/nova/nova/virt/libvirt/host.py", line 605, in _get_domain return conn.lookupByUUIDString(instance.uuid) File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 190, in doit result = proxy_call(self._autowrap, f, *args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 148, in proxy_call rv = execute(f, *args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 129, in execute six.reraise(c, e, tb) File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise raise value File "/usr/local/lib/python3.6/dist-packages/eventlet/tpool.py", line 83, in tworker rv = meth(*args, **kwargs) File "/usr/local/lib/python3.6/dist-packages/libvirt.py", line 4151, in lookupByUUIDString if ret is None:raise libvirtError('virDomainLookupByUUIDString() failed', conn=self) libvirt.libvirtError: Domain not found: no domain with matching uuid '386113bf-cca1-438a-9ab5-4714c147bbfc' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/stack/old/nova/nova/compute/manager.py", line 8005, in _do_pre_live_migration_from_source instance, block_device_info=block_device_info) File "/opt/stack/old/nova/nova/virt/libvirt/driver.py", line 9934, in get_instance_disk_info self._get_instance_disk_info(instance, block_device_info)) File "/opt/stack/old/nova/nova/virt/libvirt/driver.py", line 9915, in _get_instance_disk_info guest = self._host.get_guest(instance) File "/opt/stack/old/nova/nova/virt/libvirt/host.py", line 589, in get_guest return libvirt_guest.Guest(self._get_domain(instance)) File "/o
[Yahoo-eng-team] [Bug 1895063] Re: Allow rescue volume backed instance
This is a feature, not a bug. There is already a blueprint open for this so marking this as Invalid. https://blueprints.launchpad.net/nova/+spec/volume-backed-server-rebuild ** Changed in: nova Status: New => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1895063 Title: Allow rescue volume backed instance Status in OpenStack Compute (nova): Fix Released Bug description: Should we offer support for volume backed instances? To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1895063/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1892361] Re: SRIOV instance gets type-PF interface, libvirt kvm fails
** Also affects: nova/queens Importance: Undecided Status: New ** Also affects: nova/ussuri Importance: Undecided Status: New ** Also affects: nova/rocky Importance: Undecided Status: New ** Also affects: nova/train Importance: Undecided Status: New ** Also affects: nova/stein Importance: Undecided Status: New ** Changed in: nova/queens Status: New => Confirmed ** Changed in: nova/queens Importance: Undecided => Medium ** Changed in: nova/rocky Importance: Undecided => Medium ** Changed in: nova/rocky Status: New => Triaged ** Changed in: nova/queens Status: Confirmed => Triaged ** Changed in: nova/stein Importance: Undecided => Medium ** Changed in: nova/stein Status: New => Triaged ** Changed in: nova/train Importance: Undecided => Medium ** Changed in: nova/train Status: New => Triaged ** Changed in: nova/ussuri Importance: Undecided => Medium ** Changed in: nova/ussuri Status: New => Triaged -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1892361 Title: SRIOV instance gets type-PF interface, libvirt kvm fails Status in OpenStack Compute (nova): In Progress Status in OpenStack Compute (nova) queens series: Triaged Status in OpenStack Compute (nova) rocky series: Triaged Status in OpenStack Compute (nova) stein series: Triaged Status in OpenStack Compute (nova) train series: Triaged Status in OpenStack Compute (nova) ussuri series: Triaged Bug description: When spawning an SR-IOV enabled instance on a newly deployed host, nova attempts to spawn it with a type-PF pci device. This fails with the below stack trace. After restarting neutron-sriov-agent and nova-compute services on the compute node and spawning an SR-IOV instance again, a type-VF pci device is selected, and instance spawning succeeds.
Stack trace: 2020-08-20 08:29:09.558 7624 DEBUG oslo_messaging._drivers.amqpdriver [-] received reply msg_id: 6db8011e6ecd4fd0aaa53c8f89f08b1b __call__ /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:400 2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [req-e3e49d07-24c6-4c62-916e-f830f70983a2 ddcfb3640535428798aa3c8545362bd4 dd99e7950a5b46b5b924ccd1720b6257 - 015e4fd7db304665ab5378caa691bb8b 015e4fd7db304665ab5378caa691bb8b] [insta nce: 9498ea75-fe88-4020-9a9e-f4c437c6de11] Instance failed to spawn: libvirtError: unsupported configuration: Interface type hostdev is currently supported on SR-IOV Virtual Functions only 2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] Traceback (most recent call last): 2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2274, in _build_resources 2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] yield resources 2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2054, in _build_and_run_instance 2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] block_device_info=block_device_info) 2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 3147, in spawn 2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] destroy_disks_on_failure=True) 2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5651, in _create_domain_and_network 2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] destroy_disks_on_failure) 2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__ 2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] self.force_reraise() 2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise 2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe88-4020-9a9e-f4c437c6de11] six.reraise(self.type_, self.value, self.tb) 2020-08-20 08:29:09.561 7624 ERROR nova.compute.manager [instance: 9498ea75-fe8
[Yahoo-eng-team] [Bug 1896226] Re: The vnics are disappearing in the vm
** Also affects: neutron Importance: Undecided Status: New ** Tags added: libvirt neutron ovs -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1896226 Title: The vnics are disappearing in the vm Status in neutron: New Status in OpenStack Compute (nova): New Bug description: Hi, We have a rocky OSA setup of branch 18.1.9. When we create a vm from a particular image, the vm comes with two missing vnics inside of it , out of four, which we have provisioned to it from four dhcp tenant networks. The plugin is OVS and the firewall driver is conntrack. It was working earlier until we have noticed this issue. So, if we reboot the VM, then, the vnics appear for a short time and again after a few seconds, they are disappearing. The disappearing vnics inside the vm have a floating ip associated with one of it and hence the vm is becoming unpingable. Again if we reboot the vm , it is comping up for a short time and again vanishing. At the moment when the pinging of the vm stops, we are not noticing any messages in the neutron server logs for that port, except when we try to manually do a vm reboot: LOGS - 2020-09-18 08:41:50.197 17808 INFO neutron.wsgi [req-ff4ad70c-568d-4625-9bea-2a472351d00a dae4d1b704943b11cb10287e984f9367070915c18f7cadd48f915af92d4b4d03 35391b98793b4c09bf87c91006d123c2 - f7834cb0083b4f8f81184b6595b46b34 f7834cb0083b4f8f81184b6595b46b34] 172.29.236.183,172.29.236.21 "GET /v2.0/floatingips?tenant_id=35391b98793b4c09bf87c91006d123c2&port_id=095edf83-2d8d-494d-b820-ef7540aefa7c&port_id=0da3066c-a0f5-49df-a10a-20919595a5b8&port_id=298916ee-91a2-428f-86aa-c5ed5f034563&port_id=3e6d29fe-7bee-47d4-bc98-00d934fc5764&port_id=6d456cbe-425a-4d17-86f6-2b77ab88a42f&port_id=853adf00-92ac--8487-32a77e3efb66&port_id=86ec85de-609a-403d-b027-3097ac597e0c&port_id=86f1034d-837d-4e67-ad5e-63d9642a0b2a&port_id=947c5610-b0f4-439a-abf8-51ba3dc8d212&port_id=b00f8eae-18fa-44c4-92e6-9ee75c7c599c&port_id=b8ba1a0c-ebd9-4278-bc83-9b60e1036f63&port_id=c6ce1a3c-b8b3-493d-b052-2810efacbf5e&port_id=c7facefa-565f-46be-8048-a333505ee177&port_id=d5e1fc5f-1a84-4366-bf36-548f2bdc0366 HTTP/1.1" status: 200 len: 4363 time: 0.0851929 2020-09-18 08:41:50.585 17814 INFO neutron.wsgi [req-349c0d49-9671-43a4-a541-f04facff2ee7 c42abde21dee4c848dc653df8ec429aa e02428b1700247b98ad1d563133f6174 - default default] 172.29.236.57,172.29.236.21 "GET /v2.0/floatingips?fixed_ip_address=172.16.1.19&port_id=86f1034d-837d-4e67-ad5e-63d9642a0b2a HTTP/1.1" status: 200 len: 1042 time: 0.0781569 2020-09-18 08:41:59.951 17808 INFO neutron.wsgi [req-4bc06b83-cb44-469a-9ef1-0ce9a4fa0753 dae4d1b704943b11cb10287e984f9367070915c18f7cadd48f915af92d4b4d03 35391b98793b4c09bf87c91006d123c2 - f7834cb0083b4f8f81184b6595b46b34 f7834cb0083b4f8f81184b6595b46b34] 172.29.236.183,172.29.236.21 "GET /v2.0/floatingips?tenant_id=35391b98793b4c09bf87c91006d123c2&port_id=298916ee-91a2-428f-86aa-c5ed5f034563&port_id=6d456cbe-425a-4d17-86f6-2b77ab88a42f&port_id=86f1034d-837d-4e67-ad5e-63d9642a0b2a&port_id=947c5610-b0f4-439a-abf8-51ba3dc8d212 HTTP/1.1" status: 200 len: 1042 time: 0.0917962 2020-09-18 08:42:06.094 17817 DEBUG neutron.plugins.ml2.rpc [req-9cc5b8b2-a616-466d-9ed8-eae2b9b6056b - - - - -] Device 86f1034d-837d-4e67-ad5e-63d9642a0b2a up at agent ovs-agent-b7w update_device_up /openstack/venvs/neutron-18.1.9/lib/python2.7/site-packages/neutron/plugins/ml2/rpc.py:256 2020-09-18 08:42:06.151 17817 DEBUG neutron.db.provisioning_blocks 
[req-9cc5b8b2-a616-466d-9ed8-eae2b9b6056b - - - - -] Provisioning complete for port 86f1034d-837d-4e67-ad5e-63d9642a0b2a triggered by entity L2. provisioning_complete /openstack/venvs/neutron-18.1.9/lib/python2.7/site-packages/neutron/db/provisioning_blocks.py:138 - The port we are talking about is "86f1034d-837d-4e67-ad5e- 63d9642a0b2a" in the above logs. 1.9/lib/python2.7/site-packages/neutron/notifiers/nova.py:242 2020-09-18 08:41:56.744 16567 DEBUG novaclient.v2.client [-] REQ: curl -g -i -X POST http://wtl-int.sandvine.cloud:8774/v2.1/os-server-external-events -H "Accept: application/json" -H "Content-Type: application/json" -H "User-Agent: python-novaclient" -H "X-Auth-Token: {SHA1}c8de3bb0dae419214d99c02879f89bd3a6a4dd78" -H "X-OpenStack-Nova-API-Version: 2.1" -d '{"events": [{"status": "completed", "tag": "6d456cbe-425a-4d17-86f6-2b77ab88a42f", "name": "network-vif-unplugged", "server_uuid": "713376a9-c354-4fb7-946c-e926c1cd9412"}, {"status": "completed", "tag": "298916ee-91a2-428f-86aa-c5ed5f034563", "name": "network-vif-unplugged", "server_uuid": "713376a9-c354-4fb7-946c-e926c1cd9412"}, {"status": "completed", "tag": "86f1034d-837d-4e67-ad5e-63d9642a0b2a", "name": "network-vif-unplugged", "server_uuid": "713376a9-c354-4fb7-946c-e926c1cd9412"}]}' _http_
[Yahoo-eng-team] [Bug 1896463] Re: evacuation failed: Port update failed : Unable to correlate PCI slot
Just adding the previously filed downstream Red Hat bug https://bugzilla.redhat.com/show_bug.cgi?id=1852110 for context; this can happen in Queens, so when we root cause the issue and fix it, it should likely be backported to Queens. There are other older bugs from Newton that look similar, related to unshelve, so it's possible that the same issue is affecting multiple move operations.

** Bug watch added: Red Hat Bugzilla #1852110 https://bugzilla.redhat.com/show_bug.cgi?id=1852110 ** Also affects: nova/train Importance: Undecided Status: New ** Also affects: nova/stein Importance: Undecided Status: New ** Also affects: nova/ussuri Importance: Undecided Status: New ** Also affects: nova/queens Importance: Undecided Status: New ** Also affects: nova/victoria Importance: Low Assignee: Balazs Gibizer (balazs-gibizer) Status: Confirmed ** Also affects: nova/rocky Importance: Undecided Status: New ** Changed in: nova/ussuri Importance: Undecided => Low ** Changed in: nova/ussuri Status: New => Triaged ** Changed in: nova/train Importance: Undecided => Low ** Changed in: nova/train Status: New => Triaged ** Changed in: nova/stein Importance: Undecided => Low ** Changed in: nova/stein Status: New => Triaged ** Changed in: nova/rocky Importance: Undecided => Low ** Changed in: nova/rocky Status: New => Triaged ** Changed in: nova/queens Importance: Undecided => Low ** Changed in: nova/queens Status: New => Triaged -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1896463

Title: evacuation failed: Port update failed : Unable to correlate PCI slot

Status in OpenStack Compute (nova): Confirmed Status in OpenStack Compute (nova) queens series: Triaged Status in OpenStack Compute (nova) rocky series: Triaged Status in OpenStack Compute (nova) stein series: Triaged Status in OpenStack Compute (nova) train series: Triaged Status in OpenStack Compute (nova) ussuri series: Triaged Status in OpenStack Compute (nova) victoria series: Confirmed

Bug description:

Description
===
If the _update_available_resource() of the resource_tracker is called between _do_rebuild_instance_with_claim() and instance.save() when evacuating VM instances on the destination host,

nova/compute/manager.py
2931 def rebuild_instance(self, context, instance, orig_image_ref, image_ref,
2932 +-- 84 lines: injected_files, new_pass, orig_sys_metadata,---
3016     claim_ctxt = rebuild_claim(
3017         context, instance, scheduled_node,
3018         limits=limits, image_meta=image_meta,
3019         migration=migration)
3020     self._do_rebuild_instance_with_claim(
3021 +-- 47 lines: claim_ctxt, context, instance, orig_image_ref,-
3068     instance.apply_migration_context()
3069     # NOTE (ndipanov): This save will now update the host and node
3070     # attributes making sure that next RT pass is consistent since
3071     # it will be based on the instance and not the migration DB
3072     # entry.
3073     instance.host = self.host
3074     instance.node = scheduled_node
3075     instance.save()
3076     instance.drop_migration_context()

the instance is not handled as a managed instance of the destination host because it is not updated in the DB yet.

2020-09-19 07:27:36.321 8 WARNING nova.compute.resource_tracker [req-b35d5b9a-0786-4809-bd81-ad306cdda8d5 - - - - -] Instance 22f6ca0e-f964-4467-83a3-f2bf12bb05ae is not being actively managed by this compute host but has allocations referencing this compute host: {u'resources': {u'MEMORY_MB': 12288, u'VCPU': 2, u'DISK_GB': 10}}. Skipping heal of allocation because we do not know what to do.

And so the SRIOV ports (PCI devices) were freed by clean_usage() even though the VM already has the VF port.

743 def _update_available_resource(self, context, resources):
744 +-- 45 lines: # initialize the compute node object, creating it--
789     self.pci_tracker.clean_usage(instances, migrations, orphans)
790     dev_pools_obj = self.pci_tracker.stats.to_device_pools_obj()

After that, when we evacuated this VM to another compute host again, we got the error below.

Steps to reproduce
==
1. create a VM on com1 with SRIOV VF ports.
2. stop and disable nova-compute service on com1
3. wait 60 sec (nova-compute reporting interval)
4. evacuate the VM to com2
5. wait the VM is activ
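As a toy illustration of the window described above (this is not nova code): the periodic task frees any claimed device whose owner is not yet recorded as being on the host, which is exactly the state an evacuating instance is in before instance.save():

    # step 1: the rebuild claim on the destination assigns the VF
    claimed = {'0000:81:10.1': 'evacuating-instance-uuid'}

    def clean_usage(instances_on_host):
        # frees devices whose owner is not in the host's instance list
        for addr, owner in list(claimed.items()):
            if owner not in instances_on_host:
                del claimed[addr]

    # the periodic task runs before instance.host is saved, so the evacuating
    # instance is missing from the host's instance list and its VF is freed:
    clean_usage(instances_on_host=[])
    print(claimed)  # -> {}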
[Yahoo-eng-team] [Bug 1581977] Re: Invalid input for dns_name when spawning instance with .number at the end
Personally I think we should close this as invalid. This is either a feature request to allow setting a hostname different from the display name as part of nova boot, or a request to expand the allowed set of VM names to allow '.', which is currently not allowed, and to transform it into some other value to generate a valid hostname. This has never been supported; it is a well known requirement of the nova API that the VM name has to be a valid hostname, meaning it may not contain a '.', so I don't think this is a valid bug. We could improve documentation around this or make the API stricter to reject the request earlier, but anything beyond that would require a spec and an API microversion bump as it would be a new feature. Given the age of this bug I'm going to update the triage status.

** Changed in: nova Importance: Low => Wishlist ** Changed in: nova Status: Triaged => Opinion -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1581977

Title: Invalid input for dns_name when spawning instance with .number at the end

Status in OpenStack Compute (nova): Opinion

Bug description: When attempting to deploy an instance with a name which ends in dot (e.g. .123, as in an all-numeric TLD) or simply a name that, after conversion to dns_name, ends as ., nova conductor fails with the following error:

2016-05-15 13:15:04.824 ERROR nova.scheduler.utils [req-4ce865cd-e75b-4de8-889a-ed7fc7fece18 admin demo] [instance: c4333432-f0f8-4413-82e8-7f12cdf3b5c8] Error from last host: silpixa00394065 (node silpixa00394065): [u'Traceback (most recent call last):\n', u' File "/opt/stack/nova/nova/compute/manager.py", line 1926, in _do_build_and_run_instance\nfilter_properties)\n', u' File "/opt/stack/nova/nova/compute/manager.py", line 2116, in _build_and_run_instance\ninstance_uuid=instance.uuid, reason=six.text_type(e))\n', u"RescheduledException: Build of instance c4333432-f0f8-4413-82e8-7f12cdf3b5c8 was re-scheduled: Invalid input for dns_name. Reason: 'networking-ovn-ubuntu-16.04' not a valid PQDN or FQDN. Reason: TLD '04' must not be all numeric.\nNeutron server returns request_ids: ['req-7317c3e3-2875-4073-8076-40e944845b69']\n"]

This throws one instance of the infamous Horizon message: Error: No valid host was found. There are not enough hosts available. This issue was observed using stable/mitaka via DevStack (nova commit fb3f1706c68ea5b58f05ea810c6339f2449959de). In the above example, the instance name is "networking-ovn (Ubuntu 16.04)", which resulted in an attempted dns_name="networking-ovn-ubuntu-16.04", where the 04 was interpreted as a TLD and, consequently, an invalid TLD.

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1581977/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
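As an illustration of the kind of transformation the triage comment alludes to (deriving a DNS-safe hostname from a free-form display name), a hypothetical sanitizer; nova's actual behaviour is not defined by this code:

    import re

    def to_hostname(display_name):
        # Lower-case, collapse anything that is not a letter, digit or hyphen,
        # and trim to a single DNS label.
        name = re.sub(r'[^a-z0-9-]+', '-', display_name.lower()).strip('-')
        return name[:63] or 'server'

    print(to_hostname('networking-ovn (Ubuntu 16.04)'))
    # -> networking-ovn-ubuntu-16-04, which avoids the all-numeric-TLD error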
[Yahoo-eng-team] [Bug 1901707] Re: race condition on port binding vs instance being resumed for live-migrations
Adding nova as there is a nova element that needs to be fixed as well. Because nova was observing the network-vif-plugged event from the DHCP agent, we were not filtering our wait condition on live migrate to only wait for backends that have plug-time events. So once this is fixed by Rodolfo's patch it actually breaks live migration, because we are waiting for an event that will never come until https://review.opendev.org/c/openstack/nova/+/602432 is merged. For backporting reasons I am working on a separate trivial patch to only wait for backends that send plug-time events. That patch will be backported first, allowing Rodolfo's patch to be backported before https://review.opendev.org/c/openstack/nova/+/602432. I have 1 unit test left to update in the plug-time patch and then I'll push it and reference this bug.

** Also affects: nova Importance: Undecided Status: New ** Changed in: nova Status: New => Triaged ** Changed in: nova Importance: Undecided => High ** Changed in: nova Assignee: (unassigned) => sean mooney (sean-k-mooney) -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1901707

Title: race condition on port binding vs instance being resumed for live-migrations

Status in neutron: In Progress Status in OpenStack Compute (nova): Triaged

Bug description: This is a separation from the discussion in this bug https://bugs.launchpad.net/neutron/+bug/1815989 The comment https://bugs.launchpad.net/neutron/+bug/1815989/comments/52 goes through in detail the flow on a Train deployment using neutron 15.1.0 (controller) and 15.3.0 (compute) and nova 20.4.0. There is a race condition where nova live migration will wait for neutron to send the network-vif-plugged event, but when nova receives that event the live migration completes faster than the OVS L2 agent can bind the port on the destination compute node. This causes the RARP frames sent out to update the switches' ARP tables to fail, leaving the instance completely inaccessible after a live migration unless these RARP frames are sent again or traffic is initiated egress from the instance. See Sean's comments after for the view from the Nova side. The correct behavior should be that the port is ready for use when nova gets the external event, but maybe that is not possible from the neutron side; again see comments in the other bug.

To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1901707/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
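A minimal sketch of the interim nova-side fix described above (only wait for network-vif-plugged events from backends that emit them at plug time); the backend set and VIF fields are assumptions made for this illustration, not nova's real mapping:

    # Backends assumed, for this illustration only, to send network-vif-plugged
    # when the VIF is plugged on the destination host.
    PLUG_TIME_BACKENDS = {'ovs', 'bridge'}

    def events_to_wait_for(network_info):
        # Build the list of external events to wait on during live migration,
        # skipping VIFs whose backend never sends a plug-time event so we do
        # not block on an event that will never arrive.
        return [('network-vif-plugged', vif['id'])
                for vif in network_info
                if vif.get('type') in PLUG_TIME_BACKENDS]

    print(events_to_wait_for([{'id': 'port-1', 'type': 'ovs'},
                              {'id': 'port-2', 'type': 'hw_veb'}]))
    # -> [('network-vif-plugged', 'port-1')]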
[Yahoo-eng-team] [Bug 1909972] Re: a number of tests fail under ppc64el arch
I'm reopening this and marking it as triaged. ppc64le has been supported with third party integration testing provided by the IBM PowerKVM CI on ppc64el for years; here is an example test run: https://oplab9.parqtec.unicamp.br/pub/ppc64el/openstack/nova/68/767368/1/check/tempest-dsvm-full-focal-py3/ef10362/ Red Hat also ships versions of nova for ppc64el in our downstream product https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.0/html/release_notes/chap-introduction#Content_Delivery_Network_CDN_Channels that use libvirt/kvm on ppc64el, starting with POWER8 in OSP 13 I think, and now supporting POWER9. Stephen is correct that we have no first party CI that covers ppc64el, but I think it's still OK to fix the unit tests; they are more likely to regress, yes, but perhaps we can work with infra to see if we can use QEMU to emulate ppc, or see if any of our providers have ppc available. I know Rackspace used to run a large amount of their cloud on ppc.

** Changed in: nova Status: Won't Fix => Triaged -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1909972

Title: a number of tests fail under ppc64el arch

Status in OpenStack Compute (nova): Triaged

Bug description: Hi, As per this Debian bug entry: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=976954 a number of unit tests are failing under ppc64el arch. Please fix these or exclude the tests on this arch.

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1909972/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1852437] Re: Allow ability to disable individual CPU features via `cpu_model_extra_flags`
Setting this back to Invalid as Matt Riedemann said this is a feature, not a bug fix. It is tracked as a blueprint https://blueprints.launchpad.net/nova/+spec/allow-disabling-cpu-flags and we should use that to track it, not this bug.

** Changed in: nova Status: Triaged => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1852437

Title: Allow ability to disable individual CPU features via `cpu_model_extra_flags`

Status in OpenStack Compute (nova): Invalid

Bug description:

What?
When using a custom CPU model, Nova currently allows enabling individual CPU flags/features via the config attribute, `cpu_model_extra_flags`:

[libvirt]
cpu_mode=custom
cpu_model=IvyBridge
cpu_model_extra_flags="pcid,ssbd, md-clear"

The above only lets you enable the CPU features. This RFE is to also allow _disabling_ individual CPU features.

Why?
A couple of reasons:
- An Operator wants to generate a baseline CPU config (that facilitates live migration) across his Compute node pool. However, a certain CPU flag is causing an intolerable performance issue for their guest workloads. If the Operator isolated the problem to _that_ specific CPU flag, then she would like to disable the flag.
- More importantly, a specific CPU flag might trigger a CPU vulnerability. In such a case, the mitigation for it could be to simply _disable_ the offending CPU flag.

Allowing disabling of individual CPU flags via Nova would enable the above use cases.

How?
By allowing the notion of '+' / '-' to indicate whether to enable or disable a given CPU flag. E.g. if you specify the below in 'nova.conf' (on the Compute nodes):

[libvirt]
cpu_mode=custom
cpu_model=IvyBridge
cpu_model_extra_flags="+pcid,-mtrr,ssbd"

Then, when you start an instance, Nova should generate the below XML: IvyBridge Intel

Note that the requirement to specify '+' / '-' for individual flags should be optional. If neither is specified, then we should assume '+', and enable the feature (as shown above for the 'ssbd' flag).

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1852437/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
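The XML example in the description appears to have been stripped by the mailer (only "IvyBridge Intel" survives). A plausible reconstruction of the libvirt CPU element being asked for, assuming standard libvirt syntax rather than quoting the original report, would be:

    <cpu mode='custom' match='exact'>
      <model fallback='forbid'>IvyBridge</model>
      <vendor>Intel</vendor>
      <feature policy='require' name='pcid'/>
      <feature policy='disable' name='mtrr'/>
      <feature policy='require' name='ssbd'/>
    </cpu>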
[Yahoo-eng-team] [Bug 1915055] Re: launched_at's reset when resizing/reverting and unshelving impacts "openstack usage show"
** Changed in: nova Status: New => Won't Fix

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1915055

Title: launched_at's reset when resizing/reverting and unshelving impacts "openstack usage show"

Status in OpenStack Compute (nova): Won't Fix

Bug description: environment: devstack-master stacked Jan 28th 2021

The "openstack usage show" command provides metrics related to the run time of instances, as seen below:

    +---------------+----------+
    | Field         | Value    |
    +---------------+----------+
    | CPU Hours     | 260.68   |
    | Disk GB-Hours | 260.68   |
    | RAM MB-Hours  | 66733.63 |
    | Servers       | 3        |
    +---------------+----------+

The logic in [0] determines how those values are calculated. They are based on the launched_at and terminated_at fields. Some operations, such as resize and unshelve, reset the launched_at field. Therefore, for a given instance, the run time information is wiped, as if it had never run before.

Steps to reproduce:
1. Create an instance.
2. Wait a few minutes, start monitoring usage with "watch openstack usage show --project admin" on a separate tab.
3. Either shelve and unshelve the instance, or resize the instance and revert the resize.
4. Notice how the "openstack usage show" statistics suddenly drop to a lower value and then continue to increase.

Expected result: Statistics would not drop, and should continue measuring.

Some possible solutions:
1. Stop resetting the launched_at field
2. Change the field used for calculation at [0] to something else (maybe created_at?)

[0] https://github.com/openstack/nova/blob/6c0ceda3659405149b7c0b5c283275ef0a896269/nova/api/openstack/compute/simple_tenant_usage.py#L74

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1915055/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
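The usage numbers above come from the window between launched_at and terminated_at, clamped to the reporting period. A rough, self-contained approximation of the logic referenced in [0] (an illustration, not the actual nova code) shows why resetting launched_at makes the reported hours drop:

from datetime import datetime, timedelta

def instance_hours(launched_at, terminated_at, period_start, period_stop):
    # Clamp the instance lifetime to the reporting period; if launched_at
    # is reset by resize/unshelve, the window shrinks and the hours restart.
    start = max(launched_at, period_start)
    stop = min(terminated_at or period_stop, period_stop)
    return max((stop - start).total_seconds(), 0.0) / 3600.0

now = datetime.utcnow()
# An instance that has existed for 10 days but was unshelved an hour ago
# reports roughly 1 hour instead of roughly 240.
print(instance_hours(now - timedelta(hours=1), None, now - timedelta(days=30), now))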
[Yahoo-eng-team] [Bug 1913641] Re: Incorrect Shelved_offloaded instance metrics on openstack usage show output
** Changed in: nova Status: In Progress => Won't Fix -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1913641 Title: Incorrect Shelved_offloaded instance metrics on openstack usage show output Status in OpenStack Compute (nova): Won't Fix Bug description: env: bionic-ussuri and bionic-wallaby (devstack) When running "openstack usage show --project ", having only shelved_offloaded instances in the project, it continues to track metrics as if the instance was running, even though it is not. See output below: $ openstack server list +--+--+---++--+---+ | ID | Name | Status| Networks | Image| Flavor| +--+--+---++--+---+ | a8d3fbb6-1734-4e3f-81db-b1c42a462bf7 | ins1 | SHELVED_OFFLOADED | private=10.0.0.30, fd6b:5cf:38bb:0:f816:3eff:fe66:c5b0 | cirros-0.5.1-x86_64-disk | cirros256 | +--+--+---++--+---+ $ openstack usage show --project admin Usage from 2020-12-31 to 2021-01-29 on project 1bfc9c13d7da4a4183c0b16cfa80020f: +---+---+ | Field | Value | +---+---+ | CPU Hours | 0.04 | | Disk GB-Hours | 0.04 | | RAM MB-Hours | 9.43 | | Servers | 1 | +---+---+ $ openstack server show ins1 +-+-+ | Field | Value | +-+-+ | OS-DCF:diskConfig | MANUAL | | OS-EXT-AZ:availability_zone | | | OS-EXT-SRV-ATTR:host| None | | OS-EXT-SRV-ATTR:hypervisor_hostname | None | | OS-EXT-SRV-ATTR:instance_name | instance-0001 | | OS-EXT-STS:power_state | Shutdown | | OS-EXT-STS:task_state | None | | OS-EXT-STS:vm_state | shelved_offloaded | | OS-SRV-USG:launched_at | 2021-01-28T19:33:34.00 | | OS-SRV-USG:terminated_at| None | | accessIPv4 | | | accessIPv6 | | | addresses | private=10.0.0.30, fd6b:5cf:38bb:0:f816:3eff:fe66:c5b0 | | config_drive| | | created | 2021-01-28T19:33:25Z | | flavor | cirros256 (c1) | | hostId | | | id | a8d3fbb6-1734-4e3f-81db-b1c42a462bf7 | | image | cirros-0.5.1-x86_64-disk (9e09f573-99f7-4f7c-bf16-47d475320207) | | key_name| None | | name| ins1 | | project_id | 1bfc9c13d7da4a4183c0b16cfa80020f | | properties | | | security_groups | name='default' | | status | SHELVED_OFFLOADED | | updated
[Yahoo-eng-team] [Bug 1915255] Re: [Victoria] nova-compute won't start on aarch64 - raises PciDeviceNotFoundById
This is a real issue: the Cavium ThunderX hardware violates an assumption we make that a PF has a netdev if its VFs do. We just need to re-add the try/except that was removed in https://review.opendev.org/c/openstack/nova/+/739131/12/nova/virt/libvirt/driver.py#b6957. It was originally removed because we only look at the subset of VFs that are NICs, but since the Cavium ThunderX does not assign a PF netdev to all VFs, per https://bugs.launchpad.net/charm-nova-compute/+bug/1771662, we need to catch the exception in this case as we did before.

This means that minimum-bandwidth-based QoS cannot be implemented on this hardware, as we rely on the PF netdev name to correlate the bandwidth between nova and neutron, but other functionality should work. The only way to support min-bandwidth QoS on this hardware would be to alter the NIC driver, or to enhance nova/neutron to support using the PF PCI address instead of the parent netdev name.

** Changed in: nova Importance: Undecided => Medium
** Changed in: nova Status: New => Triaged
** Also affects: nova/victoria Importance: Undecided Status: New
** Changed in: nova/victoria Status: New => Triaged
** Changed in: nova/victoria Importance: Undecided => Medium
** Changed in: nova Assignee: (unassigned) => sean mooney (sean-k-mooney)

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1915255

Title: [Victoria] nova-compute won't start on aarch64 - raises PciDeviceNotFoundById

Status in OpenStack Compute (nova): Triaged
Status in OpenStack Compute (nova) victoria series: Triaged

Bug description:

Description
===========
When deploying OpenStack Victoria on Ubuntu 20.04 (Focal) on arm64/aarch64, nova-compute 22.0.1 fails to start with (nova-compute.log):

    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/nova/pci/utils.py", line 156, in get_ifname_by_pci_address
        dev_info = os.listdir(dev_path)
    FileNotFoundError: [Errno 2] No such file or directory: '/sys/bus/pci/devices/0002:01:00.1/physfn/net'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 9823, in _update_available_resource_for_node
        self.rt.update_available_resource(context, nodename,
      File "/usr/lib/python3/dist-packages/nova/compute/resource_tracker.py", line 880, in update_available_resource
        resources = self.driver.get_available_resource(nodename)
      File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 8473, in get_available_resource
        data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()
      File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7223, in _get_pci_passthrough_devices
        pci_info = [self._get_pcidev_info(name, dev, net_devs) for name, dev
      File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7223, in <listcomp>
        pci_info = [self._get_pcidev_info(name, dev, net_devs) for name, dev
      File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7199, in _get_pcidev_info
        device.update(_get_device_type(cfgdev, address, dev, net_devs))
      File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 7154, in _get_device_type
        parent_ifname = pci_utils.get_ifname_by_pci_address(
      File "/usr/lib/python3/dist-packages/nova/pci/utils.py", line 159, in get_ifname_by_pci_address
        raise exception.PciDeviceNotFoundById(id=pci_addr)
    nova.exception.PciDeviceNotFoundById: PCI device 0002:01:00.1 not found

This results in an empty `openstack hypervisor list`. This does not happen with OpenStack Ussuri (nova-compute 21.1.0). We also haven't seen this on other architectures (yet?). This code actually appeared between Ussuri and Victoria, [0] i.e. the first version having it is 22.0.0.

    $ lspci | grep 0002:01:00.1
    0002:01:00.1 Ethernet controller: Cavium, Inc. THUNDERX Network Interface Controller virtual function (rev 09)

Indeed /sys/bus/pci/devices/0002:01:00.1/physfn/ doesn't contain `net`, but I'm not sure if that's really a problem or if nova-compute should just catch the exception and move on? A similar issue in the past [1] shows that this might be an issue specific to the Cavium ThunderX NIC. Related issue: [2]

Steps to reproduce
==================
Install and run nova >= 22.0.0 on an aarch64 machine (with a Cavium ThunderX NIC if possible). I personally use Juju [3] for deploying an entire OpenStack Victoria setup to a lab:

    $ git clone https://github.com/openstack-charmers/openstack-bun
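A self-contained toy version of the failing lookup above, with the guard described in the triage comment re-added (the helper name here is illustrative; the real code is nova.pci.utils.get_ifname_by_pci_address() and its caller in the libvirt driver):

import os

def get_pf_ifname(vf_pci_addr):
    # On hardware such as the Cavium ThunderX the physfn/net directory may
    # not exist for a VF, so treat that as "no parent netdev" rather than
    # letting the exception prevent nova-compute from starting.
    dev_path = '/sys/bus/pci/devices/%s/physfn/net' % vf_pci_addr
    try:
        names = os.listdir(dev_path)
    except FileNotFoundError:
        return None
    return names[0] if names else None

# With no PF netdev the device can still be reported; only the PF-netdev
# based minimum-bandwidth QoS correlation is lost, as noted above.
print(get_pf_ifname('0002:01:00.1'))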
[Yahoo-eng-team] [Bug 1798904] Re: tenant isolation is bypassed if port admin-state-up=false
I'm going to move the os-vif bug to Fix Released, as https://github.com/openstack/os-vif/commit/d291213f1ea62f93008deef5224506fb5ea5ee0d fixes what can be fixed by os-vif alone; it shipped as part of https://github.com/openstack/os-vif/releases/tag/1.13.0. I am going to leave the nova bug as-is until this has been tested end to end, as I believe https://review.opendev.org/c/openstack/nova/+/602432 is still required for nova.

** Changed in: os-vif Status: Confirmed => Fix Released

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1798904

Title: tenant isolation is bypassed if port admin-state-up=false

Status in neutron: New
Status in OpenStack Compute (nova): Confirmed
Status in os-vif: Fix Released
Status in OpenStack Security Advisory: Incomplete

Bug description: This bug is a second variant of https://bugs.launchpad.net/neutron/+bug/1734320

The original bug, which is now public, was limited to the case where a VM is live migrated, resulting in a short window where the tenant instance could receive VLAN-tagged traffic on the destination node before the neutron ML2 agent wires up the port on the OVS bridge. Note that while the original bug implied that the VM was only able to eavesdrop on traffic, it was also possible for the VM to send traffic to a different tenant network by creating a VLAN subport which corresponded to a VLAN in use for tenant isolation on the br-int.

The original bug was determined to be a result of the fact that during live migration, if the vif-type was ovs and ovs_hybrid_plug=false, the VIF was plugged into the OVS bridge by the hypervisor when the VM was started on the destination node, instead of pre-plugging it and waiting for neutron to signal it had completed wiring up the port before migrating the instance. Since live migration is an admin-only operation unless intentionally changed by the operator, the scope of this initial vector was limited.

The second vector, creating a running VM with an untagged port, does not require admin privileges. If a user creates a neutron port and sets the admin-state-up field to False

    openstack port create --disable --network <my network>

and then either boots a VM with this port

    openstack server create --flavor <flavor> --image <image> --port <port>

or attaches the port to an existing VM

    openstack server add port <server> <port>

this will similarly create a window where the port is attached to the guest but neutron has not yet wired up the interface. Note that this was reported to me for Queens with ml2/ovs and the iptables firewall. I have not personally validated how to recreate it, but I intend to reproduce this on master next week and report back.

I believe there are a few ways that this can be mitigated. The mitigations for the live migration variant will narrow the window in which this variant is viable, and in general may be sufficient in the cases where the neutron agent is running correctly, but a more complete fix would involve modifications to nova, neutron and os-vif.

From a neutron perspective we could extend the neutron port bindings to contain 2 additional fields.

ml2_driver_names: an ordered, comma-separated list of the agents that bound this port. Note: this would be used by os-vif to determine if it should perform additional actions, such as tagging the port or setting its tx/rx queues down, to mitigate this issue.

ml2_port_events: a list of the times port status events are emitted by an ml2 driver, or an enum.
Note: currently ml2/ovs signals nova that it has completed wiring up the port only when the agent has configured the vswitch, but ODL sends the notification when the port is bound in the ml2 driver, before the vswitch is configured. To be able to use these more effectively within nova, we need to be able to know when the event is sent. Additionally, changes to os-vif and nova will be required to process this new info.

On the nova side, if we know that a backend will send an event when the port is wired up on the vswitch, we may be able to make attach wait until that has been done. If os-vif knows the ovs plugin is being used with ml2/ovs and the ovs l2 agent, it could also conditionally wait for the interface to be tagged by neutron. This could be done via a config option; however, since the plugin is shared with SDN controllers that manage ovs such as ODL, OVN, ONOS and Dragonflow, it would have to default to not waiting, as these other backends do not use VLANs for tenant isolation. Similarly, instead of waiting, we could have os-vif apply a drop rule and VLAN 4095 based on a config option; again this would have to default to false, or insecure, to not break SDN-based deployments.

If we combine one of the config options with the ml2_driver_names change
[Yahoo-eng-team] [Bug 1909120] Re: n-api should reject requests to detach a volume when the compute is down
Updating this to Fix Released since https://review.opendev.org/c/openstack/nova/+/768352 is merged on master. Backports have been proposed to Ussuri and Victoria, so I have added those, and I also added Train since I assume we want this in Train downstream? If this needs to go back further then feel free to add those branches too, but I tried to pick a reasonable set.

** Also affects: nova/ussuri Importance: Undecided Status: New
** Also affects: nova/victoria Importance: Undecided Status: New
** Also affects: nova/train Importance: Undecided Status: New
** Changed in: nova Status: Confirmed => Fix Released

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1909120

Title: n-api should reject requests to detach a volume when the compute is down

Status in OpenStack Compute (nova): Fix Released
Status in OpenStack Compute (nova) train series: New
Status in OpenStack Compute (nova) ussuri series: New
Status in OpenStack Compute (nova) victoria series: New

Bug description:

Description
===========
At present, requests to detach volumes from instances on down computes are accepted by n-api but will never be acted upon if the n-cpu service hosting the instance is down. n-api should reject such requests with a simple 409 HTTP conflict.

Steps to reproduce
==================
* Attempt to detach a volume from an instance residing on a down compute.

Expected result
===============
Request rejected by n-api

Actual result
=============
Request accepted but never completed

Environment
===========
1. Exact version of OpenStack you are running. See the following list for all releases: http://docs.openstack.org/releases/
   Master
2. Which hypervisor did you use? (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...) What's the version of that?
   libvirt + QEMU/KVM
2. Which storage type did you use? (For example: Ceph, LVM, GPFS, ...) What's the version of that?
   N/A
3. Which networking type did you use? (For example: nova-network, Neutron with OpenVSwitch, ...)
   N/A

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1909120/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
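A minimal sketch of the API-side guard the merged change adds (illustrative only; the helper below and its wiring are assumptions, see the linked review for the real implementation):

from webob import exc

def reject_detach_if_compute_down(service_is_up, host):
    # Fail fast with 409 Conflict instead of accepting a detach that the
    # down nova-compute service on the instance's host can never process.
    if not service_is_up:
        raise exc.HTTPConflict(
            explanation="Compute service on host %s is down; volume "
                        "detach cannot be performed." % host)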
[Yahoo-eng-team] [Bug 1918419] Re: vCPU resource max_unit is hardcoded
In general I don't feel like this is a valid bug. It is perhaps a feature request, which could be accomplished by an extension to provider.yaml to allow standard resource class inventories to be updated by the operator.

In general, what you are asking for is intentionally not allowed: max_unit must not be larger than total, to prevent oversubscription of a single allocation against itself. E.g. if total was 4 and max_unit was 8, we could not actually allocate 8 to a VM without the VM oversubscribing against itself; this would be invalid, therefore changing max_unit in this way would be incorrect.

The supported way to address your current problem would be to resize your impacted VMs before moving them, perhaps to flavors with 2 NUMA nodes, e.g. hw:numa_nodes=2 hw:mem_page_size=small. Note: hw:mem_page_size should always be set if you use hw:numa_nodes.

I'm going to mark this as Invalid for now, but we could discuss this at the PTG. Realistically, though, I don't see a clean way to resolve this while also keeping the VMs alive; resize would work, but the live requirement is what makes that unpalatable.

** Changed in: nova Status: New => Invalid

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1918419

Title: vCPU resource max_unit is hardcoded

Status in OpenStack Compute (nova): Invalid

Bug description: Because of the spectre/meltdown vulnerabilities (2018) we needed to disable SMT on all public-facing compute nodes. As a result the number of available cores was reduced by half. We had flavors available with 32 vCPUs that couldn't be used anymore because the placement max_unit for vCPUs is hardcoded to be the total number of CPUs, regardless of the allocation_ratio. To me it's a sensible default, but it doesn't offer any flexibility for operators. See the IRC discussion at that time: http://eavesdrop.openstack.org/irclogs/%23openstack-placement/%23openstack-placement.2018-09-20.log.html

As a conclusion, we informed the users that we couldn't offer those flavors anymore. The old VMs (that were created before disabling SMT) continued to run without any issue.

So... after ~2 years I'm hitting this problem again :) These compute nodes now need to be retired and we are live migrating all the instances to the replacement hardware. When trying to live migrate these instances (vCPUs > max_unit) it fails, because the migration allocation can't be created against the source compute node. For the new hardware (dest_compute) vCPUs < max_unit, so there is no issue for the new allocation.

I'm working around this problem (to live migrate the instances) by patching the code to have a higher max_unit for vCPUs on the compute nodes hosting these instances.

I feel that this issue should be discussed again, and the possibility to configure the max_unit value considered.

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1918419/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
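A tiny illustration of the invariant described above (not placement's actual validation code): aggregate capacity scales with allocation_ratio, but a single consumer can never exceed one host's physical total, which is why max_unit is capped at total.

def vcpu_capacity(total, allocation_ratio, max_unit):
    # Many instances together may oversubscribe via allocation_ratio, but
    # no single allocation may exceed the physical total of the host.
    assert max_unit <= total, "a single instance cannot oversubscribe itself"
    return total * allocation_ratio

# e.g. 16 physical cores with a 4.0 ratio give 64 schedulable VCPUs in
# aggregate, but no single flavor larger than 16 VCPUs.
print(vcpu_capacity(total=16, allocation_ratio=4.0, max_unit=16))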
[Yahoo-eng-team] [Bug 2017358] Re: VM doesn't boot after qemu-img convert from VMDK to RAW/QCOW2
** Changed in: nova Status: New => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2017358 Title: VM doesn't boot after qemu-img convert from VMDK to RAW/QCOW2 Status in OpenStack Compute (nova): Invalid Bug description: I'm trying to migrate a Windows Server (2016/2019) VM from vSphere/VMWare to OpenStack (KVM-QEMU). I followed these instructions: https://platform9.com/docs/openstack/tutorials-migrate-windows-vm- vmware-kvm without success. After downloading the VMDK file vCenter/vSphere in an Ubuntu Server with GUI installed (a server used for this purpose), I used this command: ``` ~# qemu-img convert -O qcow2 win2016-copy-flat.vmdk win2016.qcow2 ~# qemu-img convert -O raw win2016-copy-flat.vmdk win2016.qcow2 ``` I tried with both formats, RAW and QCOW2, and after importing into my controller node that image with the next command: ``` ~# openstack image create --insecure --container-format bare "win2016-raw" --disk-format raw --file /tmp/win2016.qcow2 ~# openstack image create --insecure --container-format bare "win2016-qcow2" --disk-format qcow2 --file /tmp/win2016.qcow2 ``` Finally, I tested creating a new instance and I obtain this error message: Booting from Hard Disk... Boot failed: not a bootable disk No bootable device. (Exactly like this issue: https://github.com/cloudbase/windows-imaging-tools/issues/324) After googling a lot and a couple of days, I tried another way, to change the chipset of the image from i440fx to q35, also, enabling the boot menu and secure boot, like in this link: https://bugzilla.redhat.com/show_bug.cgi?id=1663212 following the documentation about the properties of images (https://docs.openstack.org/ocata/cli-reference/glance-property- keys.html). Then, my instance continues without booting, with a different message but with the same result, something link this: https://github.com/ipxe/pipxe/issues/14 and a similar screenshot of this thread https://forums.freebsd.org/threads/i-got-error-bdsdxe-failed-to-load-boot0001-when-i-boot-kali-linux-vm-via-uefi-firmware.82773/ Also, I explored the possibility of the partition table being corrupted and I tried to repair it with `gdisk` command; with the same result. So, which other way can I test? Context, I have my services of OpenStack running over a Ubuntu Servers cluster with 3 nodes and 1 controller, deployed with kolla-ansible over docker to have high availability, and CEPH as storage, configured with rbd (rados) to work with Glance/Cinder. I have tested different Windows Server editions from scratch, installing the S.O. locally with KVM and VirtManager, then uploading the QCOW2 disk to OpenStack, and works fine, and other Linux distributions. But this specific scenario migrating with Windows Server from vSphere to OpenStack crashes on that point, with the bootable device. Thank you for reading and for your time. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/2017358/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 2018318] Re: 'openstack server resize --flavor' should not migrate VMs to another AZ
** Changed in: nova Status: In Progress => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2018318 Title: 'openstack server resize --flavor' should not migrate VMs to another AZ Status in OpenStack Compute (nova): Invalid Bug description: Before I start, let me describe the agents involved in the process migration and/or resize flow of OpenStack (in this case, Nova component). These are the mapping and interpretation I created while troubleshooting the reported problem. - Nova-API: the agent responsible for receiving the HTTP requests (create/resize/migrate) from the OpenStack end-user. It does some basic validation, and then sends a message with the requested command via RPC call to other agents. - Nova-conductor: the agent responsible to "conduct/guide" the workflow. Nova-conductor will read the commands from the RPC queue and then process the request from Nova-API. It does some extra validation, and for every command (create/resize/migrate), it asks for the scheduler to define the target host for the operation (if the target host was not defined by the user). - Nova-scheduler: the agent responsible to "schedule" VMs on hosts. It defines where a VM must reside. It receives the "select host request", and processes the algorithms to determine where the VM can be allocated. Before applying the scheduling algorithms, it calls/queries the Placement system to get the possible hosts where VMs might be allocated. I mean, hosts that fit the requested parameters, such as being in a given Cell, availability zone (AZ), having available/free computing resources to support the VM. The call from Nova-scheduler to Placement is an HTTP request. - Placement: behaves as an inventory system. It tracks where resources are allocated, their characteristics, and providers (hosts/storage/network system) where resources are (can be) allocated. It also has some functions to return the possible hosts where a "request spec" can be fulfilled. - Nova: the agent responsible to execute/process the commands and implement actions in the hypervisor. Then, we have the following workflow from the different processes. - migrate: Nova API ->(via RPC call -- nova.conductor.manager.ComputeTaskManager.live_migrate_instance) Nova Conductor (loads request spec) -> (via RPC call) Nova scheduler -> (via HTTP) Placement -> (after the placement return) Nova scheduler executes the filtering of the hosts, based on active filters. - > (return for the other processes in conductor) -> (via RPC call) Nova to execute the migration. - resize: Nova API ->(via RPC call -- nova.conductor.manager.ComputeTaskManager.migrate_server -- _cold_migrate) Nova Conductor (loads request spec) -> (via RPC call) Nova scheduler -> (via HTTP) Placement -> (after the placement return) nova scheduler executes the filtering of the hosts, based on active filters - > (return for the other processes), in Nova conductor -> (RPC call) Nova to execute the cold migration and start the VM again with the new computing resource definition As a side note, this mapping also explains why the "resize" was not executing the CPU compatibility check that the "migration" is executing (this is something else that I was checking, but it is worth mentioning here). The resize is basically a cold migration to a new host, where a new flavor (definition of the VM) is applied; thus, it does not need to evaluate CPU feature set compatibility. 
The problem we are reporting happens with both "migrate" and "resize" operations. Therefore, I had to add some logs to see what was going on there (that whole process is/was "logless"). The issue happens because Placement always returns all hosts of the environment for a given VM being migrated (resize is a migration process); this only happens if the VM is deployed without defining its availability zone in the request spec. To be more precise, Nova-conductor in `nova.conductor.tasks.live_migrate.LiveMigrationTask._get_request_spec_for_select_destinations` (https://github.com/openstack/nova/blob/3d83bb3356e10355437851919e161f258cebf761/nova/conductor/tasks/live_migrate.py#L460) always uses the original request specification, used to deploy the VM, to find a new host to migrate it to. Therefore, if the VM is deployed to a specific AZ, it will always send this AZ to Placement (because the AZ is in the request spec), and Placement will filter out hosts that are not from that AZ. However, if the VM is deployed without defining the AZ, Nova will select a host (from an AZ) to deploy it (the VM), and when migrating the VM, Nova is not trying to find another host in the same AZ where the VM is already running. It is always behaving as a new deployment process to select the host. That raised a que
[Yahoo-eng-team] [Bug 2020215] [NEW] ml2/ovn refuses to bind port due to dead agent randomly in the nova-live-migrate ci job
Public bug reported: we have seen random failures of test_volume_backed_live_migration[id-5071cf17-3004-4257-ae61-73a84e28badd,multinode,volume] in the nova-live-migration job with the following error

    Details: {'code': 400, 'message': 'Migration pre-check error: Binding failed for port e3308a61-39ff-4064-abb2-76de0d2139dc, please check neutron logs for more information.'}

Looking at the neutron log we see

    May 09 00:10:26.714817 np0033982852 neutron-server[78010]: WARNING neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-25d762eb-ffb1-45df-badb-6e02f89e0152 req-f0c9ff35-90a0-49e5-8005-93f3c2bb3ab4 service neutron] Refusing to bind port e3308a61-39ff-4064-abb2-76de0d2139dc to dead agent:
    May 09 00:10:26.716243 np0033982852 neutron-server[78010]: ERROR neutron.plugins.ml2.managers [req-25d762eb-ffb1-45df-badb-6e02f89e0152 req-f0c9ff35-90a0-49e5-8005-93f3c2bb3ab4 service neutron] Failed to bind port e3308a61-39ff-4064-abb2-76de0d2139dc on host np0033982853 for vnic_type normal using segments [{'id': '1770965e-ddf9-4519-96b1-943912334f78', 'network_type': 'geneve', 'physical_network': None, 'segmentation_id': 525, 'network_id': '745f0724-2779-4d60-845c-8f673d567d0d'}]

and the following in the neutron-ovn-metadata-agent on the host the VM is migrating to.

    May 09 00:10:23.765529 np0033982853 neutron-ovn-metadata-agent[38857]: DEBUG neutron.agent.ovn.metadata.agent [-] Delaying updating chassis table for 10 seconds {{(pid=38857) run /opt/stack/neutron/neutron/agent/ovn/metadata/agent.py:243}}

This looks like it might be related to https://github.com/openstack/neutron/commit/628442aed7400251f12809a45605bd717f494c4e

This modified the code to add some randomness due to https://bugs.launchpad.net/neutron/+bug/1991817, but that seems to negatively impact the stability of the agent.

To fix this I will propose a patch to change the interval from

    interval = randint(0, cfg.CONF.agent_down_time // 2)

to

    interval = randint(0, cfg.CONF.agent_down_time // 3)

to increase the likelihood that we send the heartbeat in time. When we are making calls to privsep and OVS, the logs stop for multiple seconds while those operations are happening, and if that happens at the wrong time I believe this leads to us missing the heartbeat interval.

** Affects: neutron Importance: Undecided Assignee: sean mooney (sean-k-mooney) Status: New

** Changed in: neutron Assignee: (unassigned) => sean mooney (sean-k-mooney)

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2020215 Title: ml2/ovn refuses to bind port due to dead agent randomly in the nova- live-migrate ci job Status in neutron: New Bug description: we have seen random failures of test_volume_backed_live_migration[id-5071cf17-3004-4257-ae61-73a84e28badd,multinode,volume] in the nova-live-migaration job with the following error Details: {'code': 400, 'message': 'Migration pre-check error: Binding failed for port e3308a61-39ff-4064-abb2-76de0d2139dc, please check neutron logs for more information.'} looking at the neuton log we see May 09 00:10:26.714817 np0033982852 neutron-server[78010]: WARNING neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [req-25d762eb- ffb1-45df-badb-6e02f89e0152 req-f0c9ff35-90a0-49e5-8005-93f3c2bb3ab4 service neutron] Refusing to bind port e3308a61-39ff-4064-abb2-76de0d2139dc to dead agent: May 09 00:10:26.716243 np0033982852 neutron-server[78010]: ERROR neutron.plugins.ml2.managers [req-25d762eb-ffb1-45df-badb-6e02f89e0152 req-f0c9ff35-90a0-49e5-8005-93f3c2bb3ab4 service neutron] Failed to bind port e3308a61-39ff-4064-abb2-76de0d2139dc on host np0033982853 for vnic_type normal using segments [{'id': '1770965e-ddf9-4519-96b1-943912334f78', 'network_type': 'geneve', 'physical_network': None, 'segmentation_id': 525, 'network_id': '745f0724-2779-4d60-845c-8f673d567d0d'}] and the following in the neutorn-ovn-metadata-agent on the host where the VM is migrating too. May 09 00:10:23.765529 np0033982853 neutron-ovn-metadata-agent[38857]: DEBUG neutron.agent.ovn.metadata.agent [-] Delaying updating chassis table for 10 seconds {{(pid=38857) run /opt/stack/neutron/neutron/agent/ovn/metadata/agent.py:243}} This looks like it might be related to https://github.com/openstack/neutron/commit/628442aed7400251f12809a45605bd717f494c4e This modified the code to add some randomness due to https://bugs.launchpad.net/neutron/+bug/1991817 but that seams to negitivly impact the stability of the agent. to fix this i will propose a patch to change the interval form interval = randint(0,
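The proposed tweak, in runnable form (the function name below is illustrative; the real change is to the delay computation in neutron/agent/ovn/metadata/agent.py quoted above):

from random import randint

def chassis_update_delay(agent_down_time):
    # Cap the random delay at a third of agent_down_time rather than half,
    # leaving more headroom for multi-second privsep/OVS stalls before the
    # heartbeat deadline is missed and the agent is reported dead.
    return randint(0, agent_down_time // 3)

# With neutron's default agent_down_time of 75s, the worst-case delay
# drops from 37s to 25s.
print(chassis_update_delay(75))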
[Yahoo-eng-team] [Bug 2020028] Re: evacuate an instance on non-shared storage succeeded and boot image is rebuilt
This is the expected behavior. Evacuating an image-backed VM rebuilds the root disk, because that is the expected behaviour in a cloud environment, where the instance root disk should not contain any valuable data. This is functioning precisely how the API was designed to work; in fact, the preservation of the disk for BFV instances, or instances on shared storage, is perhaps the more surprising aspect. Evacuate should be assumed to be destructive unless you are using boot from volume; when used with non-boot-from-volume instances it may or may not be destructive, depending on the storage configuration of the compute nodes.

** Changed in: nova Status: New => Invalid

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2020028

Title: evacuate an instance on non-shared storage succeeded and boot image is rebuilt

Status in OpenStack Compute (nova): Invalid

Bug description:

Description
===========
Evacuating an instance on non-shared storage succeeded and the boot image is rebuilt.

Steps to reproduce
==================
1. Create a two-compute-node cluster without shared storage
2. Boot an image-backed virtual machine
3. Shut down the compute node where the VM is running
4. Evacuate the instance to another node

Expected: evacuate fails
Actual: evacuate succeeded and the boot image is rebuilt.

Version
=======
Using nova victoria version

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/2020028/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 2025813] Re: test_rebuild_volume_backed_server failing 100% on nova-lvm job
https://review.opendev.org/q/Ia198f712e2ad277743aed08e27e480208f463ac7 ** Also affects: nova/antelope Importance: Undecided Status: New ** Also affects: nova/zed Importance: Undecided Status: New ** Also affects: nova/yoga Importance: Undecided Status: New ** Changed in: nova/antelope Status: New => In Progress ** Changed in: nova/antelope Importance: Undecided => Critical ** Changed in: nova/antelope Assignee: (unassigned) => sean mooney (sean-k-mooney) ** Changed in: nova/yoga Status: New => Triaged ** Changed in: nova/yoga Importance: Undecided => Critical ** Changed in: nova/zed Status: New => Triaged ** Changed in: nova/zed Importance: Undecided => Critical -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2025813 Title: test_rebuild_volume_backed_server failing 100% on nova-lvm job Status in OpenStack Compute (nova): In Progress Status in OpenStack Compute (nova) antelope series: In Progress Status in OpenStack Compute (nova) yoga series: Triaged Status in OpenStack Compute (nova) zed series: Triaged Bug description: After the tempest patch was merged [1] nova-lvm job started to fail with the following error in test_rebuild_volume_backed_server: Traceback (most recent call last): File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 70, in wrapper return f(*func_args, **func_kwargs) File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 868, in test_rebuild_volume_backed_server self.get_server_ip(server, validation_resources), File "/opt/stack/tempest/tempest/api/compute/base.py", line 519, in get_server_ip return compute.get_server_ip( File "/opt/stack/tempest/tempest/common/compute.py", line 76, in get_server_ip raise lib_exc.InvalidParam(invalid_param=msg) tempest.lib.exceptions.InvalidParam: Invalid Parameter passed: When validation.connect_method equals floating, validation_resources cannot be None As discussed on IRC with Sean [2], the SSH validation is mandatory now which is disabled in the job config [2]. [1] https://review.opendev.org/c/openstack/tempest/+/831018 [2] https://meetings.opendev.org/irclogs/%23openstack-nova/%23openstack-nova.2023-07-04.log.html#t2023-07-04T15:33:38 [3] https://opendev.org/openstack/nova/src/commit/4b454febf73cdd7b5be0a2dad272c1d7685fac9e/.zuul.yaml#L266-L267 To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/2025813/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 2025813] Re: test_rebuild_volume_backed_server failing 100% on nova-lvm job
this is a bug in devstack-plugin-ceph-multinode-tempest-py3 we need to backport https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/882987 ** Also affects: devstack-plugin-ceph Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2025813 Title: test_rebuild_volume_backed_server failing 100% on nova-lvm job Status in devstack-plugin-ceph: New Status in OpenStack Compute (nova): In Progress Status in OpenStack Compute (nova) antelope series: In Progress Status in OpenStack Compute (nova) yoga series: Triaged Status in OpenStack Compute (nova) zed series: Triaged Bug description: After the tempest patch was merged [1] nova-lvm job started to fail with the following error in test_rebuild_volume_backed_server: Traceback (most recent call last): File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 70, in wrapper return f(*func_args, **func_kwargs) File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 868, in test_rebuild_volume_backed_server self.get_server_ip(server, validation_resources), File "/opt/stack/tempest/tempest/api/compute/base.py", line 519, in get_server_ip return compute.get_server_ip( File "/opt/stack/tempest/tempest/common/compute.py", line 76, in get_server_ip raise lib_exc.InvalidParam(invalid_param=msg) tempest.lib.exceptions.InvalidParam: Invalid Parameter passed: When validation.connect_method equals floating, validation_resources cannot be None As discussed on IRC with Sean [2], the SSH validation is mandatory now which is disabled in the job config [2]. [1] https://review.opendev.org/c/openstack/tempest/+/831018 [2] https://meetings.opendev.org/irclogs/%23openstack-nova/%23openstack-nova.2023-07-04.log.html#t2023-07-04T15:33:38 [3] https://opendev.org/openstack/nova/src/commit/4b454febf73cdd7b5be0a2dad272c1d7685fac9e/.zuul.yaml#L266-L267 To manage notifications about this bug go to: https://bugs.launchpad.net/devstack-plugin-ceph/+bug/2025813/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 2028851] [NEW] Console output was empty in test_get_console_output_server_id_in_shutoff_status
Public bug reported: test_get_console_output_server_id_in_shutoff_status https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/api/compute/servers/test_server_actions.py#L713 is failing consistently in the nova-lvm job starting on July 24, with 132 failures in the last 3 days. https://tinyurl.com/kvcc9289

    Traceback (most recent call last):
      File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 728, in test_get_console_output_server_id_in_shutoff_status
        self.wait_for(self._get_output)
      File "/opt/stack/tempest/tempest/api/compute/base.py", line 340, in wait_for
        condition()
      File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 213, in _get_output
        self.assertTrue(output, "Console output was empty.")
      File "/usr/lib/python3.10/unittest/case.py", line 687, in assertTrue
        raise self.failureException(msg)
    AssertionError: '' is not true : Console output was empty.

It's not clear why this has started failing. It may be a regression, or a latent race in the test that we are now hitting.

    def test_get_console_output_server_id_in_shutoff_status(self):
        """Test getting console output for a server in SHUTOFF status

        Should be able to GET the console output for a given server_id in
        SHUTOFF status.
        """
        # NOTE: SHUTOFF is irregular status. To avoid test instability,
        #       one server is created only for this test without using
        #       the server that was created in setUpClass.
        server = self.create_test_server(wait_until='ACTIVE')
        temp_server_id = server['id']

        self.client.stop_server(temp_server_id)
        waiters.wait_for_server_status(self.client, temp_server_id, 'SHUTOFF')
        self.wait_for(self._get_output)

The test does not wait for the VM to be sshable, so it's possible that we are shutting off the VM before it is fully booted and no output has been written to the console.

This failure has happened on multiple providers, but only in the nova-lvm job. The console behavior is unrelated to the storage backend, but the lvm job I believe is using LVM on a loopback file, so the storage performance is likely slower than raw/qcow. Perhaps the boot is taking longer and no output is being written.

** Affects: nova Importance: Undecided Status: New

** Tags: gate-failure

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2028851

Title: Console output was empty in test_get_console_output_server_id_in_shutoff_status

Status in OpenStack Compute (nova): New

Bug description: test_get_console_output_server_id_in_shutoff_status https://github.com/openstack/tempest/blob/04cb0adc822ffea6c7bfccce8fa08b03739894b7/tempest/api/compute/servers/test_server_actions.py#L713 is failing consistently in the nova-lvm job starting on July 24, with 132 failures in the last 3 days. https://tinyurl.com/kvcc9289

    Traceback (most recent call last):
      File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 728, in test_get_console_output_server_id_in_shutoff_status
        self.wait_for(self._get_output)
      File "/opt/stack/tempest/tempest/api/compute/base.py", line 340, in wait_for
        condition()
      File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 213, in _get_output
        self.assertTrue(output, "Console output was empty.")
      File "/usr/lib/python3.10/unittest/case.py", line 687, in assertTrue
        raise self.failureException(msg)
    AssertionError: '' is not true : Console output was empty.

It's not clear why this has started failing.
It may be a regression, or a latent race in the test that we are now hitting.

    def test_get_console_output_server_id_in_shutoff_status(self):
        """Test getting console output for a server in SHUTOFF status

        Should be able to GET the console output for a given server_id in
        SHUTOFF status.
        """
        # NOTE: SHUTOFF is irregular status. To avoid test instability,
        #       one server is created only for this test without using
        #       the server that was created in setUpClass.
        server = self.create_test_server(wait_until='ACTIVE')
        temp_server_id = server['id']

        self.client.stop_server(temp_server_id)
        waiters.wait_for_server_status(self.client, temp_server_id, 'SHUTOFF')
        self.wait_for(self._get_output)

The test does not wait for the VM to be sshable, so it's possible that we are shutting off the VM before it is fully booted and no output has been written to the console. This failure has happened on multiple providers, but only in the nova-lvm job. The console behavior is unrelated to the storage backend, but the lvm job I believe is using LVM on a loopback fil
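One way the race could be hardened (a suggestion/assumption, not a merged tempest change) is to wait for non-empty console output before stopping the server, so a slow LVM-on-loopback boot cannot outrun the assertion:

import time

def wait_for_console_output(servers_client, server_id, timeout=300, interval=5):
    # Poll the console until the guest has written something, instead of
    # assuming output exists the moment the server goes ACTIVE/SHUTOFF.
    deadline = time.time() + timeout
    while time.time() < deadline:
        output = servers_client.get_console_output(server_id)['output']
        if output:
            return output
        time.sleep(interval)
    raise AssertionError('no console output within %ss' % timeout)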
[Yahoo-eng-team] [Bug 2026831] Re: Table nova/pci_devices is not updated after removing attached SRIOV port
This is not a bug; this is intentional behaviour added by https://github.com/openstack/nova/commit/26c41eccade6412f61f9a8721d853b545061adcc to address https://bugs.launchpad.net/nova/+bug/1633120

** Changed in: nova Status: New => Won't Fix

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2026831

Title: Table nova/pci_devices is not updated after removing attached SRIOV port

Status in OpenStack Compute (nova): Won't Fix

Bug description:

Description
===========
When I create an SRIOV port and attach it to an instance, the nova/pci_devices db table entry for the VF is correctly updated and the status of the VF is changed from "available" to "allocated". If I detach the port from the instance, the VF's status is also correctly reverted back to "available". But in case the port is deleted before it is detached from the instance, the VF's status stays "allocated" (in the nova/pci_devices db), which makes this VF unusable.

Steps to reproduce
==================
1) create an SRIOV port in Openstack (VNIC type = Direct) and attach it to a VM
2) delete the SRIOV port from Openstack without detaching it from the VM first

Expected result
===============
1) VF detached from the VM
2) VF's status in the database (nova/pci_devices) changed to "available"

Actual result
=============
1) VF detached from the VM
2) VF's status in the database (nova/pci_devices) IS NOT changed to "available", it stays "allocated"

Environment
===========
1. Openstack version: Yoga
   rpm -qa | grep nova
   python3-novaclient-17.7.0-1.el8.noarch
   openstack-nova-conductor-25.2.0-1.el8.noarch
   python3-nova-25.2.0-1.el8.noarch
   openstack-nova-common-25.2.0-1.el8.noarch
   openstack-nova-scheduler-25.2.0-1.el8.noarch
   openstack-nova-api-25.2.0-1.el8.noarch
   openstack-nova-novncproxy-25.2.0-1.el8.noarch
2. Which hypervisor did you use? Libvirt + KVM. What's the version of that?
   libvirt-7.6.0-6.el8.x86_64
   qemu-kvm-6.0.0-33.el8.x86_64
2. Which storage type did you use? This issue is storage independent.
3. Which networking type did you use? Neutron + openvswitch + sriovnicswitch

Logs & Configs
==============
(hypervisor) nova-compute.log:

Before:
PciDevicePool(count=16,numa_node=0,product_id='XXX',tags={dev_type='type-VF',parent_ifname='XXX',physical_network='XXX',remote_managed='false'},vendor_id='XXX')

After:
PciDevicePool(count=15,numa_node=0,product_id='XXX',tags={dev_type='type-VF',parent_ifname='XXX',physical_network='XXX',remote_managed='false'},vendor_id='XXX')

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/2026831/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1983863] Re: Can't log within tpool.execute
Adding nova, as the change to fix this is breaking our unit tests; https://review.opendev.org/c/openstack/nova/+/894538 corrects this. Setting this as Critical, as it is blocking the bump of upper-constraints to include oslo.log 5.3.0; I don't think there is any real-world impact beyond that.

** Also affects: nova Importance: Undecided Status: New
** Changed in: nova Status: New => In Progress
** Changed in: nova Assignee: (unassigned) => sean mooney (sean-k-mooney)
** Changed in: nova Importance: Undecided => Critical

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1983863

Title: Can't log within tpool.execute

Status in OpenStack Compute (nova): In Progress
Status in oslo.log: Fix Released

Bug description: There is a bug in eventlet where logging within a native thread can lead to a deadlock situation: https://github.com/eventlet/eventlet/issues/432

When encountering this issue, some projects in OpenStack using oslo.log, e.g. Cinder, resolve it by removing any logging within native threads. There is actually a better approach. The Swift team came up with a solution a long time ago, and it would be great if oslo.log could use this workaround automatically: https://opendev.org/openstack/swift/commit/69c715c505cf9e5df29dc1dff2fa1a4847471cb6

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1983863/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 2039381] Re: Regarding Nova's inability to delete the Cinder volume for creating virtual machines (version Y)
Reviewing the steps Rene performed and the initial bug description, this workflow is not supported: nova has never supported attaching a volume to a guest via the cinder API, and detaching that way has been explicitly blocked due to the CVE exposure, so for nova I believe this is invalid.

Cinder likely should prevent normal users from creating attachments for a nova instance, with the same mitigation as the detach case: creating a volume attachment for a nova instance should require a service token with the service role, just as delete does.

** Changed in: nova Status: New => Invalid

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2039381

Title: Regarding Nova's inability to delete the Cinder volume for creating virtual machines (version Y)

Status in Cinder: Confirmed
Status in OpenStack Compute (nova): Invalid

Bug description: When creating a virtual machine in the dashboard, I create a volume and choose to have the volume deleted together with the virtual machine. When deleting the virtual machine, the volume is not properly detached and the volume is not deleted. The relevant error logs are shown in the image, but the openstack CLI can delete the volume. The specific commands are as follows.

CLI:
source /etc/keystone/admin-openrc.sh (verify password file)
openstack volume set --detached 191e555c-3947-4928-be46-9f09e2190877 (volume ID)
openstack volume delete 191e555c-3947-4928-be46-9f09e2190877 (volume ID)

It seems that Nova is unable to interact with the Cinder API to issue the delete (or detach) commands, but I am not very professional. I don't know if it's a bug?

This bug tracker is for documentation errors; please use the following as a template and delete or add fields as needed. Convert [ ] to [x] to check a box:
- [ ] This documentation is inaccurate in this way: __
- [ ] This is a documentation addition request.
- [ ] I have a fix for the documentation that I can paste below, including example input and output.
If you have a troubleshooting or support question, please use the following resources:
- Mailing list: https://lists.openstack.org
- IRC: the #openstack channel on OFTC
---
Release: 25.2.2.dev1 on 2019-10-08 11:20:05
SHA: fd0d336ab5be71917ef9bd94dda51774a697eca8
Source: https://opendev.org/openstack/nova/src/doc/source/install/index.rst
URL: https://docs.openstack.org/nova/yoga/install/

To manage notifications about this bug go to: https://bugs.launchpad.net/cinder/+bug/2039381/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 2051108] Re: Support for the "bring your own keys" approach for Cinder
For cinder this would likely require a spec, as I believe it is an API change to be able to pass the Barbican secret. For nova this might be a specless blueprint if the changes were minor enough and we could capture the details in the cinder spec; otherwise we would need a spec for nova as well. In either case this is not a bug in the scope of nova, so I'll mark the nova part as Invalid from a paperwork perspective, since this would be tracked as a nova blueprint in Launchpad, with or without a spec, not as a bug.

** Changed in: nova Status: New => Invalid

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2051108

Title: Support for the "bring your own keys" approach for Cinder

Status in Cinder: New
Status in OpenStack Compute (nova): Invalid

Bug description:

Description
===========
Cinder currently lacks an API to create a volume with a predefined (e.g. already stored in Barbican) encryption key. This feature would be useful for use cases where end users should be able to store the keys that are later used to encrypt their volumes.

The workflow would be as follows:
1. End user creates a new key and stores it in OpenStack Barbican
2. User requests a new volume with volume type "LUKS" and gives an "encryption_reference_key_id" (or just "key_id").
3. Internally the key is copied (like in volume_utils.clone_encryption_key_()) and a new "encryption_key_id".

To manage notifications about this bug go to: https://bugs.launchpad.net/cinder/+bug/2051108/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 2007968] Re: Flavors may not meet the image minimum requirement when resize
** Also affects: nova Importance: Undecided Status: New
** Changed in: horizon Status: New => Invalid
** Changed in: nova Status: New => Triaged
** Changed in: nova Importance: Undecided => Medium
** Changed in: nova Assignee: (unassigned) => zhou zhong (zhouzhongg)

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2007968

Title: Flavors may not meet the image minimum requirement when resize

Status in OpenStack Dashboard (Horizon): Invalid
Status in OpenStack Compute (nova): Triaged

Bug description:

Description
===========
When resizing an instance, the flavors returned may not meet the image's minimum memory requirement. Resizing ignores the minimum memory limit of the image, which may let the resize succeed while the instance fails to start because the memory is too small to run the system.

Steps to reproduce
==================
1. Create an instance with an image whose min_ram is 4096
2. Resize the instance
3. Watch the returned flavors

Expected result
===============
Do not include the flavors whose memory is less than 4096.

Actual result
=============
All of the visible flavors are returned.

Environment
===========

Logs & Configs
==============

To manage notifications about this bug go to: https://bugs.launchpad.net/horizon/+bug/2007968/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
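A tiny illustration of the check being requested (an assumption about where it would live, not the actual nova or horizon change): filter resize candidate flavors by the image's min_ram.

def resize_candidate_flavors(flavors, image_min_ram_mb):
    # Only offer flavors whose RAM meets the image's declared minimum,
    # mirroring what min_ram already enforces at boot time.
    return [f for f in flavors if f['ram'] >= image_min_ram_mb]

flavors = [{'name': 'm1.small', 'ram': 2048}, {'name': 'm1.medium', 'ram': 4096}]
print(resize_candidate_flavors(flavors, image_min_ram_mb=4096))
# [{'name': 'm1.medium', 'ram': 4096}]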
[Yahoo-eng-team] [Bug 2052718] Re: Nova Compute Service status goes up and down abnormally
I don't believe this is in the scope of nova to fix. The requirement to have consistent time synchronisation is well known, and it strongly feels like a problem that should be addressed by an installation tool, not in code. We mention that the controllers should be running shared services like NTP in the docs: https://docs.openstack.org/nova/latest/install/overview.html#controller If you have not ensured your clocks are in sync as part of the installation process, via NTP, PTP or another method, then I would not consider OpenStack to be correctly installed.

** Changed in: nova Status: New => Opinion
** Changed in: nova Importance: Undecided => Wishlist

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2052718

Title: Compute service status still up with negative elapsed time

Status in OpenStack Compute (nova): Opinion

Bug description: Hi community,

When you type:

    $ openstack compute service list

you will see "up" status, but actually the logic is wrong because the elapsed time is a negative number. This is caused by the abs(elapsed) call turning it into a positive number.

Around the abs(elapsed) line of code -> https://github.com/openstack/nova/blob/stable/2023.2/nova/servicegroup/drivers/db.py

    def is_up(self, service_ref):
        ...
        # Timestamps in DB are UTC.
        elapsed = timeutils.delta_seconds(last_heartbeat, timeutils.utcnow())
        is_up = abs(elapsed) <= self.service_down_time
        if not is_up:
            LOG.debug('Seems service %(binary)s on host %(host)s is down. '
                      'Last heartbeat was %(lhb)s. Elapsed time is %(el)s',
                      {'binary': service_ref.get('binary'),
                       'host': service_ref.get('host'),
                       'lhb': str(last_heartbeat), 'el': str(elapsed)})
        return is_up

service_down_time (threshold): 60s https://github.com/openstack/nova/blob/stable/2023.2/nova/conf/service.py#L40

=== Bad result ===
Example (1) bug:
last_heartbeat: 10:00:00 AM
now: 9:59:30 AM
elapsed: -30(s)
abs(-30s) < 60s ===> result: up

Example (2) bug:
last_heartbeat: 10:01:00 AM
now: 9:59:58 AM
elapsed: -62(s)
abs(-62s) > 60s ===> result: down

=== Expected result ===
Example (1) good expectations:
last_heartbeat: 10:00:00 AM
now: 9:59:30 AM
elapsed: -30(s) < 0 ===> result: logging error and down

Example (2) good expectations:
last_heartbeat: 10:01:00 AM
now: 9:59:58 AM
elapsed: -62(s) < 0 ===> result: logging error and down

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/2052718/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
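For reference, a runnable sketch contrasting the current abs()-based check with the behaviour the reporter expects (illustrative only; nova's real code is the db.py snippet quoted above, and per the triage comment the supported answer is to keep the clocks in sync):

def is_up_current(elapsed, service_down_time=60):
    # What nova does today: a heartbeat from the future (negative elapsed,
    # i.e. skewed clocks) is folded back by abs() and can still count as up.
    return abs(elapsed) <= service_down_time

def is_up_expected(elapsed, service_down_time=60):
    # What the reporter asks for: negative elapsed time is treated as an
    # error and the service is reported down.
    if elapsed < 0:
        return False
    return elapsed <= service_down_time

print(is_up_current(-30), is_up_expected(-30))  # True False
print(is_up_current(-62), is_up_expected(-62))  # False False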
[Yahoo-eng-team] [Bug 2052937] Re: Policy: binding operations are prohibited for service role
nova has a job that was using a post hook for some extra sanity checks https://review.opendev.org/c/openstack/nova/+/909859 i have removed that but until that merges nova-next is blocked. ** Also affects: nova Importance: Undecided Status: New ** Changed in: nova Status: New => In Progress ** Changed in: nova Importance: Undecided => Critical ** Changed in: nova Assignee: (unassigned) => sean mooney (sean-k-mooney) ** Tags added: gate-failure -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2052937 Title: Policy: binding operations are prohibited for service role Status in neutron: Fix Released Status in OpenStack Compute (nova): In Progress Bug description: Create/update port binding:* policies are admin only, which prevents for example ironic service user with service role to manage baremetal ports: "http://192.0.2.10:9292";, "region": "RegionOne"}], "id": "e6e42ef4fc984e71b575150e59a92704", "type": "image", "name": "glance"}]}} get_auth_ref /var/lib/kolla/venv/lib64/python3.9/site-packages/keystoneauth1/identity/v3/base.py:189 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron [None req-6737aef3-c823-4f7c-95ec-1c9f38b14faa a4dbb0dc59024c199843cea86603308b 9fd64a4cbd774756869cb3968de2e9b6 - - default default] Unable to clear binding profile for neutron port 291dbb7b-5cc8-480d-b39d-eb849bcb4a64. Error: ForbiddenException: 403: Client Error for url: http://192.0.2.10:9696/v2.0/ports/291dbb7b-5cc8-480d-b39d-eb849bcb4a64, ((rule:update_port and rule:update_port:binding:host_id) and rule:update_port:binding:profile) is disallowed by policy: openstack.exceptions.ForbiddenException: ForbiddenException: 403: Client Error for url: http://192.0.2.10:9696/v2.0/ports/291dbb7b-5cc8-480d-b39d-eb849bcb4a64, ((rule:update_port and rule:update_port:binding:host_id) and rule:update_port:binding:profile) is disallowed by policy 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron Traceback (most recent call last): 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron File "/var/lib/kolla/venv/lib64/python3.9/site-packages/ironic/common/neutron.py", line 130, in unbind_neutron_port 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron update_neutron_port(context, port_id, attrs_unbind, client) 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron File "/var/lib/kolla/venv/lib64/python3.9/site-packages/ironic/common/neutron.py", line 109, in update_neutron_port 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron return client.update_port(port_id, **attrs) 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron File "/var/lib/kolla/venv/lib64/python3.9/site-packages/openstack/network/v2/_proxy.py", line 2992, in update_port 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron return self._update(_port.Port, port, if_revision=if_revision, **attrs) 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron File "/var/lib/kolla/venv/lib64/python3.9/site-packages/openstack/proxy.py", line 61, in check 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron return method(self, expected, actual, *args, **kwargs) 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron File "/var/lib/kolla/venv/lib64/python3.9/site-packages/openstack/network/v2/_proxy.py", line 202, in _update 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron return res.commit(self, base_path=base_path, if_revision=if_revision) 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron File 
"/var/lib/kolla/venv/lib64/python3.9/site-packages/openstack/resource.py", line 1803, in commit 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron return self._commit( 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron File "/var/lib/kolla/venv/lib64/python3.9/site-packages/openstack/resource.py", line 1848, in _commit 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron self._translate_response(response, has_body=has_body) 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron File "/var/lib/kolla/venv/lib64/python3.9/site-packages/openstack/resource.py", line 1287, in _translate_response 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron exceptions.raise_from_response(response, error_message=error_message) 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron File "/var/lib/kolla/venv/lib64/python3.9/site-packages/openstack/exceptions.py", line 250, in raise_from_response 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron raise cls( 2024-02-12 11:44:57.848 7 ERROR ironic.common.neutron openstack.exceptions.ForbiddenException: ForbiddenException: 403: Client Error for url: htt
[Yahoo-eng-team] [Bug 2054797] Re: Unshelve can cause quota over-consumption
*** This bug is a duplicate of bug 2003991 *** https://bugs.launchpad.net/bugs/2003991 this sounds like you have count_usage_from_placement=true https://docs.openstack.org/nova/latest/configuration/config.html#quota.count_usage_from_placement in which case this is not a bug and is the intended behavior. there was a bug related to this which I believe was fixed recently; it may or may not be backported to yoga ** This bug has been marked a duplicate of bug 2003991 Quota not properly enforced during unshelve when [quota]count_usage_from_placement = True -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2054797 Title: Unshelve can cause quota over-consumption Status in OpenStack Compute (nova): New Bug description: Description === Unshelving a VM can cause over-consumption of a project's quota. I'm not sure if this is a bug or if it's actually intended behaviour, but in my opinion this should not be possible since this will allow users to potentially use a lot more resources than their intended quota. Steps to reproduce == * Create a project with a quota of e.g. 4 CPUs and 4GB of RAM * Create server1 with 2 CPUs and 2GB RAM, and shelve it after it successfully spawns * When server1 is shelved, create server2 with 4 CPUs and 4GB of RAM (effectively using up the entire CPU and RAM quota of the project) * Unshelve server1 Expected result === I would then expect that unshelving server1 would fail, since the quota was used up by server2 Actual result = Unshelving server1 is completed, and I have now used 6 of 4 CPUs and 6 of 4GB RAM on my project's quota. FWIW this also works if at the time of unshelving the quota is already used up. Environment === Openstack Yoga nova-api 3:25.1.1-0ubuntu1~cloud0 nova-scheduler 3:25.1.1-0ubuntu1~cloud0 Running KVM/libvirt on Ubuntu 20.04 and Ceph 17.x To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/2054797/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 2003991] Re: Quota not properly enforced during unshelve when [quota]count_usage_from_placement = True
https://review.opendev.org/q/topic:%22bug/2003991%22 note there were bug backports filed back to train but those branches are now unsupported. ** Also affects: nova/yoga Importance: Undecided Status: New ** Also affects: nova/xena Importance: Undecided Status: New ** Also affects: nova/zed Importance: Undecided Status: New ** Also affects: nova/antelope Importance: Undecided Status: New ** Also affects: nova/wallaby Importance: Undecided Status: New ** Also affects: nova/train Importance: Undecided Status: New ** Also affects: nova/victoria Importance: Undecided Status: New ** Also affects: nova/ussuri Importance: Undecided Status: New ** Changed in: nova/antelope Status: New => Fix Released ** Changed in: nova Importance: Undecided => Medium ** Changed in: nova/antelope Importance: Undecided => Medium ** Changed in: nova/train Status: New => Won't Fix ** Changed in: nova/ussuri Status: New => Won't Fix -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2003991 Title: Quota not properly enforced during unshelve when [quota]count_usage_from_placement = True Status in OpenStack Compute (nova): Fix Released Status in OpenStack Compute (nova) antelope series: Fix Released Status in OpenStack Compute (nova) train series: Won't Fix Status in OpenStack Compute (nova) ussuri series: Won't Fix Status in OpenStack Compute (nova) victoria series: New Status in OpenStack Compute (nova) wallaby series: New Status in OpenStack Compute (nova) xena series: New Status in OpenStack Compute (nova) yoga series: New Status in OpenStack Compute (nova) zed series: New Bug description: When nova is configured to count quota usage from placement [1], there are some behaviors that are different from the legacy quota resource counting. With legacy quotas, all of an instance's resources remained consumed from a quota perspective while the instance was SHELVED_OFFLOADED. Because of this, there was no need to check quota when doing an unshelve and an unshelve request could not be blocked for quota related reasons. The quota usage remained the same whether the instance was SHELVED_OFFLOADED or not. With counting quota usage from placement, cores and ram resource usage is counted from placement while instances are counted from the API database. And when an instance is SHELVED_OFFLOADED, it does not have any resource allocations in placement for cores and ram during that time. Because of this, it is possible to go over cores and ram quota after unshelving an instance as new resources will be allocated in placement for the unshelved instance. The unshelve quota scenario is currently not being properly enforced because there are no quota checks in the scheduling code path, so when the unshelving instance goes through the scheduling process, it is not validated against quota. There needs to be a dedicated quota check for unshelve. [1] https://docs.openstack.org/nova/latest/admin/quotas.html#quota-usage-from-placement To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/2003991/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
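To make the missing check concrete, here is a rough illustration (plain Python, not nova's quota code) of the kind of dedicated gate the description says unshelve needs: compare current placement usage plus the flavor being re-allocated against the project limits:

    def can_unshelve(usage, limits, flavor):
        """Illustrative quota gate for unshelve.

        usage/limits/flavor are dicts like {'cores': 4, 'ram': 4096};
        returns (allowed, over) where over lists the resources that
        would exceed their limit once the instance is unshelved.
        """
        over = [r for r in ('cores', 'ram')
                if usage.get(r, 0) + flavor.get(r, 0)
                > limits.get(r, float('inf'))]
        return (not over, over)

    # quota 4 cores / 4096 MB already fully used by server2, so
    # unshelving server1 (2 cores / 2048 MB) should be rejected
    allowed, over = can_unshelve({'cores': 4, 'ram': 4096},
                                 {'cores': 4, 'ram': 4096},
                                 {'cores': 2, 'ram': 2048})
    assert not allowed and over == ['cores', 'ram']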
[Yahoo-eng-team] [Bug 2055245] Re: DHCP Option is not passed to VM via Cloud-init
this is a neutron bug, not a nova one. the behavior should not change between using the DHCP agent and native DHCP ** Also affects: neutron Importance: Undecided Status: New ** Changed in: nova Status: In Progress => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2055245 Title: DHCP Option is not passed to VM via Cloud-init Status in neutron: New Status in OpenStack Compute (nova): Invalid Bug description: Description === Nova-Metadata-API doesn't provide the ipv4_dhcp type for OVN (native OVN DHCP feature, no DHCP agents) networks with dhcp_enabled but no default gateway. The problem seems to be in https://opendev.org/openstack/nova/src/branch/master/nova/network/neutron.py#L3617 There is just an exception for networks without device_owner: network:dhcp where the default gateway is used, which doesn't cover this case. Steps to reproduce == Create an OVN network in an environment where the native DHCP feature is provided by OVN (no ml2/ovs DHCP agents). In addition, this network needs to have no default gateway. Create a VM in this network and observe the cloud-init process (network_data.json) Expected result === network_data.json (http://169.254.169.254/openstack/2018-08-27/network_data.json) should return something like: { "links": [ { "id": "tapddc91085-96", "vif_id": "ddc91085-9650-4b7b-ad9d-b475bac8ec8b", "type": "ovs", "mtu": 1442, "ethernet_mac_address": "fa:16:3e:93:49:fa" } ], "networks": [ { "id": "network0", "type": "ipv4_dhcp", "link": "tapddc91085-96", "network_id": "9f61a3a7-26d3-4013-b61d-12880b325ea9" } ], "services": [] } Actual result = { "links": [ { "id": "tapddc91085-96", "vif_id": "ddc91085-9650-4b7b-ad9d-b475bac8ec8b", "type": "ovs", "mtu": 1442, "ethernet_mac_address": "fa:16:3e:93:49:fa" } ], "networks": [ { "id": "network0", "type": "ipv4", "link": "tapddc91085-96", "ip_address": "10.0.0.40", "netmask": "255.255.255.0", "routes": [], "network_id": "9f61a3a7-26d3-4013-b61d-12880b325ea9", "services": [] } ], "services": [] } Environment === Openstack Zed with Neutron OVN feature enabled Nova: 26.2.1 To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/2055245/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
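A simplified sketch of the special-casing the reporter is pointing at (function and key names are assumed, this is not the real nova/network/neutron.py code): a subnet with DHCP enabled should be reported as ipv4_dhcp even when it has no gateway or DHCP-agent port:

    def network_entry(link_id, network_id, subnet, ip=None):
        """Build one 'networks' entry for network_data.json.

        subnet is a dict with 'enable_dhcp' and 'gateway_ip'; the point
        of the bug is that DHCP-enabled subnets without a gateway should
        still be exposed as ipv4_dhcp so cloud-init configures DHCP.
        """
        if subnet.get('enable_dhcp'):
            return {'id': link_id, 'type': 'ipv4_dhcp',
                    'link': link_id, 'network_id': network_id}
        return {'id': link_id, 'type': 'ipv4', 'link': link_id,
                'ip_address': ip, 'network_id': network_id}

    # a DHCP-enabled subnet with no gateway still yields ipv4_dhcp
    print(network_entry('network0', '9f61a3a7-26d3-4013-b61d-12880b325ea9',
                        {'enable_dhcp': True, 'gateway_ip': None}))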
[Yahoo-eng-team] [Bug 2058928] [NEW] instance action events are b0rked AF
Public bug reported: Long version: instance actions are meant to have a start and an end, ideally one of success or failure, with the option to have intermediate events for complex operations like resize. For at least interface attach and detach that does not happen, and possibly for all instance actions that are casts… We are sending notifications for attach/detach start, end and failure referencing the instance action type in the notification, i.e. interface_attach.start interface_attach.end But we are not recording any finish events, which means today there is no non-racy way to poll the detach action for its completion without resorting to instance show and parsing the address field to see the IP go away… note that that is also cached, so that won't happen until the network info cache for the instance is updated, and that only works for events that have a visible side effect observable on the instance object. We should fix this for all instance actions and add functional test coverage that asserts, when they complete with error or success, that we have actually updated the db with the event completion. Today in the functional tests we use the notifications or fields on the server object to know if it is complete but never check the instance action events in the db. We already have test helpers to poll for the completion of instance action events (IAEs) https://github.com/openstack/nova/blob/master/nova/tests/functional/integrated_helpers.py#L173-L206 But we don't use them for volume detach for example https://github.com/openstack/nova/blob/master/nova/tests/functional/integrated_helpers.py#L223-L238 because we don't complete the action by sending the event. We have test helpers for most of the instance actions in this file https://github.com/openstack/nova/blob/master/nova/tests/functional/integrated_helpers.py#L223-L238 and they all either wait for state changes on the server or notifications to know when the action has completed, because we are missing the code to complete the event… ** Affects: nova Importance: Undecided Status: New ** Tags: api compute -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2058928 Title: instance action events are b0rked AF Status in OpenStack Compute (nova): New Bug description: Long version: instance actions are meant to have a start and an end, ideally one of success or failure, with the option to have intermediate events for complex operations like resize. For at least interface attach and detach that does not happen, and possibly for all instance actions that are casts… We are sending notifications for attach/detach start, end and failure referencing the instance action type in the notification, i.e. interface_attach.start interface_attach.end But we are not recording any finish events, which means today there is no non-racy way to poll the detach action for its completion without resorting to instance show and parsing the address field to see the IP go away… note that that is also cached, so that won't happen until the network info cache for the instance is updated, and that only works for events that have a visible side effect observable on the instance object. We should fix this for all instance actions and add functional test coverage that asserts, when they complete with error or success, that we have actually updated the db with the event completion.
Today in the functional tests we use the notifications or fields on the server object to know if it is complete but never check the instance action events in the db. We already have test helpers to poll for the completion of instance action events (IAEs) https://github.com/openstack/nova/blob/master/nova/tests/functional/integrated_helpers.py#L173-L206 But we don't use them for volume detach for example https://github.com/openstack/nova/blob/master/nova/tests/functional/integrated_helpers.py#L223-L238 because we don't complete the action by sending the event. We have test helpers for most of the instance actions in this file https://github.com/openstack/nova/blob/master/nova/tests/functional/integrated_helpers.py#L223-L238 and they all either wait for state changes on the server or notifications to know when the action has completed, because we are missing the code to complete the event… To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/2058928/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
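For comparison, this is roughly what "wait for the action to finish" looks like once a finish event is actually recorded; the helper below is a generic sketch with made-up names, not the existing integrated_helpers code, and it assumes events are exposed the way os-instance-actions does (event, result, finish_time):

    import time

    def wait_for_action_event(get_events, event_name, timeout=30, interval=1):
        """Poll get_events() until event_name has a finish_time.

        get_events is any callable returning a list of dicts with
        'event', 'result' and 'finish_time' keys. Raises TimeoutError if
        the event never completes, which is exactly what happens today
        for actions that never record a finish event.
        """
        deadline = time.time() + timeout
        while time.time() < deadline:
            for event in get_events():
                if event.get('event') == event_name and event.get('finish_time'):
                    return event['result']  # e.g. 'Success' or 'Error'
            time.sleep(interval)
        raise TimeoutError('%s never recorded a finish event' % event_name)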
[Yahoo-eng-team] [Bug 2002400] Re: When adding ironic compute host to an aggregate, only one ironic compute node is added to placement aggregate
** Changed in: nova Status: In Progress => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2002400 Title: When adding ironic compute host to an aggregate, only one ironic compute node is added to placement aggregate Status in OpenStack Compute (nova): Invalid Bug description: The reason seems to be this line https://opendev.org/openstack/nova/src/commit/ba9d4c909beff4e9ab86911a35dd5db8d8ce08d6/nova/compute/api.py#L6646 nodes = objects.ComputeNodeList.get_all_by_host(context, host_name) node_name = nodes[0].hypervisor_hostname While OK for libvirt and such, this is not OK for compute services that manage many 'nodes/hypervisors' - e.g. ironic virt driver. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/2002400/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
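For context, the change the reporter was suggesting looks roughly like this (sketched with an assumed placement-client helper, not an existing nova API): walk every compute node reported for the host instead of only nodes[0]:

    def add_host_to_placement_aggregate(compute_nodes, aggregate_uuid, client):
        """Associate every node of a host with the placement aggregate.

        compute_nodes is the full ComputeNodeList returned for the host
        (one entry per ironic node); client.add_provider_to_aggregate is
        an assumed helper that links one resource provider to the
        aggregate, called once per node rather than once per host.
        """
        for node in compute_nodes:
            client.add_provider_to_aggregate(node.hypervisor_hostname,
                                             aggregate_uuid)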
[Yahoo-eng-team] [Bug 1542491] Re: Scheduler update_aggregates race causes incorrect aggregate information
** Changed in: nova Status: Confirmed => Opinion -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1542491 Title: Scheduler update_aggregates race causes incorrect aggregate information Status in OpenStack Compute (nova): Opinion Status in Ubuntu: Invalid Bug description: It appears that if nova-api receives simultaneous requests to add a server to a host aggregate, then a race occurs that can lead to nova- scheduler having incorrect aggregate information in memory. One observed effect of this is that sometimes nova-scheduler will think a smaller number of hosts are a member of the aggregate than is in the nova database and will filter out a host that should not be filtered. Restarting nova-scheduler fixes the issue, as it reloads the aggregate information on startup. Nova package versions: 1:2015.1.2-0ubuntu2~cloud0 Reproduce steps: Create a new os-aggregate and then populate an os-aggregate with simultaneous API POSTs, note timestamps: 2016-02-04 20:17:08.538 13648 INFO nova.osapi_compute.wsgi.server [req-d07a006e-134a-46d8-9815-6becec5b185c 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.3 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates HTTP/1.1" status: 200 len: 439 time: 0.1865470 2016-02-04 20:17:09.204 13648 INFO nova.osapi_compute.wsgi.server [req-a0402297-9337-46d6-96d2-066e230e45e1 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 506 time: 0.2995598 2016-02-04 20:17:09.243 13648 INFO nova.osapi_compute.wsgi.server [req-0f543525-c34e-418a-91a9-894d714ee95b 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 519 time: 0.3140590 2016-02-04 20:17:09.273 13649 INFO nova.osapi_compute.wsgi.server [req-2f8d80b0-726f-4126-a8ab-a2eae3f1a385 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 506 time: 0.3759601 2016-02-04 20:17:09.275 13649 INFO nova.osapi_compute.wsgi.server [req-80ab6c86-e521-4bf0-ab67-4de9d0eccdd3 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.1 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 506 time: 0.3433032 Schedule a VM Expected Result: nova-scheduler Availability Zone filter returns all members of the aggregate Actual Result: nova-scheduler believes there is only one hypervisor in the aggregate. 
The number will vary as it is a race: 2016-02-05 07:48:04.411 13600 DEBUG nova.filters [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Starting with 4 host(s) get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:70 2016-02-05 07:48:04.411 13600 DEBUG nova.filters [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Filter RetryFilter returned 4 host(s) get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:84 2016-02-05 07:48:04.412 13600 DEBUG nova.scheduler.filters.availability_zone_filter [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. (oshv0, oshv0) ram:122691 disk:13404160 io_ops:0 instances:0 has AZs: nova host_passes /usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62 2016-02-05 07:48:04.412 13600 DEBUG nova.scheduler.filters.availability_zone_filter [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. (oshv2, oshv2) ram:122691 disk:13403136 io_ops:0 instances:0 has AZs: nova host_passes /usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62 2016-02-05 07:48:04.413 13600 DEBUG nova.scheduler.filters.availability_zone_filter [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. (oshv1, oshv1) ram:122691 disk:13404160 io_ops:0 instances:0 has AZs: nova host_passes /usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62 2016-02-05 07:48:04.413 13600 DEBUG nova.filters [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Filter AvailabilityZoneFilter returned 1 host(s) get_filtered_objects /usr/lib/python2.7/dist-pack
[Yahoo-eng-team] [Bug 1542491] Re: Scheduler update_aggregates race causes incorrect aggregate information
setting this to medium severity: there is an existing race in how the cache is updated. the workaround is to periodically restart the scheduler to clear the cache. this looks like it affects all stable releases of OpenStack; however, it's unlikely but not impossible that a fix for this can be backported. given the above I'm marking this as medium, as there is a relatively simple workaround even if detecting the issue is not trivial. ** Changed in: nova Importance: Undecided => Medium ** Changed in: nova Status: Opinion => Triaged ** Changed in: nova Assignee: jingtao (liang888) => (unassigned) ** Tags added: api -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1542491 Title: Scheduler update_aggregates race causes incorrect aggregate information Status in OpenStack Compute (nova): Triaged Status in Ubuntu: Invalid Bug description: It appears that if nova-api receives simultaneous requests to add a server to a host aggregate, then a race occurs that can lead to nova-scheduler having incorrect aggregate information in memory. One observed effect of this is that sometimes nova-scheduler will think a smaller number of hosts are a member of the aggregate than is in the nova database and will filter out a host that should not be filtered. Restarting nova-scheduler fixes the issue, as it reloads the aggregate information on startup. Nova package versions: 1:2015.1.2-0ubuntu2~cloud0 Reproduce steps: Create a new os-aggregate and then populate an os-aggregate with simultaneous API POSTs, note timestamps: 2016-02-04 20:17:08.538 13648 INFO nova.osapi_compute.wsgi.server [req-d07a006e-134a-46d8-9815-6becec5b185c 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.3 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates HTTP/1.1" status: 200 len: 439 time: 0.1865470 2016-02-04 20:17:09.204 13648 INFO nova.osapi_compute.wsgi.server [req-a0402297-9337-46d6-96d2-066e230e45e1 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 506 time: 0.2995598 2016-02-04 20:17:09.243 13648 INFO nova.osapi_compute.wsgi.server [req-0f543525-c34e-418a-91a9-894d714ee95b 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 519 time: 0.3140590 2016-02-04 20:17:09.273 13649 INFO nova.osapi_compute.wsgi.server [req-2f8d80b0-726f-4126-a8ab-a2eae3f1a385 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.2 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 506 time: 0.3759601 2016-02-04 20:17:09.275 13649 INFO nova.osapi_compute.wsgi.server [req-80ab6c86-e521-4bf0-ab67-4de9d0eccdd3 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] 10.120.13.1 "POST /v2.1/326d453c2bd440b4a7160489b632d0a8/os-aggregates/1/action HTTP/1.1" status: 200 len: 506 time: 0.3433032 Schedule a VM Expected Result: nova-scheduler Availability Zone filter returns all members of the aggregate Actual Result: nova-scheduler believes there is only one hypervisor in the aggregate. 
The number will vary as it is a race: 2016-02-05 07:48:04.411 13600 DEBUG nova.filters [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Starting with 4 host(s) get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:70 2016-02-05 07:48:04.411 13600 DEBUG nova.filters [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Filter RetryFilter returned 4 host(s) get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:84 2016-02-05 07:48:04.412 13600 DEBUG nova.scheduler.filters.availability_zone_filter [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. (oshv0, oshv0) ram:122691 disk:13404160 io_ops:0 instances:0 has AZs: nova host_passes /usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62 2016-02-05 07:48:04.412 13600 DEBUG nova.scheduler.filters.availability_zone_filter [req-c24338b5-a3b8-4864-8140-04ea6fbcf68f 41812fc01c6549ac8ed15c6dab05c670 326d453c2bd440b4a7160489b632d0a8 - - -] Availability Zone 'temp' requested. (oshv2, oshv2) ram:122691 disk:13403136 io_ops:0 instances:0 has AZs: nova host_passes /usr/lib/python2.7/dist-packages/nova/scheduler/filters/availability_zone_filter.py:62 2016-02-05 07:48:04.413 13600 DEBUG nova.scheduler.filters.availability_zone_filter [req-c
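A toy reproduction of the lost-update pattern behind this report, with the obvious lock-based mitigation; this is illustrative only and not the scheduler's HostManager code:

    import threading

    cache = {'agg-1': set()}        # aggregate id -> set of member hosts
    cache_lock = threading.Lock()

    def update_aggregate_unsafe(agg_id, hosts):
        # two concurrent callers can each read a stale copy; the last
        # writer wins and silently drops the other caller's hosts
        current = set(cache[agg_id])
        current.update(hosts)
        cache[agg_id] = current

    def update_aggregate_safe(agg_id, hosts):
        # serialising the read-modify-write closes the lost-update window
        with cache_lock:
            cache[agg_id].update(hosts)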
[Yahoo-eng-team] [Bug 2073862] Re: test_vmdk_bad_descriptor_mem_limit and test_vmdk_bad_descriptor_mem_limit_stream_optimized fail if qemu-img binary is missing
** Also affects: nova/bobcat Importance: Undecided Status: New ** Also affects: nova/antelope Importance: Undecided Status: New ** Also affects: nova/2024.1 Importance: Undecided Status: New ** Changed in: nova Importance: Undecided => Low ** Changed in: nova/antelope Importance: Undecided => Low ** Changed in: nova/antelope Status: New => Triaged ** Changed in: nova/2024.1 Status: New => Triaged ** Changed in: nova/bobcat Status: New => Triaged ** Changed in: nova/2024.1 Importance: Undecided => Low ** Changed in: nova/bobcat Importance: Undecided => Low -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2073862 Title: test_vmdk_bad_descriptor_mem_limit and test_vmdk_bad_descriptor_mem_limit_stream_optimized fail if qemu-img binary is missing Status in OpenStack Compute (nova): Fix Released Status in OpenStack Compute (nova) 2024.1 series: In Progress Status in OpenStack Compute (nova) antelope series: Triaged Status in OpenStack Compute (nova) bobcat series: Triaged Bug description: When qemu-img binary is not present on the system, these tests fail like we can see on these logs: == ERROR: nova.tests.unit.image.test_format_inspector.TestFormatInspectors.test_vmdk_bad_descriptor_mem_limit -- pythonlogging:'': {{{ 2024-07-23 11:44:54,011 WARNING [oslo_policy.policy] JSON formatted policy_file support is deprecated since Victoria release. You need to use YAML format which will be default in future. You can use ``oslopolicy-convert-json-to-yaml`` tool to convert existing JSON-formatted policy file to YAML-formatted in backward compatible way: https://docs.openstack.org/oslo.policy/latest/cli/oslopolicy-convert-json-to-yaml.html. 2024-07-23 11:44:54,012 WARNING [oslo_policy.policy] JSON formatted policy_file support is deprecated since Victoria release. You need to use YAML format which will be default in future. You can use ``oslopolicy-convert-json-to-yaml`` tool to convert existing JSON-formatted policy file to YAML-formatted in backward compatible way: https://docs.openstack.org/oslo.policy/latest/cli/oslopolicy-convert-json-to-yaml.html. 2024-07-23 11:44:54,015 WARNING [oslo_policy.policy] Policy Rules ['os_compute_api:extensions', 'os_compute_api:os-floating-ip-pools', 'os_compute_api:os-quota-sets:defaults', 'os_compute_api:os-availability-zone:list', 'os_compute_api:limits', 'project_member_api', 'project_reader_api', 'project_member_or_admin', 'project_reader_or_admin', 'os_compute_api:limits:other_project', 'os_compute_api:os-lock-server:unlock:unlock_override', 'os_compute_api:servers:create:zero_disk_flavor', 'compute:servers:resize:cross_cell', 'os_compute_api:os-shelve:unshelve_to_host'] specified in policy files are the same as the defaults provided by the service. You can remove these rules from policy files which will make maintenance easier. You can detect these redundant rules by ``oslopolicy-list-redundant`` tool also. 
}}} Traceback (most recent call last): File "/home/jlejeune/dev/pci_repos/stash/nova/nova/tests/unit/image/test_format_inspector.py", line 408, in test_vmdk_bad_descriptor_mem_limit self._test_vmdk_bad_descriptor_mem_limit() File "/home/jlejeune/dev/pci_repos/stash/nova/nova/tests/unit/image/test_format_inspector.py", line 382, in _test_vmdk_bad_descriptor_mem_limit img = self._create_allocated_vmdk(image_size // units.Mi, File "/home/jlejeune/dev/pci_repos/stash/nova/nova/tests/unit/image/test_format_inspector.py", line 183, in _create_allocated_vmdk subprocess.check_output( File "/usr/lib/python3.10/subprocess.py", line 421, in check_output return run(*popenargs, stdout=PIPE, timeout=timeout, check=True, File "/usr/lib/python3.10/subprocess.py", line 526, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command 'qemu-img convert -f raw -O vmdk -o subformat=monolithicSparse -S 0 /tmp/tmpw0q0ibvj/nova-unittest-formatinspector--monolithicSparse-wz0i4kj1.raw /tmp/tmpw0q0ibvj/nova-unittest-formatinspector--monolithicSparse-qpo78jee.vmdk' returned non-zero exit status 127. == ERROR: nova.tests.unit.image.test_format_inspector.TestFormatInspectors.test_vmdk_bad_descriptor_mem_limit_stream_optimized -- pythonlogging:'': {{{ 2024-07-23 11:43:31,443 WARNING [oslo_policy.policy] JSON formatted policy_file support is deprecated since Victoria release. You need to use YAML format which will be default in future. You can use ``oslopolicy-convert-json
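One common way to make such tests degrade gracefully (an assumption about how this could be handled, not the merged fix) is to skip when the helper binary is not on PATH:

    import shutil
    import subprocess
    import unittest

    class TestVMDKInspector(unittest.TestCase):
        def setUp(self):
            super().setUp()
            # skip instead of failing with exit status 127 when the
            # binary used to build the test image is unavailable
            if shutil.which('qemu-img') is None:
                self.skipTest('qemu-img binary is required for this test')

        def test_vmdk_bad_descriptor_mem_limit(self):
            # stand-in for the real test body, which shells out to qemu-img
            subprocess.check_output(['qemu-img', '--version'])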
[Yahoo-eng-team] [Bug 2033401] Re: sanitize_hostname is not aligned with idna2 specification
Nova does not support internationalised hostnames, so it does not support https://www.rfc-editor.org/rfc/rfc5891 the conversion of the display name to a hostname is best effort and we make no guarantee of its validity for DNS. the conversion utility is intended to produce a valid hostname, but it is not intended to be a domain name. nova could be enhanced to provide that functionality, but I would be more inclined to remove the defaulting of the hostname from the display name and instead use the other fallback we already have, which is to default to server- in a new API microversion. ** Changed in: nova Status: In Progress => Opinion -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2033401 Title: sanitize_hostname is not aligned with idna2 specification Status in OpenStack Compute (nova): Opinion Bug description: DNSmasq was switched to the IDN2 specification more than 4 years ago in the Debian package [0] According to the specification, a name with -- in the 3rd and 4th characters is not allowed. See RFC 5891 [1] As a result, hostnames such as (rf--xx) generate an error on the DNSmasq side and no longer work Aug 29 10:55:32 dnsmasq[243]: bad DHCP host name at line 2 of /var/lib/neutron/dhcp/6531ba54-0aa1-4b3b-b098-49bb0cfd586b/host cat /var/lib/neutron/dhcp/6531ba54-0aa1-4b3b-b098-49bb0cfd586b/host fa:16:3e:d9:ba:17,amphora-ccee6c76-e565-496d-b841-f485a99dc865.openstack.internal.,10.10.10.142 fa:16:3e:c8:93:56,re--test-database-7ezitojxojun-server-01-lrdygbkrxkho.openstack.internal.,10.10.10.209 fa:16:3e:29:dc:fc,host-10-10-10-45.openstack.internal.,10.10.10.45 fa:16:3e:1a:be:3f,host-10-10-10-103.openstack.internal.,10.10.10.103 fa:16:3e:bd:ab:2a,host-10-10-10-1.openstack.internal.,10.10.10.1 fa:16:3e:df:b7:c1,host-10-10-10-118.openstack.internal.,10.10.10.118 [0] https://github.com/imp/dnsmasq/commit/5a9133498562a0b69b287ad675ed3946803ea90c [1] https://www.rfc-editor.org/rfc/rfc5891#section-4.2.3.1 To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/2033401/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
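For reference, the rule the reporter cites is easy to check in isolation; the snippet below implements just the RFC 5891 section 4.2.3.1 constraint on hyphens in the 3rd and 4th positions (punycode 'xn--' A-labels being the sanctioned exception):

    def violates_hyphen34_rule(label):
        """True if a hostname label has '-' in both the 3rd and 4th
        positions without being an A-label ('xn--' prefix), which is the
        RFC 5891 4.2.3.1 case dnsmasq now rejects."""
        return (len(label) >= 4
                and label[2:4] == '--'
                and not label.lower().startswith('xn--'))

    assert violates_hyphen34_rule('rf--xx')             # rejected by dnsmasq
    assert not violates_hyphen34_rule('xn--bcher-kva')  # valid punycode A-label
    assert not violates_hyphen34_rule('server-01')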
[Yahoo-eng-team] [Bug 2067757] Re: AMD server do not support nested virtualization
that's not the reason: RHEL-based distros disable nested virt on AMD by default if I recall correctly, and you have to explicitly enable it. it's not supported on RHEL and is considered tech preview as there are several known bugs. Intel is also not supported downstream for production workloads, however it is much more mature and I believe it's enabled by default. nova is not filtering out svm. setting cpu_mode=none is effectively the same as cpu_mode=host-model, so either libvirt is disabling it or it's a kernel default issue. in either case I don't think this is a valid nova bug. ** Changed in: nova Status: New => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2067757 Title: AMD server do not support nested virtualization Status in OpenStack Compute (nova): Invalid Bug description: From Linux kernel v4.19 onwards, the nested KVM parameter is enabled by default for Intel and AMD. (Though your Linux distribution might override this default; here is the official documentation of this: https://www.kernel.org/doc/html/v5.7/virt/kvm/running-nested-guests.html) We are using OpenStack Zed on CentOS 9 and the VM is running on AMD compute nodes, and the kernel version is: 5.14.0-386.el9.x86_64. When we created an instance on an AMD server and set the "cpu_mode" to "none", we found that the "svm" feature is passed to the instance XML on libvirt, but when we execute "lscpu" inside the VM, we can not see the "svm" feature, so we could not create a L2 instance inside the VM. However, when we set the "cpu_mode" to "host-passthrough" and hard reboot the VM, the "svm" is set correctly within the VM. For Intel servers, we can create nested instances by default, and the "cpu_mode" is also set to "none", and everything works well. We guess it might be because of some CPU feature dependencies which cause this issue. Can you help us to take a look? Thanks To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/2067757/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
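A quick way to confirm whether the kernel, rather than nova, is what disables nested virt on such a host is to read the KVM module parameter (standard Linux sysfs paths; this is a diagnostic sketch, not nova code):

    def nested_virt_enabled(vendor='amd'):
        """Read the KVM module parameter controlling nested virtualization.

        /sys/module/kvm_amd/parameters/nested (or kvm_intel) reads '1'/'Y'
        when nested virt is enabled; if it is '0'/'N' the guest will never
        see svm/vmx regardless of the cpu_mode nova uses.
        """
        path = '/sys/module/kvm_%s/parameters/nested' % vendor
        try:
            with open(path) as f:
                return f.read().strip() in ('1', 'Y', 'y')
        except FileNotFoundError:
            return False  # module not loaded or not a KVM host

    print('nested virt enabled:', nested_virt_enabled('amd'))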
[Yahoo-eng-team] [Bug 2059800] Re: Image download immediately fails when glance returns 500
** Changed in: nova Status: In Progress => Won't Fix -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2059800 Title: Image download immediately fails when glance returns 500 Status in OpenStack Compute (nova): Won't Fix Bug description: Description === nova-compute downloads a VM image from glance when launching an instance. It retries requests when it gets 503, but it does not when it gets 500. When glance uses the cinder backend and an image volume is still in use (for example because another client is downloading the same image), glance returns 500 and this results in immediate instance creation failure. Steps to reproduce == * Deploy glance with the cinder image store * Upload an image * Create an image-boot instance from the image, while downloading the image in the background Expected result === Instance creation succeeds Actual result = Instance creation fails because of a 500 error from glance Environment === This has been seen in the Puppet OpenStack integration job, which uses RDO master. Logs & Configs == Example failure can be found in https://zuul.opendev.org/t/openstack/build/fc0e584a70f947d988ac057a8cc991c2 To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/2059800/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
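Since the report was closed Won't Fix, purely for illustration, a retry-on-500 policy would look something like the following; the fetch callable and its http_status attribute are assumptions, not nova's image download code:

    import time

    def download_with_retries(fetch, retryable=(500, 503), attempts=3, delay=2):
        """Call fetch() until it succeeds or the failure is not retryable.

        fetch is assumed to raise an exception exposing an http_status
        attribute (adapt to whatever the real client raises).
        """
        for attempt in range(1, attempts + 1):
            try:
                return fetch()
            except Exception as exc:
                status = getattr(exc, 'http_status', None)
                if status not in retryable or attempt == attempts:
                    raise
                time.sleep(delay * attempt)  # simple linear backoff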
[Yahoo-eng-team] [Bug 2080556] [NEW] old nova instances can't be started on post-Victoria deployments
Public bug reported: Downstream we had an interesting bug report https://bugzilla.redhat.com/show_bug.cgi?id=2311875 Instances created after Liberty but before Victoria that request a NUMA topology but do not have CPU pinning cannot be started on post-Victoria nova. as part of the https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/cpu-resources.html spec we started tracking CPUs as PCPU and VCPU resource classes, but since a given instance would either have pinned CPUs or floating CPUs, no changes to the instance NUMA topology object were required. with the introduction of mixed CPUs in a single instance https://specs.openstack.org/openstack/nova-specs/specs/victoria/implemented/use-pcpu-vcpu-in-one-instance.html the instance NUMA topology object was extended with a new pcpuset field. as part of that work the _migrate_legacy_object function was extended to default pcpuset to an empty set https://github.com/openstack/nova/commit/867d4471013bf6a70cd3e9e809daf80ea358df92#diff-ed76deb872002cf64931c6d3f2d5967396240dddcb93da85f11886afc7dc4333R212 for NUMA topologies that predate OVO, and a new _migrate_legacy_dedicated_instance_cpuset function was added to migrate existing pinned instances and instances with OVO in the db. what we missed in the review is that unpinned guests should have had cell.pcpuset set to the empty set here https://github.com/openstack/nova/commit/867d4471013bf6a70cd3e9e809daf80ea358df92#diff-ed76deb872002cf64931c6d3f2d5967396240dddcb93da85f11886afc7dc4333R178 The new field is not nullable and is not present in the existing JSON-serialised object; as a result, accessing cell.pcpuset on an object returned from the db will raise a NotImplementedError because it is unset if the VM was created between Liberty and Victoria. this only applies to non-pinned VMs with a NUMA topology, i.e. hw:mem_page_size= or hw:numa_nodes= ** Affects: nova Importance: High Assignee: sean mooney (sean-k-mooney) Status: In Progress ** Tags: numa ** Changed in: nova Assignee: (unassigned) => sean mooney (sean-k-mooney) ** Changed in: nova Importance: Undecided => High -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/2080556 Title: old nova instances can't be started on post-Victoria deployments Status in OpenStack Compute (nova): In Progress Bug description: Downstream we had an interesting bug report https://bugzilla.redhat.com/show_bug.cgi?id=2311875 Instances created after Liberty but before Victoria that request a NUMA topology but do not have CPU pinning cannot be started on post-Victoria nova. as part of the https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/cpu-resources.html spec we started tracking CPUs as PCPU and VCPU resource classes, but since a given instance would either have pinned CPUs or floating CPUs, no changes to the instance NUMA topology object were required. with the introduction of mixed CPUs in a single instance https://specs.openstack.org/openstack/nova-specs/specs/victoria/implemented/use-pcpu-vcpu-in-one-instance.html the instance NUMA topology object was extended with a new pcpuset field. 
as part of that work the _migrate_legacy_object function was extended to default pcpuset to an empty set https://github.com/openstack/nova/commit/867d4471013bf6a70cd3e9e809daf80ea358df92#diff-ed76deb872002cf64931c6d3f2d5967396240dddcb93da85f11886afc7dc4333R212 for NUMA topologies that predate OVO, and a new _migrate_legacy_dedicated_instance_cpuset function was added to migrate existing pinned instances and instances with OVO in the db. what we missed in the review is that unpinned guests should have had cell.pcpuset set to the empty set here https://github.com/openstack/nova/commit/867d4471013bf6a70cd3e9e809daf80ea358df92#diff-ed76deb872002cf64931c6d3f2d5967396240dddcb93da85f11886afc7dc4333R178 The new field is not nullable and is not present in the existing JSON-serialised object; as a result, accessing cell.pcpuset on an object returned from the db will raise a NotImplementedError because it is unset if the VM was created between Liberty and Victoria. this only applies to non-pinned VMs with a NUMA topology, i.e. hw:mem_page_size= or hw:numa_nodes= To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/2080556/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
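Condensed into a sketch (field handling simplified from the real InstanceNUMACell objects), the missed case is that a legacy, unpinned cell needs pcpuset explicitly set to an empty set during migration, or later reads of the field blow up:

    def migrate_legacy_cell(cell):
        """Simplified legacy-object migration for one NUMA cell dict.

        Pre-Victoria cells only stored 'cpuset'. Unpinned cells must still
        get pcpuset populated (as an empty set); otherwise accessing the
        never-set field on the loaded object raises NotImplementedError.
        """
        cell.setdefault('pcpuset', set())
        if cell.get('cpu_pinning'):
            # pinned guests: dedicated CPUs move to pcpuset
            cell['pcpuset'] = set(cell['cpuset'])
            cell['cpuset'] = set()
        return cell

    # legacy unpinned cell: pcpuset defaults to an empty set
    print(migrate_legacy_cell({'cpuset': {0, 1}, 'cpu_pinning': None}))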
[Yahoo-eng-team] [Bug 1969794] Re: backport of the fix for bug #1947370 makes lock_path a required config option when previously it was optional
** Also affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1969794 Title: backport of the fix for bug #1947370 makes lock_path a required config option when previously it was optional Status in OpenStack Compute (nova): New Status in os-brick: New Bug description: https://review.opendev.org/q/topic:bug%252F1947370 as part of fixing bug 1947370 (https://launchpad.net/bugs/1947370) https://review.opendev.org/c/openstack/os-brick/+/814139 made the external lock_path config option required with no default provided. this was then backported, breaking nova unit tests on stable branches and potentially any deployment that upgrades to a new version of os-brick without this defined. I don't believe that such a backport is in line with stable policy, and if it was to be backported a sane default like /tmp/os_brick_lock would be required to not break existing installs. this is currently breaking downstream unit tests for Red Hat OSP 17 and it's also breaking the upstream stable wallaby unit tests for nova. it is unclear if this has directly broken any real world deployment but it has the potential to. as noted in this revert patch https://review.opendev.org/c/openstack/os-brick/+/838871 it is trivial to reproduce this git clone https://opendev.org/openstack/nova nova-test cd nova-test git checkout --track origin/stable/wallaby tox -e py3 ^ this should fail with the lock_path exception cd .. git clone https://opendev.org/openstack/os-brick os-brick-revert cd os-brick-revert git fetch https://review.opendev.org/openstack/os-brick refs/changes/71/838871/1 && git checkout FETCH_HEAD cd ../nova-test .tox/py3/bin/python3 -m pip install -e ../os-brick-revert tox -e py3 that will no longer have the lock_path error .tox/py38/bin/python3 -m pip install os-brick\<4.3.3 while I'm not sure the revert is the correct way to proceed, we will need to blacklist the broken os-brick release in the requirements repo and come up with a backportable fix for all affected branches. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1969794/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
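For anyone hitting this after an os-brick upgrade, the workaround is simply to configure the external lock path explicitly; the directory below is a placeholder, and the programmatic form relies on oslo.concurrency's set_defaults()/lock() helpers as I understand them:

    from oslo_concurrency import lockutils

    # equivalent of setting lock_path under [oslo_concurrency] in nova.conf;
    # the directory is a placeholder, pick one that suits the deployment
    lockutils.set_defaults('/var/lib/nova/tmp')

    with lockutils.lock('connect_volume', external=True):
        # critical section that os-brick would otherwise fail to guard
        pass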