[Yahoo-eng-team] [Bug 2085709] [NEW] update_available_resource task loads the process by 100% with a large number of instances

Ivan Tkachuk Sun, 27 Oct 2024 07:17:30 -0700

Public bug reported:

Description
===========
When many instances (70 or more) are placed on one node, the nova-compute 
process periodically drives one CPU core to 100%, and all operations 
(restart, migration, etc.) start to take a very long time. 
After some time, RabbitMQ starts dropping the connection to the node because 
it does not receive a heartbeat from it, and the node is marked as down in the 
list of hypervisors. 
Over time the situation gets worse, and the process freezes more and more. 
Restarting the process gives only a short-term improvement.
I found that this is caused by the update_available_resource task, 
which collects information on all instances. 
When I disabled it by setting 
update_resources_interval = -1 
in the configuration, everything started working as it should: the CPU load is 
minimal and all operations are performed quickly. 
The nova-compute process runs in a single thread, and with many simultaneous 
tasks collecting information from instances it uses an entire core and 
freezes.
The node itself has enough processor resources; it is not even 50% loaded. 
Screenshot from top - https://imgur.com/JXcDhS8
Here's an example of nova-compute processor usage before and after disabling 
the update_available_resource task - https://imgur.com/qqkhNla
I think this task needs to run in a separate thread so that it doesn't affect 
the service when there are a lot of instances.
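
The starvation described above can be illustrated with a minimal, 
self-contained sketch. It is only an analogy, not nova code: nova-compute 
schedules its work with eventlet green threads, while this sketch uses asyncio 
from the Python standard library because it is cooperative in a similar way. 
The names collect_instance_stats, heartbeat and run are hypothetical, and the 
busy loop merely stands in for a long collection pass over many instances. The 
sketch counts how many heartbeats are sent while the pass runs, first inline 
on the event-loop thread and then offloaded to a separate thread, which is the 
change suggested in this report.

```python
import asyncio
import time


def collect_instance_stats(seconds: float) -> None:
    """Hypothetical stand-in for a blocking pass over all instances."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass  # busy loop: never yields to the event loop


async def heartbeat(beats: list, interval: float) -> None:
    """Record a timestamp every `interval` seconds, like a messaging heartbeat."""
    while True:
        beats.append(time.monotonic())
        await asyncio.sleep(interval)


async def run(offload: bool, work: float = 0.5, interval: float = 0.05) -> int:
    """Return how many heartbeats were sent while the collection pass ran."""
    beats: list = []
    hb = asyncio.create_task(heartbeat(beats, interval))
    await asyncio.sleep(0)  # let the heartbeat task send its first beat
    start = time.monotonic()
    if offload:
        # Run the blocking pass in a worker thread; the loop stays responsive.
        await asyncio.to_thread(collect_instance_stats, work)
    else:
        # Run the blocking pass on the event-loop thread; nothing else runs.
        collect_instance_stats(work)
    end = time.monotonic()
    hb.cancel()
    return sum(1 for t in beats if start < t < end)


if __name__ == "__main__":
    print("heartbeats while collecting inline:   ", asyncio.run(run(offload=False)))
    print("heartbeats while collecting in thread:", asyncio.run(run(offload=True)))
```

With the inline variant no heartbeat can be sent until the pass finishes, 
which matches the missed RabbitMQ heartbeats reported here. Note that a native 
thread in CPython still shares the interpreter lock with the main thread, so 
offloading keeps the heartbeat responsive but does not make the collection 
itself any cheaper.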


Steps to reproduce
==================
Create a small flavor so that 100 instances fit on the node, then create at 
least 100 instances: 
openstack flavor create --public m1.extra_tiny --id auto --ram 512 --disk 15 \
  --vcpus 1
openstack server create --image 618ed5d4-f692-4ce3-af96-542c8ae9926a \
  --network cc50edc1-3435-4854-ae7e-8215568a4249 --flavor m1.extra_tiny \
  --min 100 --max 100 test-nova

Expected result
===============
Nova-compute continues to work, does not disconnect or freeze. 

Actual result
=============
Some time after the instances are launched, nova-compute CPU usage periodically 
rises to 100% while the process collects information about instances, and 
all operations take a long time until the task finishes. 

Environment
===========
OpenStack release 2023.1
Nova-compute 27.5.1
Hypervisor Libvirt + KVM
Storage type - VM files are located on node disks with ext4 file system
CPU - 2xIntel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
Networking Neutron with Open vSwitch

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: performance

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2085709

