Public bug reported:

Description
===========
When many instances (70 or more) are placed on one node, the nova-compute process periodically starts using 100% of one CPU core, and all operations (restart, migration, etc.) start to take a very long time. After some time, RabbitMQ starts dropping the connection to the node because it no longer receives a heartbeat from it, and the node is marked as down in the list of hypervisors. Over time the situation gets worse and the process freezes more and more often. Restarting the process gives only a short-term improvement.

I found out that this happens because of the update_available_resource task, which collects information on all instances. When I disabled it by setting

  update_resources_interval = -1

in the configuration, everything started working as it should: the CPU load is minimal and all operations are performed quickly.

The nova-compute process runs in a single thread, and with many simultaneous tasks collecting information from instances it uses the entire core and freezes. The node itself has enough CPU resources; overall it is not even 50% loaded.

Screenshot from top - https://imgur.com/JXcDhS8

Here's an example of the nova-compute CPU usage before and after disabling the update_available_resource task - https://imgur.com/qqkhNla

I think this task needs to run in a separate thread so that it doesn't affect the service when there are a lot of instances.
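To illustrate the effect described above, here is a minimal, self-contained sketch. It is not nova code. It uses Python's standard asyncio module as a stand-in for cooperative scheduling in a single thread, and time.sleep() as a stand-in for the collection work. The names collect_instance_stats, heartbeat and longest_heartbeat_gap are hypothetical and exist only for this example. It measures the longest gap between heartbeats when the blocking work runs inline versus when it is handed to a worker thread.

```python
import asyncio
import time


def collect_instance_stats(duration: float) -> None:
    """Hypothetical stand-in for the per-instance collection work.

    time.sleep() models a call that never yields to the event loop.
    """
    time.sleep(duration)


async def heartbeat(interval: float, stamps: list) -> None:
    """Record a timestamp every `interval` seconds until cancelled."""
    while True:
        stamps.append(time.monotonic())
        await asyncio.sleep(interval)


async def longest_heartbeat_gap(offload: bool, work: float = 0.5,
                                interval: float = 0.05) -> float:
    """Return the longest gap between heartbeats around the blocking work."""
    stamps = []
    hb = asyncio.create_task(heartbeat(interval, stamps))
    await asyncio.sleep(interval * 2)  # let a few heartbeats go out first
    if offload:
        # Worker thread: the event loop keeps scheduling the heartbeat.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, collect_instance_stats, work)
    else:
        # Inline: nothing else is scheduled until the call returns.
        collect_instance_stats(work)
    await asyncio.sleep(interval * 2)  # let the heartbeat resume
    hb.cancel()
    try:
        await hb
    except asyncio.CancelledError:
        pass
    return max(b - a for a, b in zip(stamps, stamps[1:]))


if __name__ == "__main__":
    print("inline gap   :", asyncio.run(longest_heartbeat_gap(offload=False)))
    print("offloaded gap:", asyncio.run(longest_heartbeat_gap(offload=True)))
```

With the inline call the heartbeat gap grows to roughly the duration of the work, while with the worker thread it stays close to the heartbeat interval. Note one assumption: sleep() releases the interpreter lock, whereas truly CPU-bound Python code in a thread would still compete for it, so a real fix might need a separate process rather than a thread.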

Steps to reproduce
==================
Create a small flavor so that 100 instances fit on the node, and create at least 100 instances.

  openstack flavor create --public m1.extra_tiny --id auto --ram 512 --disk 15 --vcpus 1
  openstack server create --image 618ed5d4-f692-4ce3-af96-542c8ae9926a --network cc50edc1-3435-4854-ae7e-8215568a4249 --flavor m1.extra_tiny --min 100 --max 100 test-nova

Expected result
===============
nova-compute continues to work and does not disconnect or freeze.

Actual result
=============
Some time after the instances are launched, nova-compute CPU usage periodically rises to 100% while the process collects information about instances. Any operation takes a long time until the task finishes processing.

Environment
===========
OpenStack release 2023.1
nova-compute 27.5.1
Hypervisor: Libvirt + KVM
Storage type: VM files are located on node disks with an ext4 file system
CPU: 2x Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
Networking: Neutron with OpenVSwitch

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: performance

-- 
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2085710

Title:
  update_available_resource task loads the process by 100% with a large number of instances

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2085710/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp