jdhirst opened a new issue, #11104: URL: https://github.com/apache/cloudstack/issues/11104
### problem ## Issue Details After switching over to main for testing the CKS improvements, I ran into an issue where my compute nodes lose their local storage pools in libvirt. The pools exist in Cloudstack however the pools get undefined. This can be temporarily fixed by manually defining the pools with the same UUID via `virsh pool-define`, however they get removed after as well. Both of the compute nodes have `local.storage.uuid` defined in their agent.properties file. Once this issue starts to happen, restarting the agent does not cause the pool to come back. However, if you change the UUID in `agent.properties`, a new pool gets created and will start without issue. The affect of this is that after the pool is removed, existing instances continue to run, however new instances cannot be created until the pool is manually recreated. Creating a new pool with a different UUID works for a time, but eventually also gets removed. I'm not exactly sure what is triggering the removal. ## Log Snippets Both management server and agent encounter errors trying to find the pool: ``` 2025-06-29 07:36:58,138 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] (StatsCollector-6:[ctx-8a6c9ea0]) (logid:58ac492b) Details from executing class com.cloud.agent.api.GetVolumeStatsCommand: Can't get vm disk stats: Could not fetch storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa from libvirt due to org.libvirt.LibvirtException: Storage pool not found: no storage pool with matching uuid '27ea62dc-ded2-4a25-9115-277c981f33fa' 2025-06-29 07:36:58,096 DEBUG [cloud.agent.Agent] (AgentRequest-Handler-2:[]) (logid:58ac492b) Processing command: com.cloud.agent.api.GetVolumeStatsCommand 2025-06-29 07:36:58,096 DEBUG [kvm.resource.LibvirtConnection] (AgentRequest-Handler-2:[]) (logid:58ac492b) Looking for libvirtd connection at: qemu:///system 2025-06-29 07:36:58,096 INFO [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-2:[]) (logid:58ac492b) Trying to fetch storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa from libvirt 2025-06-29 07:36:58,096 DEBUG [kvm.resource.LibvirtConnection] (AgentRequest-Handler-2:[]) (logid:58ac492b) Looking for libvirtd connection at: qemu:///system 2025-06-29 07:36:58,097 DEBUG [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-2:[]) (logid:58ac492b) Could not find storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa in libvirt ``` The pool is being removed by the management server: ``` 2025-06-29 03:36:47,496 DEBUG [c.c.s.StorageManagerImpl] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Removing pool StoragePool {"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"} from host Host {"id":2,"name":"cs-compute2.cloud.redacted.net","type":"Routing","uuid":"dfd0102e-f60c-44b4-8a12-70bc46fc8154"} 2025-06-29 03:36:47,511 DEBUG [c.c.a.t.Request] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Seq 2-3926857400090361861: Sending { Cmd , MgmtId: 90520738888109, via: 2(cs-compute2.cloud.redacted.net), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.DeleteStoragePoolCommand":{"_pool":{"id":"13","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d","host":"10.25.0.3","path":"/var/lib/libvirt/images","port":"0","type":"Filesystem"},"_localPath":"/mnt//492af6d4-b5d1-3dc8-8a50-040101d28230","_removeDatastore":"false","wait":"0","bypassHostMaintenance":"false"}}] } 2025-06-29 03:36:47,563 DEBUG [c.c.a.t.Request] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Seq 2-3926857400090361861: Received: { Ans: , MgmtId: 90520738888109, via: 2(cs-compute2.cloud.redacted.net), Ver: v1, Flags: 10, { Answer } } 2025-06-29 03:36:47,576 INFO [o.a.c.s.d.p.DefaultHostListener] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Connection removed between storage pool: StoragePool {"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"} and host: Host {"id":2,"name":"cs-compute2.cloud.redacted.net","type":"Routing","uuid":"dfd0102e-f60c-44b4-8a12-70bc46fc8154"} 2025-06-29 03:36:47,606 DEBUG [c.c.s.StorageManagerImpl] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Found storage pool StoragePool {"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"} of type Filesystem with overprovisioning factor 2 2025-06-29 03:36:47,608 DEBUG [c.c.s.StorageManagerImpl] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Total over provisioned capacity of the pool StoragePool {"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"} is (306.79 GB) 329410707456 2025-06-29 03:36:47,629 DEBUG [c.c.s.StorageManagerImpl] (AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Successfully set Capacity - (306.79 GB) 329410707456 for capacity type - 9 , DataCenterId - 1, Pool - StoragePool {"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"}, PodId 1 ``` Agent's logs showing pool removal: ``` 2025-06-29 03:36:47,233 DEBUG [cloud.agent.Agent] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Request:Seq 1-4725683383995203589: { Cmd , MgmtId: 90520738888109, via: 1, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.DeleteStoragePoolCommand":{"_pool":{"id":"12","uuid":"27ea62dc-ded2-4a25-9115-277c981f33fa","host":"10.25.0.2","path":"/var/lib/libvirt/images","port":"0","type":"Filesystem"},"_localPath":"/mnt//36cfe92b-5de4-3baf-9813-796fbddeb8af","_removeDatastore":"false","wait":"0","bypassHostMaintenance":"false"}}] } 2025-06-29 03:36:47,233 DEBUG [cloud.agent.Agent] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Processing command: com.cloud.agent.api.DeleteStoragePoolCommand 2025-06-29 03:36:47,233 INFO [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Attempting to remove storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa from libvirt 2025-06-29 03:36:47,234 DEBUG [kvm.resource.LibvirtConnection] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Looking for libvirtd connection at: qemu:///system 2025-06-29 03:36:47,235 INFO [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa has no corresponding secret. Not removing any secret. 2025-06-29 03:36:47,236 INFO [kvm.storage.LibvirtStorageAdaptor] (AgentRequest-Handler-1:[]) (logid:4c5c4abb) Storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa was successfully removed from libvirt. ``` The pools still show as up in the UI:  And in the DB: (as you can see, I attempted to replace the pools with new ones to see what would happen) |id|name|uuid|pool_type|port|data_center_id|pod_id|cluster_id|used_bytes|capacity_bytes|host_address|user_info|path|created|removed|update_time|status|storage_provider_name|scope|hypervisor|managed|capacity_iops|parent|used_iops| |--|----|----|---------|----|--------------|------|----------|----------|--------------|------------|---------|----|-------|-------|-----------|------|---------------------|-----|----------|-------|-------------|------|---------| |1|cs-compute1.cloud.redacted.net-local-0322a17b||Filesystem|0|1|1|1|424638771200|913383583744|10.25.0.2||/var/lib/libvirt/images|2024-09-17 12:14:34|2025-06-28 12:44:14|2025-06-27 12:40:43|Up|DefaultPrimary|HOST|KVM|0||0|120| |2|cs-compute2.cloud.redacted.net-local-7aa8e196||Filesystem|0|1|1|1|114410532864|164705353728|10.25.0.3||/var/lib/libvirt/images|2024-09-17 12:17:05|2025-06-28 12:43:04|2025-06-28 09:09:47|Up|DefaultPrimary|HOST|KVM|0||0|60| |4|nfs|10805e0a-2205-3158-993f-9350b08b9137|NetworkFilesystem|2049|1|1|1|36540069576704|61401196134400|nas.redacted.net||/volume1/cs-primary|2024-09-19 16:55:58||2025-06-29 06:34:38|Up|DefaultPrimary|CLUSTER|KVM|0||0|| |5|nfs-ssd|80a947f2-ab02-3d8c-b1c0-7ba964ae0e80|NetworkFilesystem|2049|1|1|1|358665551872|1418370088960|nas.redacted.net||/volume2/cs-primary-ssd|2024-09-28 10:31:47||2025-06-29 06:34:38|Up|DefaultPrimary|CLUSTER|KVM|0||0|| |12|cs-compute1.cloud.redacted.net-local-27ea62dc|27ea62dc-ded2-4a25-9115-277c981f33fa|Filesystem|0|1|1|1|424638771200|913383583744|10.25.0.2||/var/lib/libvirt/images|2025-06-28 13:15:34||2025-06-29 00:38:22|Up|DefaultPrimary|HOST|KVM|0||0|29| |13|cs-compute2.cloud.redacted.net-local-d997caf8|d997caf8-e98c-498c-aac2-3016f6ae2f5d|Filesystem|0|1|1|1|114449989632|164705353728|10.25.0.3||/var/lib/libvirt/images|2025-06-28 13:21:47||2025-06-29 00:49:49|Up|DefaultPrimary|HOST|KVM|0||0|60| The removed column is set to NULL for the two current local storage pools even **after** they have been removed from the hosts. ## Tracing I attempted to do some tracing of this issue (see trace logs attached), however I am still unable to find what triggers the pool removal call in the first place. ## Log Files ### versions I'm using the latest nightly build as of yesterday, however this issue has been there ever since I switched over to nightly. Current build: 4.21.0.0-SNAPSHOT.20250628 OS: RHEL 8.10 The cluster is made up of one management node and two KVM compute nodes, all running the same build. ### The steps to reproduce the bug 1. 2. 3. ... ### What to do about it? _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@cloudstack.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org