jdhirst opened a new issue, #11104:
URL: https://github.com/apache/cloudstack/issues/11104

   ### problem
   
   ## Issue Details
   
   After switching over to main to test the CKS improvements, I ran into an 
issue where my compute nodes lose their local storage pools in libvirt. The 
pools still exist in CloudStack, but they get undefined in libvirt. This can be 
temporarily fixed by manually defining the pools with the same UUID via `virsh 
pool-define`, but they get removed again afterwards.
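
For reference, here is a minimal sketch of how the pool XML for that manual workaround could be generated. It assumes the local pool is a libvirt `dir` pool and that the pool name is the same as its UUID; both are assumptions, and `build_dir_pool_xml` is a hypothetical helper, not part of CloudStack.

```python
import xml.etree.ElementTree as ET


def build_dir_pool_xml(pool_uuid: str, path: str) -> str:
    """Return libvirt storage pool XML for a directory-backed pool."""
    pool = ET.Element("pool", {"type": "dir"})
    # Assumption: the pool is named after its UUID.
    ET.SubElement(pool, "name").text = pool_uuid
    ET.SubElement(pool, "uuid").text = pool_uuid
    target = ET.SubElement(pool, "target")
    ET.SubElement(target, "path").text = path
    return ET.tostring(pool, encoding="unicode")


if __name__ == "__main__":
    # UUID and path taken from the logs in this report.
    print(build_dir_pool_xml(
        "27ea62dc-ded2-4a25-9115-277c981f33fa",
        "/var/lib/libvirt/images",
    ))
```

The output can be saved to a file and passed to `virsh pool-define`, followed by `virsh pool-start` for the same pool.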
   
   Both of the compute nodes have `local.storage.uuid` defined in their 
agent.properties file. Once this issue starts to happen, restarting the agent 
does not cause the pool to come back. However, if you change the UUID in 
`agent.properties`, a new pool gets created and will start without issue.
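
To check which UUID a host is currently configured with, something like the sketch below could be used. It is a simplified reader for `key=value` lines only (no `:` separators or line continuations), and `read_property` is a hypothetical helper; the sample text is illustrative, not a copy of my actual file.

```python
def read_property(properties_text: str, key: str):
    """Return the value of `key` from simple key=value properties text, or None."""
    found = None
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", "!")):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            # Keep the last occurrence, as Java properties loading does.
            found = value.strip()
    return found


if __name__ == "__main__":
    sample = (
        "# illustrative agent.properties fragment\n"
        "local.storage.uuid=27ea62dc-ded2-4a25-9115-277c981f33fa\n"
    )
    print(read_property(sample, "local.storage.uuid"))
```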
   
   The effect is that after the pool is removed, existing instances continue 
to run, but new instances cannot be created until the pool is manually 
recreated. 
   
   Creating a new pool with a different UUID works for a time, but eventually 
that pool also gets removed. I'm not exactly sure what is triggering the removal.
   
   ## Log Snippets
   
   Both management server and agent encounter errors trying to find the pool:
   ```
   2025-06-29 07:36:58,138 DEBUG [c.c.a.m.ClusteredAgentManagerImpl] 
(StatsCollector-6:[ctx-8a6c9ea0]) (logid:58ac492b) Details from executing class 
com.cloud.agent.api.GetVolumeStatsCommand: Can't get vm disk stats: Could not 
fetch storage pool 27ea62dc-ded2-4a25-9115-277c981f33fa from libvirt due to 
org.libvirt.LibvirtException: Storage pool not found: no storage pool with 
matching uuid '27ea62dc-ded2-4a25-9115-277c981f33fa'
   2025-06-29 07:36:58,096 DEBUG [cloud.agent.Agent] 
(AgentRequest-Handler-2:[]) (logid:58ac492b) Processing command: 
com.cloud.agent.api.GetVolumeStatsCommand
   2025-06-29 07:36:58,096 DEBUG [kvm.resource.LibvirtConnection] 
(AgentRequest-Handler-2:[]) (logid:58ac492b) Looking for libvirtd connection 
at: qemu:///system
   2025-06-29 07:36:58,096 INFO  [kvm.storage.LibvirtStorageAdaptor] 
(AgentRequest-Handler-2:[]) (logid:58ac492b) Trying to fetch storage pool 
27ea62dc-ded2-4a25-9115-277c981f33fa from libvirt
   2025-06-29 07:36:58,096 DEBUG [kvm.resource.LibvirtConnection] 
(AgentRequest-Handler-2:[]) (logid:58ac492b) Looking for libvirtd connection 
at: qemu:///system
   2025-06-29 07:36:58,097 DEBUG [kvm.storage.LibvirtStorageAdaptor] 
(AgentRequest-Handler-2:[]) (logid:58ac492b) Could not find storage pool 
27ea62dc-ded2-4a25-9115-277c981f33fa in libvirt
   ```
   
   The pool is being removed by the management server:
   ```
   2025-06-29 03:36:47,496 DEBUG [c.c.s.StorageManagerImpl] 
(AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Removing pool 
StoragePool 
{"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"}
 from host Host 
{"id":2,"name":"cs-compute2.cloud.redacted.net","type":"Routing","uuid":"dfd0102e-f60c-44b4-8a12-70bc46fc8154"}
   2025-06-29 03:36:47,511 DEBUG [c.c.a.t.Request] 
(AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Seq 
2-3926857400090361861: Sending  { Cmd , MgmtId: 90520738888109, via: 
2(cs-compute2.cloud.redacted.net), Ver: v1, Flags: 100011, 
[{"com.cloud.agent.api.DeleteStoragePoolCommand":{"_pool":{"id":"13","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d","host":"10.25.0.3","path":"/var/lib/libvirt/images","port":"0","type":"Filesystem"},"_localPath":"/mnt//492af6d4-b5d1-3dc8-8a50-040101d28230","_removeDatastore":"false","wait":"0","bypassHostMaintenance":"false"}}]
 }
   2025-06-29 03:36:47,563 DEBUG [c.c.a.t.Request] 
(AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Seq 
2-3926857400090361861: Received:  { Ans: , MgmtId: 90520738888109, via: 
2(cs-compute2.cloud.redacted.net), Ver: v1, Flags: 10, { Answer } }
   2025-06-29 03:36:47,576 INFO  [o.a.c.s.d.p.DefaultHostListener] 
(AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Connection removed 
between storage pool: StoragePool 
{"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"}
 and host: Host 
{"id":2,"name":"cs-compute2.cloud.redacted.net","type":"Routing","uuid":"dfd0102e-f60c-44b4-8a12-70bc46fc8154"}
   2025-06-29 03:36:47,606 DEBUG [c.c.s.StorageManagerImpl] 
(AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Found storage pool 
StoragePool 
{"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"}
 of type Filesystem with overprovisioning factor 2
   2025-06-29 03:36:47,608 DEBUG [c.c.s.StorageManagerImpl] 
(AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Total over provisioned 
capacity of the pool StoragePool 
{"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"}
 is (306.79 GB) 329410707456
   2025-06-29 03:36:47,629 DEBUG [c.c.s.StorageManagerImpl] 
(AgentConnectTaskPool-4:[ctx-d22dee82]) (logid:5fbc7c2a) Successfully set 
Capacity - (306.79 GB) 329410707456 for capacity type - 9 , DataCenterId - 1, 
Pool - StoragePool 
{"id":13,"name":"cs-compute2.cloud.redacted.net-local-d997caf8","poolType":"Filesystem","uuid":"d997caf8-e98c-498c-aac2-3016f6ae2f5d"},
 PodId 1
   ```
   
   Agent's logs showing pool removal:
   ```
   2025-06-29 03:36:47,233 DEBUG [cloud.agent.Agent] 
(AgentRequest-Handler-1:[]) (logid:4c5c4abb) Request:Seq 1-4725683383995203589: 
 { Cmd , MgmtId: 90520738888109, via: 1, Ver: v1, Flags: 100011, 
[{"com.cloud.agent.api.DeleteStoragePoolCommand":{"_pool":{"id":"12","uuid":"27ea62dc-ded2-4a25-9115-277c981f33fa","host":"10.25.0.2","path":"/var/lib/libvirt/images","port":"0","type":"Filesystem"},"_localPath":"/mnt//36cfe92b-5de4-3baf-9813-796fbddeb8af","_removeDatastore":"false","wait":"0","bypassHostMaintenance":"false"}}]
 }
   2025-06-29 03:36:47,233 DEBUG [cloud.agent.Agent] 
(AgentRequest-Handler-1:[]) (logid:4c5c4abb) Processing command: 
com.cloud.agent.api.DeleteStoragePoolCommand
   2025-06-29 03:36:47,233 INFO  [kvm.storage.LibvirtStorageAdaptor] 
(AgentRequest-Handler-1:[]) (logid:4c5c4abb) Attempting to remove storage pool 
27ea62dc-ded2-4a25-9115-277c981f33fa from libvirt
   2025-06-29 03:36:47,234 DEBUG [kvm.resource.LibvirtConnection] 
(AgentRequest-Handler-1:[]) (logid:4c5c4abb) Looking for libvirtd connection 
at: qemu:///system
   2025-06-29 03:36:47,235 INFO  [kvm.storage.LibvirtStorageAdaptor] 
(AgentRequest-Handler-1:[]) (logid:4c5c4abb) Storage pool 
27ea62dc-ded2-4a25-9115-277c981f33fa has no corresponding secret. Not removing 
any secret.
   2025-06-29 03:36:47,236 INFO  [kvm.storage.LibvirtStorageAdaptor] 
(AgentRequest-Handler-1:[]) (logid:4c5c4abb) Storage pool 
27ea62dc-ded2-4a25-9115-277c981f33fa was successfully removed from libvirt.
   ```
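
To find every removal in a longer log, a small scan like the following could help. It is only a sketch: the regular expressions are written against the line format shown in the snippets above, and `find_pool_deletions` is a hypothetical helper. It remembers the most recent timestamp so it also works when a log entry is wrapped across several lines.

```python
import re

# Matches the serialized command as it appears in the snippets above.
DELETE_POOL_RE = re.compile(
    r'"com\.cloud\.agent\.api\.DeleteStoragePoolCommand":'
    r'\{"_pool":\{"id":"(\d+)","uuid":"([0-9a-f-]{36})"'
)
TIMESTAMP_RE = re.compile(r"^\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ")


def find_pool_deletions(log_lines):
    """Return a list of (timestamp, pool_id, pool_uuid) for each delete command."""
    last_timestamp = None
    results = []
    for line in log_lines:
        ts = TIMESTAMP_RE.match(line)
        if ts:
            last_timestamp = ts.group(1)
        for cmd in DELETE_POOL_RE.finditer(line):
            results.append((last_timestamp, cmd.group(1), cmd.group(2)))
    return results


if __name__ == "__main__":
    sample = [
        "2025-06-29 03:36:47,233 DEBUG [cloud.agent.Agent] ",
        "(AgentRequest-Handler-1:[]) (logid:4c5c4abb) Request:Seq 1-4725683383995203589: ",
        '[{"com.cloud.agent.api.DeleteStoragePoolCommand":{"_pool":{"id":"12",'
        '"uuid":"27ea62dc-ded2-4a25-9115-277c981f33fa","host":"10.25.0.2"}}}]',
    ]
    for entry in find_pool_deletions(sample):
        print(entry)
```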
   
   The pools still show as up in the UI:
   
![Image](https://github.com/user-attachments/assets/6779f21a-e4c7-424a-9ad6-76d3acf1cb00)
   
   And in the DB: (as you can see, I attempted to replace the pools with new 
ones to see what would happen)
   
|id|name|uuid|pool_type|port|data_center_id|pod_id|cluster_id|used_bytes|capacity_bytes|host_address|user_info|path|created|removed|update_time|status|storage_provider_name|scope|hypervisor|managed|capacity_iops|parent|used_iops|
|--|----|----|---------|----|--------------|------|----------|----------|--------------|------------|---------|----|-------|-------|-----------|------|---------------------|-----|----------|-------|-------------|------|---------|
|1|cs-compute1.cloud.redacted.net-local-0322a17b||Filesystem|0|1|1|1|424638771200|913383583744|10.25.0.2||/var/lib/libvirt/images|2024-09-17 12:14:34|2025-06-28 12:44:14|2025-06-27 12:40:43|Up|DefaultPrimary|HOST|KVM|0||0|120|
|2|cs-compute2.cloud.redacted.net-local-7aa8e196||Filesystem|0|1|1|1|114410532864|164705353728|10.25.0.3||/var/lib/libvirt/images|2024-09-17 12:17:05|2025-06-28 12:43:04|2025-06-28 09:09:47|Up|DefaultPrimary|HOST|KVM|0||0|60|
|4|nfs|10805e0a-2205-3158-993f-9350b08b9137|NetworkFilesystem|2049|1|1|1|36540069576704|61401196134400|nas.redacted.net||/volume1/cs-primary|2024-09-19 16:55:58||2025-06-29 06:34:38|Up|DefaultPrimary|CLUSTER|KVM|0||0||
|5|nfs-ssd|80a947f2-ab02-3d8c-b1c0-7ba964ae0e80|NetworkFilesystem|2049|1|1|1|358665551872|1418370088960|nas.redacted.net||/volume2/cs-primary-ssd|2024-09-28 10:31:47||2025-06-29 06:34:38|Up|DefaultPrimary|CLUSTER|KVM|0||0||
|12|cs-compute1.cloud.redacted.net-local-27ea62dc|27ea62dc-ded2-4a25-9115-277c981f33fa|Filesystem|0|1|1|1|424638771200|913383583744|10.25.0.2||/var/lib/libvirt/images|2025-06-28 13:15:34||2025-06-29 00:38:22|Up|DefaultPrimary|HOST|KVM|0||0|29|
|13|cs-compute2.cloud.redacted.net-local-d997caf8|d997caf8-e98c-498c-aac2-3016f6ae2f5d|Filesystem|0|1|1|1|114449989632|164705353728|10.25.0.3||/var/lib/libvirt/images|2025-06-28 13:21:47||2025-06-29 00:49:49|Up|DefaultPrimary|HOST|KVM|0||0|60|
   
   The removed column is set to NULL for the two current local storage pools 
even **after** they have been removed from the hosts.
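
The mismatch can be expressed as a simple check: any host-scoped pool whose `removed` column is NULL but whose UUID libvirt no longer knows about is in the broken state described here. The sketch below works on plain Python data, so how the rows and the libvirt UUIDs are collected (for example from a DB export and `virsh pool-list --all`) is left open; `pools_missing_from_libvirt` is a hypothetical helper.

```python
def pools_missing_from_libvirt(db_rows, libvirt_uuids):
    """Return UUIDs of active HOST-scoped pools that libvirt does not know."""
    known = set(libvirt_uuids)
    return [
        row["uuid"]
        for row in db_rows
        if row["removed"] is None       # still active according to the DB
        and row["scope"] == "HOST"      # local storage only
        and row["uuid"] not in known    # but gone from libvirt
    ]


if __name__ == "__main__":
    # Subset of the columns from the table above.
    rows = [
        {"id": 1, "uuid": "", "scope": "HOST", "removed": "2025-06-28 12:44:14"},
        {"id": 4, "uuid": "10805e0a-2205-3158-993f-9350b08b9137",
         "scope": "CLUSTER", "removed": None},
        {"id": 12, "uuid": "27ea62dc-ded2-4a25-9115-277c981f33fa",
         "scope": "HOST", "removed": None},
        {"id": 13, "uuid": "d997caf8-e98c-498c-aac2-3016f6ae2f5d",
         "scope": "HOST", "removed": None},
    ]
    # An empty list stands for a host where the local pool has been undefined.
    print(pools_missing_from_libvirt(rows, []))
```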
   
   ## Tracing
   I attempted to trace this issue (see trace logs attached), but I am still 
unable to find what triggers the pool removal call in the first place. 
   
   ## Log Files
   
   
   ### versions
   
   I'm using the latest nightly build as of yesterday; however, this issue has 
been present ever since I switched over to nightly. 
   
   Current build: 4.21.0.0-SNAPSHOT.20250628
   OS: RHEL 8.10
   
   The cluster is made up of one management node and two KVM compute nodes, all 
running the same build. 
   
   
   ### The steps to reproduce the bug
   
   1.
   2.
   3.
   ...
   
   
   ### What to do about it?
   
   _No response_

