GitHub user rishabhjain1997 added a comment to the discussion: Provisioning the
control VM failed in the Kubernetes cluster
Hi @anirudh09041 and @Pearl1594. I'm a teammate of @anirudh09041; here is
what I've found so far. @Pearl1594, could you please help us with a solution
based on the RCA below?
1. The actual error seems to be the async job error:
```
$ grep "Complete async job-774"
/var/log/cloudstack/management/management-server.log
2026-05-05 06:50:56,274 DEBUG ... Complete async job-774, jobStatus: FAILED,
resultCode: 530,
result: ... "errortext":"Failed to setup Kubernetes cluster : test-cluster-23
is not in
usable state as the system is unable to access control node VMs of the
cluster"
```
KubernetesClusterStartWorker logs show a similar error:
```
2026-05-05 06:50:56,035 ERROR [c.c.k.c.a.KubernetesClusterStartWorker]
Failed to setup Kubernetes cluster : test-cluster-23 is not in usable state
as the system is unable to access control node VMs of the cluster
```
2. So the real failure seems to be: CKS finishes provisioning the VMs and the FW/PF/LB
rules, then calls KubernetesClusterUtil.isKubernetesClusterServerRunning(), which fails
because the API server never came up.
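For reference, a rough manual equivalent of that readiness check can be run against the
cluster endpoint (our assumption is that it probes the Kubernetes API exposed on the
cluster's public IP; the IP below is a placeholder and 6443 is the assumed port):
```
# placeholder public IP; 6443 assumed as the exposed kube-apiserver port
curl -k --max-time 10 https://203.0.113.10:6443/version
```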
The kube-apiserver never came up because each VM was stuck in "Starting" for 50 minutes:
```
$ grep -E "i-2-120-VM.*StartCommand|i-2-120-VM.*Start completed" \
/var/log/cloudstack/management/management-server.log
05:10:40,293 ... Sending { Cmd ... StartCommand ... id=120 ...  <- StartCommand sent
06:00:43,090 ... Start completed for VM ... i-2-120-VM          <- StartAnswer arrived
```
3. So cluster provisioning sat for ~100 min (50 min × 2 VMs, sequentially) just waiting
for the VMs to be marked Started. The 50-minute stall is the KVM agent repeatedly
retrying patch.sh:
```
$ grep "patch.sh -n i-2-120-VM" /var/log/cloudstack/agent/agent.log
05:15:43 WARN ... Process [2551078] for command
  [/usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/patch.sh -n i-2-120-VM
   -c template=domP name=test-cluster-23-control ...] timed out. Output is [].
05:20:43 WARN ... Process [2567946] ... timed out. Output is [].
05:25:43 WARN ... Process [2588320] ... timed out. Output is [].
05:30:43 WARN ... Process [2605401] ... timed out. Output is [].
05:35:43 WARN ... Process [2621993] ... timed out. Output is [].
05:40:43 WARN ... Process [2638725] ... timed out. Output is [].
05:45:43 WARN ... Process [2655334] ... timed out. Output is [].
05:50:43 WARN ... Process [2672092] ... timed out. Output is [].
05:55:43 WARN ... Process [2688808] ... timed out. Output is [].
06:00:43 WARN ... Process [2705703] ... timed out. Output is [].
```
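A quick way to measure the stall window per VM from the same log (assuming the standard
date/time prefix on each agent.log line; this prints the first and last timeout):
```
# first and last patch.sh timeout for this VM = length of the stall window
grep "patch.sh -n i-2-120-VM" /var/log/cloudstack/agent/agent.log \
  | grep "timed out" | awk '{print $1, $2}' | sed -n '1p;$p'
```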
4. patch.sh times out because qemu-guest-agent isn't responding in the guest:
```
[root@cloudstack ~]# virsh qemu-agent-command i-2-120-VM '{"execute":"guest-ping"}' --timeout 5
error: Guest agent is not responding: QEMU guest agent is not connected
```
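From inside the guest (via the console) the agent can be checked directly as well; a few
standard commands, assuming an Ubuntu/systemd guest:
```
dpkg -l qemu-guest-agent            # is the package installed at all?
systemctl status qemu-guest-agent   # is the unit present, enabled and running?
ls -l /dev/virtio-ports/            # is the org.qemu.guest_agent.0 channel exposed to the guest?
```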
5. Here is the custom template that we're using:
```
mysql> SELECT name, format, url FROM vm_template WHERE id = 217;
+------------------+--------+-------------------------------------------+
| name             | format | url                                       |
+------------------+--------+-------------------------------------------+
| Ubuntu-CKS-Fresh | QCOW2  | http://10.96.32.32/ubuntu-cks-fresh.qcow2 |
+------------------+--------+-------------------------------------------+
```
That template doesn't seem to have qemu-guest-agent installed (or running at
boot), so patch.sh has nothing to talk to — hence the timeout in the previous
block.
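If it helps, the image can also be inspected offline without booting it; a sketch using
libguestfs-tools, assuming the qcow2 from the URL above has been downloaded locally:
```
# look for the agent binary and an enabled systemd unit inside the template image
virt-ls -a ubuntu-cks-fresh.qcow2 /usr/sbin | grep -i qemu-ga
virt-ls -a ubuntu-cks-fresh.qcow2 /etc/systemd/system/multi-user.target.wants | grep -i qemu-guest-agent
```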
6. patch.sh hangs while trying to inject /var/lib/cloud/data/cmdline, so that metadata
never reaches the guest. And even after patch.sh finally gives up, cloud-init inside the
guest can't make progress: it sits in its own retry loop waiting for the binaries ISO at
/mnt/k8sdisk/ to appear. The retry length is configured globally as 100 × 15s = 25 min:
```
mysql> SELECT name, value FROM configuration
        WHERE name IN ('cloud.kubernetes.control.node.install.attempt.wait.duration',
                       'cloud.kubernetes.control.node.install.reattempt.count');
+--------------------------------------------------------------+-------+
| cloud.kubernetes.control.node.install.attempt.wait.duration  | 15    |
| cloud.kubernetes.control.node.install.reattempt.count        | 100   |
+--------------------------------------------------------------+-------+
```
CKS only actually attaches the binaries ISO at the very end of the workflow (06:50:55
below); by then cloud-init has been dead for over an hour:
```
$ grep -E "Attached binaries ISO|Detached Kubernetes
binaries|isKubernetesClusterServerRunning" \
/var/log/cloudstack/management/management-server.log
06:50:55,955 INFO ... Attached binaries ISO for VM: ... i-2-120-VM
06:50:56,034 INFO ... Attached binaries ISO for VM: ... i-2-122-VM
06:50:56,035 ERROR ... Failed to setup Kubernetes cluster ...
06:50:56,145 INFO ... Detached Kubernetes binaries from VM: ... i-2-120-VM
06:50:56,271 INFO ... Detached Kubernetes binaries from VM: ... i-2-122-VM
```
And each ISO attach is immediately followed by a detach about a second later.
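To confirm the guest side of the cloud-init stall described above, these generic checks
can be run (as root) inside a stuck control VM; the /mnt/k8sdisk path is taken from the
CKS setup, everything else is standard tooling:
```
cloud-init status --long                        # is cloud-init still running, done, or errored?
ls /mnt/k8sdisk/ 2>/dev/null || echo "binaries ISO not mounted"
blkid | grep -i iso9660                         # is any ISO-backed device visible to the guest at all?
```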
Summary:
**qemu-guest-agent is missing or disabled in the Ubuntu-CKS-Fresh template, so
the host's patch.sh has nothing to talk to and times out for 50 minutes per VM
(~100 min total for both). Without patch.sh, the cmdline metadata is never
injected, the cks-init module inside cloud-init never runs, and kubeadm-init
never executes. When CKS eventually calls isKubernetesClusterServerRunning(),
the API server isn't up, the readiness check fails, and the cluster goes to
Error.**
@Pearl1594 please let us know how to remediate this issue. Any help would be
greatly appreciated!
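In case it's useful, this is the template-side fix we're considering (a sketch, assuming
rebuilding the image with libguestfs-tools and re-registering the template is acceptable);
we'd still like to know whether that's the recommended approach or whether there's a
CKS-side setting we're missing:
```
# bake qemu-guest-agent into the template image and make sure it starts at boot
virt-customize -a ubuntu-cks-fresh.qcow2 \
  --install qemu-guest-agent \
  --run-command 'systemctl enable qemu-guest-agent'
```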
GitHub link:
https://github.com/apache/cloudstack/discussions/13056#discussioncomment-16832768