hodie-aurora opened a new issue, #11431:
URL: https://github.com/apache/cloudstack/issues/11431

   ### problem
   
   Issue Type: Bug Report
   
   CloudStack Version:
   4.20.x (Based on Virtual Router version 4.20.0)
   
   Hypervisor:
   KVM
   
   Kubernetes Template:
   setup-v1.33.1-calico-x86_64.iso
   
   Summary:
   When creating a high-availability (HA) Kubernetes cluster using the 
CloudStack Kubernetes Service (CKS) in a VPC, the cluster initialization 
process gets stuck and eventually times out. The root cause appears to be that 
only one of the automatically created port forwarding rules for the control 
plane nodes is functional, preventing the management server and nodes from 
communicating with each other.
   
   Steps to Reproduce:
   
   1. Create a new VPC with a single network tier (e.g., 10.1.0.0/24).
   2. Navigate to Compute -> Kubernetes and start creating a new HA cluster.
   3. Select the setup-v1.33.1-calico-x86_64.iso template.
   4. Configure it with 3 control plane nodes and 1 worker node.
   5. Select the VPC network created in step 1.
   6. Crucially, leave the "Load Balancer IP" field empty, allowing CloudStack to automatically acquire a public IP and create the necessary rules.
   7. Launch the cluster.
   Expected Results:
   The cluster VMs are created, all necessary port forwarding rules (for SSH on 
port 22 and the K8s API on port 6443) are created on the VPC's virtual router, 
and the cluster successfully initializes, reaching a "Running" state. All nodes 
should be accessible via their respective forwarded ports.
   
   Actual Results:
   
   - The cluster VMs are created successfully.
   - All port forwarding rules are listed correctly in the CloudStack UI.
   - The cluster state remains "Starting" for a long time, with management server logs repeatedly showing: `Waiting for Kubernetes cluster ... control node VMs to be accessible`.
   - After a timeout, the cluster enters an "Error" state.
   - Network tests (`telnet <public_ip> <forwarded_port>`) reveal that only one of the SSH port forwarding rules works; the others fail with a `No route to host` error. The single working rule is not always for the same node across different creation attempts.
   Diagnostics Performed:
   
   Port Forwarding Test:
   
   ```
   # telnet 192.168.10.225 2224  --> Connected successfully
   # telnet 192.168.10.225 2222  --> telnet: Unable to connect to remote host: No route to host
   # telnet 192.168.10.225 2223  --> telnet: Unable to connect to remote host: No route to host
   # telnet 192.168.10.225 2225  --> telnet: Unable to connect to remote host: No route to host
   ```
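The per-port telnet checks above can be run in a single pass. A minimal sketch using bash's built-in `/dev/tcp` (so it needs neither telnet nor nc), with the public IP and port list from this report as overridable defaults:

```shell
#!/usr/bin/env bash
# Probe each forwarded port once with a 2-second timeout.
# 192.168.10.225 and the port list are the values from this report;
# override them via the first and second positional arguments.
HOST="${1:-192.168.10.225}"
PORTS="${2:-2222 2223 2224 2225 6443}"

for port in $PORTS; do
    if timeout 2 bash -c "echo > /dev/tcp/$HOST/$port" 2>/dev/null; then
        echo "$HOST:$port open"
    else
        echo "$HOST:$port closed/filtered"
    fi
done
```

A port the firewall rejects shows up as "closed/filtered" here, matching the `No route to host` seen with telnet.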
   Virtual Router iptables Inspection:
   I logged into the VPC's virtual router (r-53-VM) and confirmed that all DNAT rules are present and correct, which indicates the issue is not with the virtual router's configuration.
   
   ```
   # iptables-save | grep DNAT
   -A PREROUTING -d 192.168.10.225/32 -p tcp -m tcp --dport 6443 -j DNAT --to-destination 10.1.0.209:6443
   -A PREROUTING -d 192.168.10.225/32 -p tcp -m tcp --dport 2222 -j DNAT --to-destination 10.1.0.209:22
   -A PREROUTING -d 192.168.10.225/32 -p tcp -m tcp --dport 2223 -j DNAT --to-destination 10.1.0.80:22
   -A PREROUTING -d 192.168.10.225/32 -p tcp -m tcp --dport 2224 -j DNAT --to-destination 10.1.0.254:22
   -A PREROUTING -d 192.168.10.225/32 -p tcp -m tcp --dport 2225 -j DNAT --to-destination 10.1.0.72:22
   ... (and corresponding OUTPUT chain rules)
   ```
   Hypothesis:
   Since the virtual router is correctly forwarding traffic, the `No route to host` error strongly suggests that the packets are being rejected by the destination K8s node VMs themselves. The most likely cause is a default-on firewall (such as firewalld or ufw) within the setup-v1.33.1-calico-x86_64.iso template. This firewall blocks incoming SSH connections from the virtual router, preventing cluster setup.
   
   The fact that one node is sometimes accessible may be a timing effect: one node's firewall happens to be down during its initial setup phase before the rest of the cluster setup fails.
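The hypothesis can be checked from each node's noVNC console with a few best-effort commands (a sketch; which tools exist depends on the distro inside the CKS template). Notably, a client-side `No route to host` usually corresponds to an ICMP host-prohibited REJECT, which is exactly what firewalld's default zone emits:

```shell
#!/usr/bin/env bash
# Best-effort firewall checks to run on each node via the noVNC console.
# Which of these tools are present depends on the template's distro.

# firewalld (RHEL-family): "active" means it is filtering traffic.
systemctl is-active firewalld 2>/dev/null | sed 's/^/firewalld: /'

# ufw (Debian/Ubuntu): reports "Status: active" or "Status: inactive".
ufw status 2>/dev/null | head -1

# firewalld rejects unmatched traffic with icmp-host-prohibited, which a
# remote client reports as "No route to host"; count such rules
# ("|| true" because grep -c exits nonzero when the count is 0):
iptables-save 2>/dev/null | grep -c 'icmp-host-prohibited' || true
```

A nonzero count on the last check, together with an active firewalld, would confirm that the nodes themselves are rejecting the forwarded SSH connections.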
   
   Request:
   Could the development team please investigate whether the CKS templates ship with a firewall enabled by default? If so, this breaks the HA cluster creation process in a VPC, and the firewall should either be disabled or pre-configured to allow traffic from the VPC's private network range.
   
   
   
   ### versions
   
   CloudStack Version: ~4.20.1.0 (inferred from Virtual Router software version)
   Hypervisor: KVM
   Primary Storage: NFS (inferred from compute offering name 8cpu-16g-nfs)
   Network: CloudStack VPC with an Isolated Guest Network and Virtual Router.
   CKS Template: setup-v1.33.1-calico-x86_64.iso
   
   
   ### The steps to reproduce the bug
   
   1. Create a new VPC with a single network tier (e.g., 10.1.0.0/24).
   2. Navigate to Compute -> Kubernetes and start creating a new HA cluster.
   3. Select the setup-v1.33.1-calico-x86_64.iso template.
   4. Configure it with 3 control plane nodes and 1 worker node.
   5. Select the VPC network created in step 1.
   6. Leave the "Load Balancer IP" field empty to allow CloudStack to automatically acquire an IP.
   7. Launch the cluster and observe that it fails to initialize, with only one forwarded port being accessible.
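For a scripted reproduction, the steps above can be sketched with CloudMonkey (`cmk`) against the `createKubernetesCluster` API. This is hypothetical: every `*_UUID` value is a placeholder for your own zone/version/offering/network, and the parameter names should be checked against the 4.20 API documentation:

```shell
#!/usr/bin/env bash
# Hypothetical cmk reproduction of the UI steps; all *_UUID values are
# placeholders, and parameter names should be verified for your version.
if command -v cmk >/dev/null 2>&1; then
    cmk create kubernetescluster \
        name=ha-vpc-test \
        zoneid=ZONE_UUID \
        kubernetesversionid=K8S_VERSION_UUID \
        serviceofferingid=OFFERING_UUID \
        networkid=VPC_TIER_UUID \
        controlnodes=3 \
        size=1
    # externalloadbalanceripaddress is deliberately omitted so that
    # CloudStack acquires a public IP and creates the rules itself.
else
    echo "cmk not installed; commands shown for reference only"
fi
```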
   
   ### What to do about it?
   
   Proposed Long-Term Fix:
   The CKS template setup-v1.33.1-calico-x86_64.iso should be modified so that the firewall inside it is either disabled by default or, preferably, pre-configured with rules that allow all traffic necessary for cluster setup. At a minimum, it should allow inbound traffic from the VPC's private network CIDR (e.g., 10.1.0.0/24 in this case) on the required ports (SSH port 22 and the Kubernetes API port 6443) so that cluster initialization can complete without manual intervention.
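As a sketch of what that pre-configuration could look like inside the template (hypothetical; the firewalld zone, the tooling, and the CIDR all depend on the template's distro and must match the actual VPC tier):

```shell
#!/usr/bin/env bash
# Allow the VPC tier (10.1.0.0/24 in this report) and the K8s API port.
# Hypothetical sketch; adjust CIDR/zone to the actual deployment.
if command -v firewall-cmd >/dev/null 2>&1; then
    firewall-cmd --permanent \
        --add-rich-rule='rule family="ipv4" source address="10.1.0.0/24" accept'
    firewall-cmd --permanent --add-port=6443/tcp
    firewall-cmd --reload
elif command -v ufw >/dev/null 2>&1; then
    ufw allow from 10.1.0.0/24
    ufw allow 6443/tcp
else
    echo "no firewalld/ufw detected; nothing to configure"
fi
```

Shipping rules like these in the template would keep the firewall enabled while still letting the virtual router reach the nodes on ports 22 and 6443.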
   
   Immediate Workaround for Users:
   For users encountering this bug, a potential workaround is to use the noVNC console from the CloudStack UI to access each Kubernetes node VM. Once logged in, manually disable the firewall. For example:
   
   - On systemd-based systems: `systemctl stop firewalld && systemctl disable firewalld`
   - On Debian/Ubuntu systems: `ufw disable`
   
   After disabling the firewall on all nodes, the cluster setup process should be able to proceed.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cloudstack.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
