[I] Redundant VPC routers stuck in BACKUP; cannot add default route, interface remains down [cloudstack]

via GitHub Mon, 27 Jan 2025 04:59:22 -0800


Rid opened a new issue, #10281:
URL: https://github.com/apache/cloudstack/issues/10281


   ### problem
   
   We have a new CloudStack 4.20.0.0 environment with redundant VPC routers, 
but they never transition to MASTER. Instead:
   
   1. Each VR tries to bring up the public interface (eth1) and add a default 
route, e.g.:
   
   ```
   ip route add default via x.x.x.x table Table_eth1 proto static
   ```
   
   …but this fails with exit code 2 (“Nexthop has invalid gateway”). We believe 
it fails because eth1 is still in the DOWN state when the command runs.
   
   2. The VR script then tears eth1 down, inserts a “throw x.x.x.0/27” route in 
Table_eth1, and marks the router as BACKUP or FAULT.
   
   3. Keepalived never starts because the script believes routing is broken. 
Thus no VRRP negotiation occurs, and no router becomes MASTER.
   
   We can manually bring eth1 up (`ip link set eth1 up`) and add a default route 
to the main or custom table, and it works fine. However, CloudStack’s scripts 
immediately revert the interface to DOWN again and keep the router in BACKUP.
   
   **Key details:**
   
   - VR logs show repeated attempts to configure the default route via x.x.x.x 
inside Table_eth1, followed by `throw x.x.x.0/27`.
   - Even if we remove the throw route, the script tries to add a route while 
eth1 is still down, fails, and resets to BACKUP.
   - Because of this cycle, we never see /etc/keepalived/keepalived.conf 
generated or keepalived started.
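   
   The state described above can be inspected on a VR with a small helper like 
the one below. This is a sketch only: `link_state`, `report`, and the 
`SYSFS_NET` override are hypothetical names introduced here, and the 
`keepalived` unit name is an assumption about the system VM.
   
   ```shell
   # Root of the kernel's per-interface sysfs tree; overridable for dry runs.
   SYSFS_NET="${SYSFS_NET:-/sys/class/net}"
   
   # Print the kernel's operational state for a device (up, down, unknown, ...),
   # or "missing" if the device has no sysfs entry.
   link_state() {
       cat "$SYSFS_NET/$1/operstate" 2>/dev/null || echo "missing"
   }
   
   # Print link state, the policy table, the rule list, and keepalived status.
   report() {
       dev="$1"; table="$2"
       echo "$dev operstate: $(link_state "$dev")"
       echo "--- routes in $table"
       ip route show table "$table" 2>&1
       echo "--- policy rules"
       ip rule show 2>&1
       echo "--- keepalived (unit name is an assumption)"
       systemctl is-active keepalived 2>&1
   }
   ```
   
   Running `report eth1 Table_eth1` before and after a manual 
`ip link set eth1 up` should show whether the throw route and the DOWN state 
return on the next configuration pass.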
   
   ### versions
   
   Apache CloudStack: 4.20.0.0
   System VM template: Debian GNU/Linux 12
   Hypervisor: KVM
   Networking: Advanced networking with VLAN trunking, rp_filter disabled
   
   We modified the systemvm template to add a static route which our setup 
needs. We added `/etc/network/if-up.d/91-add-route`:
   
   ```
   #!/bin/sh
   #
   # /etc/network/if-up.d/91-add-route
   #
   # This script is automatically invoked by ifup each time
   # an interface is brought up. The environment variable $IFACE
   # contains the interface name (e.g., eth0, ens3, etc.).
   
   [ "$IFACE" = "lo" ] && exit 0
   
   # Gather *all* IPv4 addresses (CIDR format) on this interface
   IP_CIDR_LIST=$(ip -o -4 addr show dev "$IFACE" | awk '{print $4}')
   [ -z "$IP_CIDR_LIST" ] && exit 0  # no IPv4 addresses on $IFACE, so exit
   
   # Loop through each IPv4 address on this interface
   for IP_CIDR in $IP_CIDR_LIST
   do
     # Extract the actual IP address (without /mask)
     IP_ADDR=$(echo "$IP_CIDR" | cut -d '/' -f 1)
   
     # Check if IP is in x.x.x.x/27
     if echo "$IP_ADDR" | grep -Eq '^-redacted-$'; then
       echo "Interface $IFACE has IP $IP_ADDR in x.x.x.x/27; adding route..."
         ip route add x.x.x.x/27 dev "$IFACE" scope link src "$IP_ADDR" 2>/dev/null || true
   
       # Once we've added the route for the first matching IP, we're done.
       exit 0
     fi
   done
   
   exit 0
   ```
   
   We do not believe this is related to the issue.
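   
   For reference, the regex match in the hook above could also be written as 
an arithmetic subnet test, which avoids hand-building a pattern for the /27. 
This is a sketch, not what the modified template uses: `ip_to_int` and 
`in_subnet` are hypothetical helpers, and the addresses in the usage note come 
from the 192.0.2.0/24 documentation range rather than from this environment.
   
   ```shell
   # Convert a dotted-quad IPv4 address to a 32-bit integer.
   ip_to_int() {
       old_ifs=$IFS; IFS=.
       set -- $1            # split the address into four octets
       IFS=$old_ifs
       echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
   }
   
   # Return 0 if address $1 lies inside network $2 with prefix length $3.
   in_subnet() {
       addr=$(ip_to_int "$1"); net=$(ip_to_int "$2"); prefix=$3
       mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
       [ $(( addr & mask )) -eq $(( net & mask )) ]
   }
   ```
   
   With these helpers, `in_subnet 192.0.2.5 192.0.2.0 27` succeeds and 
`in_subnet 192.0.2.40 192.0.2.0 27` fails, because 192.0.2.40 falls in the 
next /27 block.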
   
   ### The steps to reproduce the bug
   
   1. Install or upgrade to CloudStack 4.20.0.0 with advanced networking.
   2. Create a VPC offering that uses redundant VR.
   3. Deploy a VPC that picks up two VRs.
   4. Observe in /var/log/cloud.log (and the VR’s cloud.log) that each router 
fails to add its default route via x.x.x.x, then tears down eth1 and remains 
BACKUP/FAULT indefinitely.
   
   ### What to do about it?
   
   Ideally, the VR script should:
   
   1. Ensure eth1 is brought up before adding the default route in the policy 
routing table (Table_eth1).
   2. Avoid placing a “throw” route for x.x.x.0/27 on the router that’s 
intended to be MASTER.
   3. Generate and start keepalived once the router is designated MASTER (or 
“PRIMARY” per the cmdline), so it can finalize the interface config instead of 
reverting to BACKUP.
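   
   The ordering asked for in point 1 can be sketched as follows. This is an 
illustration of the sequence only, not CloudStack's actual VR code: 
`add_default_route` and the `IP_CMD` override are hypothetical, and naming the 
device explicitly with `dev` is an addition that the failing command above 
does not have.
   
   ```shell
   # Command used to talk to the kernel; set IP_CMD=echo to dry-run the sequence.
   IP_CMD="${IP_CMD:-ip}"
   
   # Bring the link up first, then add the default route to the policy table.
   add_default_route() {
       dev="$1"; gateway="$2"; table="$3"
   
       # 1. Bring the link up so the gateway can be resolved via this device.
       $IP_CMD link set dev "$dev" up || return 1
   
       # 2. Only then add the default route in the per-interface table.
       $IP_CMD route add default via "$gateway" dev "$dev" table "$table" proto static
   }
   ```
   
   With `IP_CMD=echo`, calling `add_default_route eth1 192.0.2.1 Table_eth1` 
prints the two commands in order without touching the routing tables, which 
makes it easy to compare against what the VR logs show.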
   
   If you need more logs or specifics, we can provide full VR logs and examples 
of the failing ip route commands. Let us know if you have any questions or 
potential workarounds—thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cloudstack.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
