CMStevenKusters opened a new issue, #13079:
URL: https://github.com/apache/cloudstack/issues/13079

   ## `keystore-setup` fails to create `agent.properties` on fresh KVM hosts; 
KVM `addHost` cert flow then loops with "Failed to find keystore passphrase" 
and `setupAgentSecurity()` throws "Failed to setup certificate in the KVM 
agent's keystore file" (4.22)
   
   ### problem
   
   When `LibvirtServerDiscoverer.setupAgentSecurity()` invokes 
`scripts/util/keystore-setup` on a freshly-installed KVM agent (one whose 
`/etc/cloudstack/agent/` directory contains only the package-shipped 
`environment.properties` and `log4j-cloud.xml` - i.e. the natural state on a 
first-time `addHost`, or after an operator has wiped agent state to recover 
from a botched provisioning attempt), the script silently fails to create 
`agent.properties` and silently fails to persist `keystore.passphrase=...` to 
it. The downstream `keystore-cert-import` invocation then reads the 
missing/empty `agent.properties`, exits 1 with "Failed to find keystore 
passphrase from file: /etc/cloudstack/agent/agent.properties, quitting!", and 
after the 3-attempt retry loop in `setupAgentSecurity()` the management server 
throws `CloudRuntimeException("Failed to setup certificate in the KVM agent's 
keystore file, please see logs and configure manually!")`. The host never gets 
added.
   
   The defect is in `scripts/util/keystore-setup`. The passphrase-write block 
is gated on `[ -f "$PROPS_FILE" ]`:
   
   ```bash
   PROPS_FILE="$1"
   KS_FILE="$2"
   KS_PASS="$3"
   KS_VALIDITY="$4"
   CSR_FILE="$5"
   ...
   # Re-use existing password or use the one provided
   if [ -f "$PROPS_FILE" ]; then           # <- gates the entire write block
       OLD_PASS=$(sed -n '/^keystore.passphrase/p' "$PROPS_FILE" | sed 
's/^keystore.passphrase=//g')
       if [ -n "${OLD_PASS// }" ]; then
           KS_PASS="$OLD_PASS"
       else
           sed -i "/^keystore.passphrase.*/d" "$PROPS_FILE" 2>&1 | $LOGGER_CMD 
|| true
           echo "keystore.passphrase=$KS_PASS" >> "$PROPS_FILE"   # <- never 
reached if file missing
       fi
   fi
   ```
   
   The intent appears to have been "if a previous props file exists, reuse its 
passphrase; otherwise use the one provided as `$3`" - and on the otherwise 
path, write `$3` into a fresh file. But the *write* is also inside the gate, so 
on a fresh host the new passphrase is never persisted. The script does proceed 
(it has `KS_PASS` from arg `$3` so `keytool -genkey` and `keytool -certreq` 
work), so `cloud.jks` and `cloud.csr` come out fine. The trailing `chmod 600 
"$PROPS_FILE"` fails with `chmod: cannot access 
'/etc/cloudstack/agent/agent.properties': No such file or directory` to stderr, 
but that's swallowed via `|| $LOGGER_CMD ...` and the script's final exit 
status is the logger's (always 0). MS reads the CSR from stdout and proceeds.
   
   The next stage (`keystore-cert-import`, called with all 10 args correctly 
populated by `setupAgentSecurity()`) then reads passphrase from 
`agent.properties`:
   
   ```bash
   KS_PASS=$(sed -n '/keystore.passphrase/p' "$PROPS_FILE" 2>/dev/null  | sed 
's/keystore.passphrase=//g' 2>/dev/null)
   
   if [ -z "${KS_PASS// }" ]; then
       echo "Failed to find keystore passphrase from file: $PROPS_FILE, 
quitting!"
       exit 1
   fi
   ```
   
   (Note: arg `$2` `KS_PASS` is overwritten here on the KVM path - the arg `$2` 
is only honoured inside the `[ ! -f "$LIBVIRTD_FILE" ]` systemvm branch above. 
So the file is the only source of truth on KVM agents.)
   
   `KS_PASS` ends up empty, script exits 1.
   
   `setupAgentSecurity()` retries three times (each attempt deterministically 
the same since the bug is reproduced from clean state on every retry), then 
throws.
   
   ### evidence
   
   `bash -x` trace of `keystore-cert-import` on the agent during a failing 
`addHost`, captured by wrapping the upstream script with a tracing shim:
   
   ```
   + '[' '!' -f /etc/libvirt/libvirtd.conf ']'
   ++ sed -n /keystore.passphrase/p /etc/cloudstack/agent/agent.properties
   ++ sed s/keystore.passphrase=//g
   + KS_PASS=
   + '[' -z '' ']'
   + echo 'Failed to find keystore passphrase from file: 
/etc/cloudstack/agent/agent.properties, quitting!'
   + exit 1
   ```
   
   State of `/etc/cloudstack/agent/` immediately after `keystore-setup` 
returned (note the missing `agent.properties` despite the script having 
"succeeded"):
   
   ```
   total 24
   -rwxr-xr-x  1 root root   904 Nov  7 environment.properties     <- package
   -rwxr-xr-x  1 root root  2601 Nov  7 log4j-cloud.xml            <- package
   -rw-------  1 root root  2742 ...    cloud.jks                  <- created 
by keystore-setup
                                                                    <- 
agent.properties NOT created
                                                                    <- 
cloud.csr written + cat'd to stdout (consumed by MS)
   ```
   
   Argument-count verification on the agent side (instrumented script logging 
argc/argv on entry):
   
   ```
   keystore-setup       argc=5  argv=[/etc/cloudstack/agent/agent.properties 
/etc/cloudstack/agent/cloud.jks <16-char-random> 365 
/etc/cloudstack/agent/cloud.csr]
   keystore-cert-import argc=10 arg6_len=1696 arg8_len=1814 arg10_len=0
   ```
   
   So `setupAgentSecurity()` is sending all 5 / 10 args correctly - the 
dispatch side is fine. The defect is purely inside 
`scripts/util/keystore-setup`.
   
   `management-server.log` for the same `addHost` attempt - three back-to-back 
invocations of `keystore-cert-import`, all returning the same "Failed to find 
keystore passphrase" output, then the wrapper exception:
   
   ```
   ... DEBUG c.c.u.s.SSHCmdHelper Executing cmd: sudo 
/usr/share/cloudstack-common/scripts/util/keystore-cert-import ...
   ... DEBUG c.c.u.s.SSHCmdHelper SSH command output: Failed to find keystore 
passphrase from file: /etc/cloudstack/agent/agent.properties, quitting!
   ... [retry 2: same output]
   ... [retry 3: same output]
   ... WARN  c.c.h.k.d.KvmServerDiscoverer can't setup agent, due to 
com.cloud.utils.exception.CloudRuntimeException: Failed to setup certificate in 
the KVM agent's keystore file, please see logs and configure manually!
   ```
   
   ### versions
   
   - Apache CloudStack 4.22.0.0 (apache packaging, apt install)
   - Hypervisor: KVM
   - Agent host OS: Ubuntu Server 24.04 LTS, OpenJDK 17
   - Management server OS: Ubuntu Server 24.04 LTS, OpenJDK 17, CA framework 
`provider.plugin=root`
   - Agent dir before addHost: only `environment.properties` + 
`log4j-cloud.xml` (i.e. the natural state right after `apt install 
cloudstack-agent` on a fresh host that has not yet been registered)
   
   ### steps to reproduce
   
   1. Install `cloudstack-management` 4.22.0.0 on a clean host. Bootstrap CA 
(default - `ca.framework.provider.plugin=root` works fine for this; not the 
issue).
   2. Install `cloudstack-agent` 4.22.0.0 on a clean KVM host. Verify 
`/etc/cloudstack/agent/` contains only `environment.properties` + 
`log4j-cloud.xml` (the package-shipped files).
   3. Stop `cloudstack-agent` on the KVM host (so no startup process can 
pre-create state).
   4. From the management server, dispatch `addHost` for the KVM host - e.g. 
via `cmk add host` or `ngine_io.cloudstack.cs_host`.
   5. Observe failure with `errortext: 'Could not add host at [http://...] ... 
due to: [ can't setup agent, due to 
com.cloud.utils.exception.CloudRuntimeException: Failed to setup certificate in 
the KVM agent's keystore file, please see logs and configure manually!]'`.
   6. SSH to the KVM host as root and inspect:
      - `/etc/cloudstack/agent/cloud.jks` exists (created by `keystore-setup`)
      - `/etc/cloudstack/agent/cloud.csr` does *not* exist on disk after the 
call (consumed by MS via stdout, then trailing `chmod` fails silently - this is 
a separate cosmetic issue)
      - `/etc/cloudstack/agent/agent.properties` does **not** exist - this is 
the bug
   7. Re-running `addHost` after manually creating `agent.properties` with a 
`keystore.passphrase=<any-non-empty-value>` line lets the second invocation 
succeed, because `keystore-setup`'s `if [ -f "$PROPS_FILE" ]` gate now passes 
and the existing passphrase is reused. This is a workable manual workaround per 
host.
   
   ### what to do about it
   
   **Primary - patch `scripts/util/keystore-setup` to ensure `PROPS_FILE` 
exists before the passphrase-write branch.** Two-block change, restructure the 
gate so the file is created if missing, then the passphrase is always either 
reused or written:
   
   ```diff
   -# Re-use existing password or use the one provided
   -if [ -f "$PROPS_FILE" ]; then
   -    $LOGGER_CMD "Previous props file exists, trying to extract password"
   -    OLD_PASS=$(sed -n '/^keystore.passphrase/p' "$PROPS_FILE" | sed 
's/^keystore.passphrase=//g')
   -    if [ -n "${OLD_PASS// }" ]; then
   -        KS_PASS="$OLD_PASS"
   -        $LOGGER_CMD "Password extraction successful"
   -    else
   -        sed -i "/^keystore.passphrase.*/d" "$PROPS_FILE" 2>&1 | $LOGGER_CMD 
|| true
   -        echo "keystore.passphrase=$KS_PASS" >> "$PROPS_FILE"
   -        if [ $? != 0 ]; then
   -                $LOGGER_CMD "Could not add new password to agent.properties"
   -        else
   -                $LOGGER_CMD "New keystore password set"
   -        fi
   -    fi
   -fi
   +# Ensure props file exists, then re-use existing passphrase or persist the 
one provided.
   +if [ ! -f "$PROPS_FILE" ]; then
   +    touch "$PROPS_FILE"
   +    chmod 600 "$PROPS_FILE"
   +    $LOGGER_CMD "Created new props file at $PROPS_FILE"
   +fi
   +
   +OLD_PASS=$(sed -n '/^keystore.passphrase/p' "$PROPS_FILE" | sed 
's/^keystore.passphrase=//g')
   +if [ -n "${OLD_PASS// }" ]; then
   +    KS_PASS="$OLD_PASS"
   +    $LOGGER_CMD "Password extraction successful"
   +else
   +    sed -i "/^keystore.passphrase.*/d" "$PROPS_FILE" 2>&1 | $LOGGER_CMD || 
true
   +    echo "keystore.passphrase=$KS_PASS" >> "$PROPS_FILE"
   +    if [ $? != 0 ]; then
   +        $LOGGER_CMD "Could not add new password to agent.properties"
   +    else
   +        $LOGGER_CMD "New keystore password set"
   +    fi
   +fi
   ```
   
   After this, fresh-host KVM `addHost` works first try. Re-runs are idempotent 
(existing passphrase reused).
   
   **Secondary - harden `scripts/util/keystore-cert-import` to fail loudly 
instead of `exit 0` in error branches.** Same script has bare `exit` (no 
status) in two error branches, which return exit code 0 and let the MS see 
`SSHCmdResult.isSuccess()=true` even when nothing was imported:
   
   ```diff
    # Import certificate
    if [ ! -z "${CERT// }" ]; then
        echo "$CERT" > "$CERT_FILE"
    elif [ ! -f "$CERT_FILE" ]; then
   -    echo "Cannot find certificate file: $CERT_FILE, exiting"
   -    exit
   +    echo "Cannot find certificate file: $CERT_FILE, exiting" >&2
   +    exit 1
    fi
   
    # Import ca certs
    if [ ! -z "${CACERT// }" ]; then
        echo "$CACERT" > "$CACERT_FILE"
    elif [ ! -f "$CACERT_FILE" ]; then
   -    echo "Cannot find ca certificate file: $CACERT_FILE, exiting!"
   -    exit
   +    echo "Cannot find ca certificate file: $CACERT_FILE, exiting!" >&2
   +    exit 1
    fi
   ```
   
   `exit 1` (with status) lets `SSHCmdResult.isSuccess()` correctly return 
false, and the existing 3-retry-then-throw loop in `setupAgentSecurity()` will 
surface the real failure to the operator instead of leaving the agent with a 
broken keystore that only manifests later as runtime TLS handshake errors.
   
   ### relationship to #13064
   
   This is a **sibling defect**, not the same one as #13064 - same script 
family, same general class of silent-failure footguns, but a different trigger 
and a different downstream symptom:
   
   | | this issue | #13064 |
   |---|---|---|
   | Hypervisor | KVM | VMware |
   | Triggered by | fresh KVM agent in `addHost` (no prior `agent.properties`) 
| VMware SSH dispatch of `SetupCertificateCommand` to a SystemVM |
   | Buggy script | `keystore-setup` (passphrase-write gated on file existing) 
| `keystore-cert-import` (bare `exit` in error branch) |
   | Args sent by MS | all 5 / 10 args correct | only 6 of 10 args (`$7`/`$8` 
CACERT missing on the VMware path) |
   | Exit code from script | exit 1 (loud failure) | exit 0 (silent success 
masking real failure) |
   | Downstream symptom | `setupAgentSecurity()` throws "Failed to setup 
certificate in the KVM agent's keystore file" | agent's `cloud.jks` ends up 
missing the trustedCertEntry; agent later fails TLS handshake to MS:8250 with 
`certificate_unknown` |
   | Surface time | immediate, on first `addHost` | minutes later, once the 
agent actually tries to connect |
   
   The proposed Secondary fix above (replacing bare `exit` with `exit 1`) 
addresses one half of #13064 - once those branches return non-zero, the MS 
`setupAgentSecurity()` retry-and-throw loop will surface the VMware-dispatch 
arg-count bug as a proper failure instead of letting the silent-success 
continue to hide it. The actual primary fix for #13064 (passing CACERT args 
from the VMware SSH dispatch path) is independent of this change and orthogonal.
   
   ### operator workaround (until upstream fix lands)
   
   Replace `/usr/share/cloudstack-common/scripts/util/keystore-setup` on each 
KVM agent with a thin wrapper that pre-creates `agent.properties` with the 
passphrase from arg `$3` before exec'ing the original. Originals preserved as 
`.orig`:
   
   ```bash
   #!/bin/bash
   # Workaround for upstream keystore-setup not creating agent.properties on
   # fresh hosts. See <this issue link>.
   if [ $# -ge 5 ]; then
       PROPS_FILE="$1"
       KS_PASS="$3"
       if [ ! -f "$PROPS_FILE" ]; then
           touch "$PROPS_FILE"
           chmod 600 "$PROPS_FILE"
       fi
       EXISTING=$(sed -n '/^keystore\.passphrase=/p' "$PROPS_FILE" | head -1 | 
sed 's/^keystore\.passphrase=//')
       if [ -z "${EXISTING// }" ]; then
           echo "keystore.passphrase=$KS_PASS" >> "$PROPS_FILE"
       fi
       exec /usr/share/cloudstack-common/scripts/util/keystore-setup.orig "$@"
   fi
   exec /usr/share/cloudstack-common/scripts/util/keystore-setup.orig "$@"
   ```
   
   Verified working on Ubuntu 24.04 + CloudStack 4.22.0.0 KVM - `addHost` 
succeeds first try after the wrapper is in place and the agent dir is wiped to 
clean state.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to