** Description changed:

+ [Impact]
+ 
+ cloud-init makes DNS queries early in boot, before systemd-networkd is
+ active. The query causes systemd-resolved to query registered hooks,
+ which then socket-activates systemd-networkd. But, this causes a boot
+ delay, because systemd-networkd cannot start until the relevant cloud-
+ init phase has completed.
+ 
+ Hence, the user impact is a delay in boot on certain clouds.
+ 
+ [Test Plan]
+ 
+ TODO
+ 
+ [Where problems could occur]
+ 
+ The fix is to order systemd-networkd-resolve-hook.socket After=network-
+ pre.target. This avoids the situation where the socket for the resolve
+ hook is bound before systemd-networkd could actually be started.
+ 
+ Hence, problems introduced by this would most likely be related to unit
+ ordering or dependency problems.
+ 
+ [Other info]
+ 
+ The fix comes from upstream: systemd-networkd-resolve-hook.socket
+ 
+ [Original Description]
+ 
  `cloud-init`: `26.1-0ubuntu1`
  `systemd`: `259.5-0ubuntu2`
  `datasource`: `DataSourceGCELocal`
  `interface`: `ens4`
  DHCP used by `cloud-init`: `dhcpcd 10.3.0`
  
  Confirmed clouds affected: Google cloud
  
  On dev resolute images in GCE first boot is being delayed by about ~30s
  during `cloud-init`'s local datasource discovery. Specifically when
  `/etc/resolv.conf` is symlinked to the stub-resolver (`/etc/resolv.conf
  -> ../run/systemd/resolve/stub-resolv.conf`) which is the default.
  
  On initial boot `cloud-init analyze blame` shows that `init-
  local/search-GCELocal` takes about ~31s and `/var/log/cloud-init.log`
  shows that although `DHCP` completes quickly, the `DNS` resolution of
  `metadata.google.internal` takes about 30s before the first metadata
  request is sent. The same slow boot’s logs also show that `systemd-
  resolved` times out on `io.systemd.Network` resolve hook, and that
  `systemd-networkd.service` doesn't start until after `cloud-init-
  local.service` *finishes*.
  
  To test the theory I manually mounted a vanilla image (I had to do it
  like this because `livecd-rootfs` forces a specific /etc/resolv.conf
  file on you...) and edited it so that `/etc/resolv.conf` pointed to the
  non-stubbed `systemd-resolved` file instead (`/etc/resolv.conf ->
  ../run/systemd/resolve/resolv.conf`). This reduced first boot `init-
  local/search-GCELocal` back down to about ~2s and reduced the metadata
  `hostname` resolution from about 30s to ~0.03s.
  
  This indicates to me that the boot delay is caused by `systemd-
  resolved`’s stub resolver path interacting badly with the
  `io.systemd.Network` resolve hook path in `systemd` v259 (?) during
  `cloud-init-local`.
  
  ## Longer explanation/moar logs
  
  `cloud-init status` showed that `cloud-init` completed successfully with
  no reported errors:
  
  ```
  $ sudo cloud-init status --long
  status: done
  extended_status: done
  boot_status_code: enabled-by-generator
  detail: DataSourceGCELocal
  errors: []
  recoverable_errors: {}
  ```
  
  but that most of the time is spent in the `local` GCE `datasource` search:
  ```
  $ sudo cloud-init analyze blame
  -- Boot Record 01 --
       30.98900s (init-local/search-GCELocal)
       06.80700s (modules-config/config-apt_configure)
       01.62200s (modules-config/config-ntp)
       01.09700s (modules-final/config-keys_to_console)
       00.94900s (init-network/config-growpart)
       00.91800s (modules-final/config-ssh_authkey_fingerprints)
       00.58400s (init-network/config-ssh)
       ...
       ...
  ```
  
  `cloud-init analyze` shows:
  ```
  $ sudo cloud-init analyze show
  -- Boot Record 01 --
  
  Starting stage: init-local
  |`->no cache found (duration: 0.000s) @00.01300s +00.00000s
  |`->found local data from DataSourceGCELocal (duration: 30.989s) @00.02700s 
+30.98900s
  Finished stage: (init-local) 31.93500 seconds
  ```
  
  ... but an important finding from `/var/log/cloud-init.log` (see below) is 
that:
  * DHCP was quick
  * routes were added immediately
  * the large stall comes from `hostname` resolution
  * and once resolution finished the metadata requests themselves succeed 
straight away
  ```
  2026-04-15 11:16:18,096 - dhcp.py[DEBUG]: Performing a dhcp discovery on ens4
  2026-04-15 11:16:18,978 - performance.py[DEBUG]: Running ['dhcpcd', 
'--ipv4only', '--waitip', '--persistent', '--noarp', '--debug', 
'--script=/bin/true', 'ens4'] took 0.869 seconds
  2026-04-15 11:16:18,984 - dhcp.py[DEBUG]: Parsing dhcpcd lease for interface 
ens4: 'reason=BOUND
  ...
  ...
  domain_name_servers=169.254.169.254
  ...
  ...
  dhcp_server_identifier=169.254.169.254
  ...
  ...
  2026-04-15 11:16:18,988 - ephemeral.py[DEBUG]: Received dhcp lease on ens4 
for 10.164.0.17/255.255.255.255
  2026-04-15 11:16:18,992 - subp.py[DEBUG]: Running command ['ip', '-4', 
'route', 'append', '10.164.0.1/32', 'dev', 'ens4'] ...
  2026-04-15 11:16:18,994 - subp.py[DEBUG]: Running command ['ip', '-4', 
'route', 'append', '0.0.0.0/0', 'via', '10.164.0.1', 'dev', 'ens4'] ...
  2026-04-15 11:16:49,030 - performance.py[DEBUG]: Resolving URL took 30.033 
seconds
  2026-04-15 11:16:49,031 - url_helper.py[DEBUG]: [0/6] open 
'http://metadata.google.internal/computeMetadata/v1/instance/id' ...
  2026-04-15 11:16:49,039 - url_helper.py[DEBUG]: Read from 
http://metadata.google.internal/computeMetadata/v1/instance/id (200, 18b) after 
1 attempts
  2026-04-15 11:16:49,041 - url_helper.py[DEBUG]: Read from 
http://metadata.google.internal/computeMetadata/v1/instance/zone (200, 41b) 
after 1 attempts
  2026-04-15 11:16:49,044 - url_helper.py[DEBUG]: Read from 
http://metadata.google.internal/computeMetadata/v1/instance/hostname (200, 55b) 
after 1 attempts
  ...
  ...
  2026-04-15 11:16:49,063 - performance.py[DEBUG]: Getting metadata took 30.988 
seconds
  ```
  
  `journalctl` shows the resolve hook timeout and the ordering of 
`cloud-init-local` vs. `systemd-networkd` more clearly:
  ```
  $ sudo journalctl -b 0 -o short-monotonic \
      -u cloud-init-local.service \
      -u systemd-resolved.service \
      -u systemd-networkd.service
  
  [    3.494233] localhost systemd[1]: Starting systemd-resolved.service - 
Network Name Resolution...
  [    3.644166] localhost systemd[1]: Started systemd-resolved.service - 
Network Name Resolution.
  [    8.555137] localhost systemd[1]: Starting cloud-init-local.service - 
Cloud-init: Local Stage (pre-network)...
  [   39.023176] resolute-quick systemd-resolved[196]: System hostname changed 
to 'resolute-quick'.
  [   39.124219] resolute-quick systemd-resolved[196]: Query on hook 
'/run/systemd/resolve.hook/io.systemd.Network' failed with error 
'io.systemd.TimedOut', ignoring.
  [   39.124478] resolute-quick systemd-resolved[196]: Query on hook 
'/run/systemd/resolve.hook/io.systemd.Network' failed with error 
'io.systemd.TimedOut', ignoring.
  [   39.658145] resolute-quick systemd[1]: Finished cloud-init-local.service - 
Cloud-init: Local Stage (pre-network).
  [   39.693138] resolute-quick systemd[1]: Starting systemd-networkd.service - 
Network Management...
  [   39.804399] resolute-quick systemd[1]: Started systemd-networkd.service - 
Network Management.
  ```
  In other words:
  * `cloud-init-local` starts and needs to resolve `metadata.google.internal`
  * with `stub-resolv.conf` that lookup goes through `systemd-resolved`
  * `systemd-resolved` attempts to use the io.systemd.Network resolve hook, but 
that hook times out after 30 seconds
  * and finally `systemd-networkd` only starts after `cloud-init-local` has 
finished
  
  The ordering and socket relationships make this seem plausible:
  * `cloud-init-local.service` has `Wants=network-pre.target` && 
`Before=network-pre.target`
  * `systemd-networkd.service` has `After=network-pre.target` && `... 
systemd-networkd-resolve-hook.socket`
  * `systemd-networkd-resolve-hook.socket` listens on 
`/run/systemd/resolve.hook/io.systemd.Network`
  
  ### When using an image with the modified symlink
  
  On a fresh boot of the modified image:
  
  ```
  $ cloud-init analyze blame
  -- Boot Record 01 --
       03.94800s (modules-config/config-ntp)
       02.19300s (init-local/search-GCELocal)
       01.29000s (modules-final/config-ssh_authkey_fingerprints)
       00.88100s (init-network/config-growpart)
       00.67000s (modules-final/config-keys_to_console)
       00.41300s (modules-config/config-apt_configure)
       ...
       ...
  ```
  and
  ```
  $ cat /var/log/cloud-init.log
  ...
  ...
  64:2026-04-15 20:10:24,412 - performance.py[DEBUG]: Resolving URL took 0.031 
seconds
  ...
  ...
  ```

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2148619

Title:
  `cloud-init` local datasource discovery takes ~30s to complete

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/2148619/+subscriptions


-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

Reply via email to