On Wed, May 12, 2021 at 3:52 AM Christopher James Halse Rogers
<r...@ubuntu.com> wrote:
>
> Hello everyone,
>
> There's an nfs-utils SRU¹ hanging around waiting for a policy decision
> on use of the After=network-online.target systemd unit dependency. I'm
> not an expert here, but it looks like part of my SRU rotation today is
> starting the discussion on this so we can resolve it one way or another!
Goodness, this email thread has a lot of different directions. Just a few
observations that might help:

1) What does it actually mean for systemd-networkd to consider networking
'online'?

To be specific, 'network-online.target' simply runs
'systemd-networkd-wait-online' (via systemd-networkd-wait-online.service),
which has its own man page that is very descriptive about exactly what it
waits for. To briefly summarize, it means that all systemd-networkd managed
interfaces that are 'required for online' have reached a setup state of
'configured' or 'failed', and at least one managed interface has reached an
operational state of 'degraded' or higher. Any interface that should not be
required should have 'RequiredForOnline=no' in the [Link] section of its
.network file (see man systemd.network). The 'degraded' state of an
interface means it has carrier and an address valid on the local link (the
next step up is 'routable', which means it has a routable address
configured).

Note that systemd-networkd isn't the only provider of network management;
NetworkManager is another, and it also has a service implementing (or more
accurately, WantedBy) network-online.target, which is
NetworkManager-wait-online.service. That very likely has a different
definition of exactly what it means for the network to be 'online'.

2) What's the downside of something requiring network-online.target?

The only downside is the delay of the service(s) configured with
After=network-online.target. Any such service will be delayed at boot until
the network manager (whatever it is) decides the network is "up" (as
described above). However, that of course also delays any services that
order themselves after the delayed service(s). To the end user, this
typically looks like a 'hang' during boot. The specific reason is that there
are services/targets that order themselves after network-online.target and
are also ordered before the services that provide user login.
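To make the ordering effect concrete, here is a toy model in Python. It is only a sketch of how After=/Before= ordering pushes a delay down the chain, not how systemd actually schedules jobs. The unit names are the ones discussed in this thread, but `ready_times` is a hypothetical helper, and the 120-second wait and one-second start costs are made-up numbers for illustration.

```python
from functools import lru_cache


def ready_times(after, cost):
    """Return the time (in seconds) at which each unit becomes ready.

    after: dict mapping a unit to the units it is ordered After=
    cost:  dict mapping a unit to the seconds it needs to start itself
    A unit starts once everything it is ordered after is ready.
    Assumes the ordering graph has no cycles.
    """
    @lru_cache(maxsize=None)
    def ready(unit):
        start = max((ready(dep) for dep in after.get(unit, [])), default=0.0)
        return start + cost.get(unit, 0.0)

    return {unit: ready(unit) for unit in cost}


WAIT = "systemd-networkd-wait-online.service"
ONLINE = "network-online.target"
LOGIN = "systemd-user-sessions.service"

# Assumed numbers: the wait-online step takes 120s because DHCP on the
# second interface never gets an answer; everything else takes 1s.
cost = {WAIT: 120.0, ONLINE: 0.0, "cloud-init.service": 1.0,
        "dummy.service": 1.0, LOGIN: 1.0}

# Scenario 1: cloud-init is ordered after wait-online and before login.
# "Before=LOGIN" on cloud-init is modelled as LOGIN being After= cloud-init.
plain = ready_times(
    {ONLINE: [WAIT], "cloud-init.service": [WAIT], LOGIN: ["cloud-init.service"]},
    cost,
)

# Scenario 2: dummy.service only Wants= network-online.target, so it adds
# no ordering between network-online and login.
dummy = ready_times({ONLINE: [WAIT]}, cost)

print(plain[LOGIN])   # 122.0: login waits for the network timeout
print(dummy[LOGIN])   # 1.0: login is not delayed
print(dummy[ONLINE])  # 120.0: the target itself is still late
```

The delay only reaches user login when some unit sits on an ordering path between the network wait and systemd-user-sessions.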
In a default cloud image system, the specific packages that introduce this
problem are cloud-init and open-iscsi. For example, here is the startup plot
of a plain hirsute cloud-init VM, with the only modification being adding a
second interface (with no connection to anything) and adding
systemd-networkd config to start DHCP on the second interface (which of
course will delay the network starting, since the DHCP request will never
get an answer). You can see that systemd-user-sessions is delayed until
after network-online.target, which 'hangs' the boot:

https://people.canonical.com/~ddstreet/startups/startup-plain.svg

And here is the same VM, with cloud-init and open-iscsi removed. Note that
network-online.target isn't in the units started at boot, so there is no
delay for anything:

https://people.canonical.com/~ddstreet/startups/startup-without-cloud-init-open-iscsi.svg

And again, but with a simple service 'dummy.service' that does nothing and
has Wants=network-online.target and WantedBy=multi-user.target (this service
pulls network-online.target into the units started at boot). This shows that
systemd-user-sessions isn't delayed, so login is not delayed and there is
no 'hang' during boot. The network-online target is delayed, as expected; it
just has no impact on how long boot takes to reach user login.
https://people.canonical.com/~ddstreet/startups/startup-without-cloud-init-open-iscsi-with-network-online.svg

Finally, to illustrate the boot ordering problems that open-iscsi
introduces, the dummy.service is changed to want network-online.target and
remote-fs-pre.target, and to order itself between those, just as open-iscsi
does (specifically, After=network-online.target and
Before=remote-fs-pre.target):

https://people.canonical.com/~ddstreet/startups/startup-without-cloud-init-open-iscsi-dummy-delay.svg

Note that this isn't *necessarily* a bug in open-iscsi, as it kind of makes
sense; if user login does in fact require an iscsi-mounted directory, then
systemd-user-sessions should be ordered after open-iscsi, and of course
open-iscsi requires networking to work. However, the real dependency chain
clearly has subtleties that the current implementation doesn't capture. For
example, even if there are no iscsi mounts at all, open-iscsi adds this boot
ordering, which delays user login until after network-online.

The cloud-init package introduces a similar ordering, but is much more blunt
about it; the cloud-init.service includes
After=systemd-networkd-wait-online.service and
Before=systemd-user-sessions.service.

To clarify, *any* package with systemd services/targets might introduce unit
ordering similar to this at boot time, so this isn't necessarily just a
problem added by open-iscsi and cloud-init. I just used those packages as
examples since they're included in the cloud images by default.

3) So just adding After=network-online.target will cause the delay if the
network doesn't start up?

No. As shown in the second plot above, network-online.target is not part of
the boot units by default, and it will only cause a delay if some (enabled)
service or target actually Wants= it.
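The Wants= versus After= distinction can be sketched with another toy model in Python. Again this is a simplification under stated assumptions, not systemd's real transaction logic: only Wants= is modelled as pulling units in (Requires= and friends are ignored), `pulled_in` is a hypothetical helper, and example-a.service and example-b.service are made-up unit names.

```python
def pulled_in(wants, roots):
    """Return every unit reachable from the roots through Wants= edges."""
    seen = set()
    stack = list(roots)
    while stack:
        unit = stack.pop()
        if unit in seen:
            continue
        seen.add(unit)
        stack.extend(wants.get(unit, []))
    return seen


# Two hypothetical services enabled in multi-user.target.
wants = {
    "multi-user.target": ["example-a.service", "example-b.service"],
}

# Both order themselves after network-online.target, but ordering alone
# pulls nothing in, so pulled_in() deliberately never looks at this dict.
after = {
    "example-a.service": ["network-online.target"],
    "example-b.service": ["network-online.target"],
}

before_dummy = pulled_in(wants, ["multi-user.target"])
print("network-online.target" in before_dummy)  # False

# Enable dummy.service, which has Wants=network-online.target and
# WantedBy=multi-user.target.
wants["multi-user.target"].append("dummy.service")
wants["dummy.service"] = ["network-online.target"]

after_dummy = pulled_in(wants, ["multi-user.target"])
print("network-online.target" in after_dummy)  # True
```

Only once some enabled unit Wants= the target do the After=network-online.target orderings have anything to wait for.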
If no (enabled) service/target has Wants=network-online.target, then it
doesn't matter how many services order themselves after it with
After=network-online.target; it won't be considered during boot and there
will be no delay (due to networking).

Additionally, by default the 'user login' stage of boot isn't ordered after
network-online, so a delay in login also requires some service or target to
introduce ordering between network-online and systemd-user-sessions, similar
to what open-iscsi and cloud-init do.

4) Should nfs-utils use network-online.target?

IMHO yes, but it should be careful about its overall ordering of services.
For example, it shouldn't introduce a boot ordering dependency on the
network if no NFS mounts are defined.

5) Is this a systemd problem?

IMHO systemd isn't doing anything wrong by providing network-online;
however, as some people mentioned in this thread, network-online.target is
about the most blunt instrument you could think of, and there certainly
seems to be a need for systemd to provide more fine-grained network
dependency controls. That's certainly something that could be
proposed/discussed with upstream systemd. However, it may also be something
that is only implemented when using systemd-networkd for networking. I have
no idea how NetworkManager does anything, or even whether it implements
NetworkManager-wait-online.service in the same way systemd-networkd does.

>
> I am not an expert in this area, but as I understand it, the tradeoff
> here is:
> 1. Without a dependency on After=network-online.target there is no
> guarantee that the network interface(s) will be usable at the time the
> nfs-utils unit triggers, and nfs-utils will fail if the relevant network
> interface is not usable, or
> 2. With a dependency on After=network-online.target nfs-utils will
> reliably start, but if there are any interfaces which are configured
> but do not come up this will result in the boot hanging until the
> timeout is hit.
>
> In mitigation of (2), there are apparently a number of default packages
> which already have a dependency on After=network-online.target, so boot
> hanging if interfaces are down is the status quo?
>
> The obvious thing to do here would be to follow Debian, but as far as I
> can tell there is not currently a Debian policy about this - the best I
> can find is an ancient draft of a best-practices guide² suggesting
> that packages SHOULD handle networking dynamically, but if they do not
> MUST have a dependency on After=network-online.target
>
> As far as I understand it, handling networking dynamically requires
> upstream code changes (although maybe fairly simple code changes?).
>
> It seems unlikely that, whatever we decide, we'll immediately do a full
> sweep of the archive and fix everything, so it looks like our choice is
> between:
>
> 1. The long-term goal is to have no After=network-online.target
> dependencies in default boot (stretch goal: in main). Whenever we run
> into a package-fails-if-network-is-not-yet-up bug, we patch the code
> and submit upstream. Over time we audit existing users of
> After=network-online.target and patch them for dynamic networking, as
> time permits.
>
> 2. We don't expect to be able to reach no After=network-online.target
> dependencies in the default boot, so it's not a priority to avoid them.
> Whenever we run into a package-fails-if-network-is-not-yet-up bug, we
> add an After=network-online.target dependency.
>
> Option (1) seems to be the technically superior option (and is
> recommended by systemd upstream³), but appears to require more work. I
> have limited insight into how much work that would be; someone from
> Foundations or Server probably needs to weigh in on that.
>
> Option (2) seems to be formalisation of the status quo, so would seem
> to be less work.
>
> Let the discussion begin!
>
> ¹: https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/1918141
> ²: https://github.com/ajtowns/debian-init-policy/blob/master/systemd-best-practices.pad
> ³: https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/
>
> --
> ubuntu-devel mailing list
> ubuntu-devel@lists.ubuntu.com
> Modify settings or unsubscribe at:
> https://lists.ubuntu.com/mailman/listinfo/ubuntu-devel