On Fri, Feb 21, 2025 at 12:13 AM Toomas Soome <tso...@me.com> wrote:
>
>
>
> On 21. Feb 2025, at 04:39, Rick Macklem <rick.mack...@gmail.com> wrote:
>
> On Thu, Feb 20, 2025 at 4:28 PM Steve Rikli <s...@genyosha.net> wrote:
>
>
> On Wed, Feb 19, 2025 at 02:40:15PM -0800, Rick Macklem wrote:
>
>
> The subject line basically describes the problem glebius@
> ran into.  When doing an NFS mount in /etc/fstab, it failed
> since the DNS service was not yet working and, as such,
> the DNS lookup of the server fqdn failed, causing the mount
> to fail. Note that this behaviour has existed for decades.
>
> He feels this is a bug and that mount_nfs(8) should retry
> getaddrinfo(3) calls until success, instead of failing the
> mount when the first attempt fails.
> The problem with just retrying getaddrinfo(3) is that it
> could retry forever for simple failures like a typo in the
> server fqdn.
> I can see several ways this can be handled and would
> like feedback from others w.r.t. these alternatives.
>
> 1) Simply document this case and encourage use of
>    host names in /etc/hosts for NFS servers along with
>    specifying use of file before dns in nsswitch.conf.
>     Doing this results in the mounts working whether or
>      not DNS is working.
>
> 2) Call it a bug and patch mount_nfs(8) to retry getaddrinfo(3)
>     until it succeeds. (I feel this would be a POLA violation,
>     given that the current behaviour has existed for decades
>     and for simple cases where the fqdn will never resolve
>     the behaviour would be to hang at the mount attempt
>     during boot unless "bg" is specified for the /etc/fstab entry.)
>
> 3) Add a new NFS mount option "retrydns=<N>", which would enable
>    retries of getaddrinfo(3). This would avoid any POLA violation and
>    would allow for a convenient way to document the behaviour in
>    "man mount_nfs".
>
> 4) ???
>
> So, what do you think is the preferred change?
>
>
> I don't think I would change mount_nfs code behavior for this.
>
> That is, requiring services and daemons etc. to workaround missing,
> misconfigured, slow, or misbehaving nameservice (whether it's DNS,
> /etc/hosts, NIS, whatever) seems like more complexity, possibly not
> effective, and maybe not focused on the right thing.
>
> Now, without meaning to be presumptuous, it may be worth re-examining
> the startup sequence, e.g. to make sure NFS mounts are tried after the
> known dependencies can reasonably be expected to have started, including
> the network, plus local_unbound or bind (if used), possibly others.
>
> After a quick look, I don't see an obvious problem with the sequence,
> but more knowledgeable eyes than mine are welcome.  I don't quite follow
> some of the output from rcorder and service -r.
>
> ps: I looked and the return value from getaddrinfo(3) does not
>      appear to be useful to discern the case of "DNS service not
>      running yet". (I think it replies EAI_FAIL for this case.)
>
>
> In that area, I'll note FreeBSD rc.d has a "NETWORKING" dependency for
> PROVIDE and REQUIRE, and it's included in scripts like nfsclient,
> mountcritremote et al. However there seems to be no similar dependency
> for something like "NAMESERVICE" (generic, as opposed to "named"
> specifically), and I'm not sure how that might be implemented, even
> assuming it could be useful in a situation like this.
>
> I.e. there are many things to potentially check for "can the system
> resolve hostnames yet", and not all of them involve running a local
> instance of named, unbound, etc.
>
> In general, if I were running into problems with nameservice not being
> available by the time NFS mounts happen, I think I'd start by looking
> into possible nameservice issues, then check out some mechanisms other
> folks have mentioned (fstab IP addresses or late option, rc.conf
> netwait_enable, etc.) rather than coding workarounds into NFS itself.
>
> Well, the patch I have created (it took about 15min) only changes behaviour
> if a new "retrydns" option is used. As such, I think it might be useful for
> some, but doesn't change things unless someone uses it.
>
> I agree with you that I don't think the rc scripts have a way to REQUIRE
> that DNS is working. (I, personally, always put the fqdn for NFS servers in
> /etc/hosts and make sure "files" is first in nsswitch.conf, but others argue
> that is not feasible for some deployments. Using IP numbers works for
> AUTH_SYS, but not for Kerberized mounts.)
>
> Note that there is already "retrycnt", which sets how many times the mount
> is retried, but that retry loop doesn't include getaddrinfo(3) calls.
> --> Personally, I do not like always doing retries since I often
>     type mount commands manually and I'm a terrible typist, so I
>     often mistype the server's name.
>
> This reply was mostly a followup on all the good comments and
> not just yours.
>
> Thanks everyone, for your comments, rick
>
>
> my 2cents:
>
> there is a difference of name service not responding and name not resolving.
Agreed. Unfortunately, the return values for getaddrinfo(3) do not clearly
differentiate between them. I think Gleb's case returns EAI_FAIL, which is also
returned for other failures. I think EAI_NONAME is returned for the case where
the resolver (dns, /etc/hosts or ???) does determine that the name is bogus.

I suppose the code could do retries for all return values other than
EAI_NONAME, but to me that would still be a POLA violation, since the
current behaviour has been in place for decades (as others have noted).
Also, some of the feedback here has been "It is not broken, don't fix it",
if I interpreted it correctly. A new option avoids changing the default
behaviour.
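
A minimal sketch of what "retry for everything other than EAI_NONAME,
but only when asked to" could look like. This is not the actual patch:
the helper names dns_error_is_retryable() and getaddrinfo_retry(), the
retries count (standing in for a "retrydns=<N>" value) and the fixed
sleep between attempts are all assumptions made for illustration.

```c
/* Ask for POSIX.1-2001 declarations so getaddrinfo(3) is visible under strict -std modes. */
#define _POSIX_C_SOURCE 200112L

#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: decide whether a getaddrinfo(3) error is worth
 * retrying.  Assumption: only EAI_NONAME means the resolver has decided
 * the name is bogus; any other failure might be transient.
 */
static int
dns_error_is_retryable(int error)
{
	return (error != 0 && error != EAI_NONAME);
}

/*
 * Hypothetical helper: call getaddrinfo(3) up to 1 + retries times,
 * sleeping delay_sec seconds between attempts.  With retries == 0 this
 * is a single getaddrinfo(3) call, so the default behaviour is unchanged.
 * Returns the last getaddrinfo(3) error code (0 on success).
 */
static int
getaddrinfo_retry(const char *host, const char *serv,
    const struct addrinfo *hints, struct addrinfo **res,
    int retries, unsigned int delay_sec)
{
	int error;

	for (;;) {
		error = getaddrinfo(host, serv, hints, res);
		if (!dns_error_is_retryable(error) || retries-- <= 0)
			return (error);
		sleep(delay_sec);
	}
}
```

Because the retry count is bounded and defaults to zero, a mistyped
server name that the resolver reports as EAI_NONAME fails immediately,
and one that is reported some other way fails after at most N retries
rather than hanging forever.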

> In the first case, it will go to:
>
>              bg      If an initial attempt to contact the server fails, fork
>                      off a child to keep trying the mount in the background.
>                      Useful for fstab(5), where the file system mount is not
>                      critical to multiuser operation.
>
>              bgnow   Like bg, fork off a child to keep trying the mount in the
>                      background, but do not attempt to mount in the foreground
>                      first.  This eliminates a 60+ second timeout when the
>                      server is not responding.  Useful for speeding up the
>                      boot process of a client when the server is likely to be
>                      unavailable.  This is often the case for interdependent
>                      servers such as cross-mounted servers (each of two
>                      servers is an NFS client of the other) and for cluster
>                      nodes that must boot before the file servers.
>
> In the second case, it's a failure you cannot recover from.
The only difference between "bg" and "bgnow" is whether the mount goes
into the background right away or after the first failed attempt.
A name resolver failure still terminates the mount attempt for both of
these. "bg" does fix it for Gleb, but I think that is just timing.
(Retries are of the actual NFS mount, assuming the NFS server is also
booting and has not yet come up. Cross mounts between multiple systems
are messy.)

Gleb also notes a different behaviour when "late" is used. I have not yet
investigated this one.

rick

>
>
> rgds,
>
> toomas
>
>
>
>
>
