Package: nfs-common
Version: 1:1.3.4-2.1+deb9u1
Severity: important

Dear Maintainer,

FYI: /usr/share/bug/nfs-common/script warns of an error.
  In my case, 'cat /etc/fstab|grep nfs >&3' returns 1 because 'grep'
  finds no match (my nfs is all in autofs).  It should probably be
  guarded with '|| true' to avoid user confusion in the bug report.
  The same goes for the other grep statements.
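
A minimal sketch of the '|| true' guard suggested above. The function name report_fstab_nfs is hypothetical and is not taken from the packaged bug script; the sketch also writes to stdout rather than to file descriptor 3, which the real script uses.

```shell
# Print the nfs lines of an fstab-style file (default: /etc/fstab).
# grep exits 1 when nothing matches; '|| true' keeps that "no match"
# status from being reported as a script error.
report_fstab_nfs() {
    grep nfs "${1:-/etc/fstab}" || true
}
```

Note that '|| true' also masks a grep exit status of 2 (for example, an unreadable file), which is probably acceptable for an informational dump.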

Note:
  This is a doozy -- a whole tower of fail (some of it my fault for
  implementing an earlier workaround for nfs deadlocks and forgetting
  it was there). So i'm not sure what package to report it against, but
  the core trigger is in /usr/sbin/start-statd, so i'm starting there.
  Yes, my system is likely configured in a pretty non-standard way: an
  /etc/nfsmount.conf forcing nfsv3, added to avoid a deadlock on earlier
  systems, has come back to bite me as a deadlock now :-(

  other subsystems involved:
    - systemd
    - dbus-daemon
    - autofs

  ultimately, my goal here is to help establish robust systemd-capable
  coordination between the various parts here, to avoid another similar
  issue due to these inter-dependencies.  I don't know if rewriting
  'start-statd' to take systemd state into account is the correct
  solution, but IMHO, systemd/dbus-daemon are utterly fragile in this
  situation and extremely difficult to debug (you need systemctl to do
  stuff, but it won't work, you can't restart dbus-daemon w/o systemctl,
  and kill -TERM on pid 1 doesn't work ...).

Summary:
  - system configured for NFSv3 mounts via /etc/nfsmount.conf
    note: this was to work around a bug in NFSv4.[012] that
    caused deadlocks against NFSv3 servers running Jessie.
    i do not recall the bug #.
    something changed in a recent Stretch patchlevel, as this
    was working fine up until i patched and rebooted.
  - systemd unit rpc-statd.service is disabled
  - automount/autofs -> nfs mount is triggered, calling start-statd,
    which runs 'systemctl start rpc-statd', which takes down
    dbus-daemon and never completes.
  - regardless of where the blame lies, it is possibly wrong to
    call 'systemctl' from inside 'start-statd' *if* it's being called
    from within a systemd unit itself.
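
To illustrate the last point, here is a rough sketch of how a start-statd-like script might detect that it is running inside a systemd service and avoid waiting on the start job. This is not the script shipped in nfs-common. The function names in_systemd_service and statd_start_args are hypothetical, and looking for '.service' in /proc/self/cgroup is only a heuristic. '--no-block' is an existing systemctl option; whether it actually avoids the hang described here is untested.

```shell
# Succeeds if the given cgroup file (default: this process's own) names
# a systemd service unit, e.g. ".../system.slice/autofs.service".
# Heuristic only: assumes the unit name appears in the cgroup path.
in_systemd_service() {
    grep -q '\.service' "${1:-/proc/self/cgroup}" 2>/dev/null
}

# Print the systemctl arguments to use for starting statd.
# --no-block queues the start job and returns without waiting for it.
statd_start_args() {
    if in_systemd_service "$1"; then
        echo "start --no-block rpc-statd.service"
    else
        echo "start rpc-statd.service"
    fi
}
```

A caller would run something like 'systemctl $(statd_start_args)' (unquoted, so the arguments split). One trade-off of not waiting: the mount may proceed before statd is actually up.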

  If the system is configured for NFS v3 mounts via /etc/nfsmount.conf
  and the systemd unit 'rpc-statd' is disabled, then the automounter
  sets off a chain during boot (at least on our system) that forcibly
  tries to run 'systemctl start rpc-statd' via /usr/sbin/start-statd.

  This results in the systemctl call not completing (i don't know if
  it's because systemctl calls can't be nested, or can't be made outside
  the normal startup flow, or what), and eventually dbus-daemon stops
  responding (so it could be a bug that needs to be transferred there).
  this locks up the entire boot process: all systemctl calls time out,
  and dbus-daemon is sitting in EAGAIN (resource temporarily unavailable).

  Additionally, i wasn't able to ssh in (even though systemd had started
  sshd) because 'pam_motd' in /etc/pam.d/sshd calls update-motd,
  which also blocked hard, never completed, and was uninterruptible.
  once i commented 'pam_motd' out, i could ssh in and <CTRL>C something
  hanging on nfs to get a shell. (again, tower of fail)

  once in, if i killed the 'systemctl start rpc-statd', the system would
  return to responsiveness. (systemctl could again contact dbus-daemon)
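
Since killing the hung call restored responsiveness, one possible mitigation is to bound the call with a time limit. The sketch below assumes timeout(1) from GNU coreutils, which exits with status 124 when the limit is reached. The helper name run_bounded and the 30-second value in the usage comment are illustrative only.

```shell
# Run a command, giving up after LIMIT seconds.
# Usage: run_bounded LIMIT COMMAND [ARGS...]
# Returns the command's own exit status, or 124 if it timed out.
run_bounded() {
    limit="$1"
    shift
    timeout "$limit" "$@"
}

# Hypothetical use in a start-statd-like script:
#   run_bounded 30 systemctl start rpc-statd.service ||
#       echo "rpc-statd start did not complete" >&2
```

This only limits how long the caller is stuck; it does not explain why the start job never completes.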

  systemd-cgls showed:

  +-autofs.service
    | +-1453 /usr/sbin/automount --pid-file /var/run/autofs.pid
      | +-1465 /bin/mount -t nfs -s -o intr,nodev,nosuid
      ral-local-linux:/exports/linux-amd64 /var/autofs/mnt/linux-amd64
        | +-1466 /sbin/mount.nfs
        ral-local-linux:/exports/linux-amd64 /var/autofs/mnt/linux-amd64
        -s -o rw,nodev,nosuid,intr
          | +-1467 /bin/sh /usr/sbin/start-statd
            | -1470 systemctl start rpc-statd.service
                    ^^^^ this hangs dbus-daemon and brings down the
                    whole systemd kingdom.

  before it hung, ...
    puffin:/etc/default/grub.d# systemctl list-jobs
    JOB UNIT                           TYPE  STATE
    607 apt-daily.service              start running
    462 nfs-config.service             start running
    468 apt-daily-upgrade.service      start waiting
    460 rpc-statd-notify.service       start waiting
    453 rpc-statd.service              start waiting
    464 systemd-tmpfiles-clean.service start running



Note: 'ral-local-linux' is our NFS-shared /usr/local.  this may have
been triggered early due to 'cron' being started and user '@reboot' jobs
launching.

Note: i have a lot of systemd debug and other captured logs i can
provide if needed.

here's the /etc/nfsmount.conf that was being used prior:
        [ NFSMount_Global_Options ]
             nfsvers=3

Thanks,
--stephen

-- Package-specific info:
-- rpcinfo --
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100005    1   udp  48853  mountd
    100005    1   tcp  45675  mountd
    100005    2   udp  56398  mountd
    100005    2   tcp  58131  mountd
    100005    3   udp  49109  mountd
    100005    3   tcp  48261  mountd
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100227    3   tcp   2049
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100227    3   udp   2049
    100021    1   udp  54879  nlockmgr
    100021    3   udp  54879  nlockmgr
    100021    4   udp  54879  nlockmgr
    100021    1   tcp  41063  nlockmgr
    100021    3   tcp  41063  nlockmgr
    100021    4   tcp  41063  nlockmgr
    100007    2   udp    806  ypbind
    100007    1   udp    806  ypbind
    100007    2   tcp    807  ypbind
    100007    1   tcp    807  ypbind
    100024    1   udp  58391  status
    100024    1   tcp  34239  status
-- /etc/default/nfs-common --
NEED_STATD=
STATDOPTS=
NEED_IDMAPD=yes
NEED_GSSD=
-- /etc/idmapd.conf --
[General]
Verbosity = 0
Pipefs-Directory = /run/rpc_pipefs
[Mapping]
Nobody-User = nobody
Nobody-Group = nogroup
-- /etc/fstab --

-- System Information:
Debian Release: 9.13
  APT prefers oldstable-updates
  APT policy: (500, 'oldstable-updates'), (500, 'oldstable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 4.9.0-14-amd64 (SMP w/24 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), 
LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages nfs-common depends on:
ii  adduser              3.115
ii  init-system-helpers  1.48
ii  keyutils             1.5.9-9
ii  libc6                2.24-11+deb9u4
ii  libcap2              1:2.25-1
ii  libcomerr2           1.43.4-2+deb9u2
ii  libdevmapper1.02.1   2:1.02.137-2
ii  libevent-2.0-5       2.0.21-stable-3
ii  libgssapi-krb5-2     1.15-1+deb9u2
ii  libk5crypto3         1.15-1+deb9u2
ii  libkeyutils1         1.5.9-9
ii  libkrb5-3            1.15-1+deb9u2
ii  libmount1            2.29.2-1+deb9u1
ii  libnfsidmap2         0.25-5.1
ii  libtirpc1            0.2.5-1.2+deb9u1
ii  libwrap0             7.6.q-26
ii  lsb-base             9.20161125
ii  rpcbind              0.2.3-0.6
ii  ucf                  3.0036

Versions of packages nfs-common recommends:
ii  python  2.7.13-2

Versions of packages nfs-common suggests:
pn  open-iscsi  <none>
pn  watchdog    <none>

Versions of packages nfs-kernel-server depends on:
ii  init-system-helpers  1.48
ii  keyutils             1.5.9-9
ii  libblkid1            2.29.2-1+deb9u1
ii  libc6                2.24-11+deb9u4
ii  libcap2              1:2.25-1
ii  libsqlite3-0         3.16.2-5+deb9u3
ii  libtirpc1            0.2.5-1.2+deb9u1
ii  libwrap0             7.6.q-26
ii  lsb-base             9.20161125
ii  netbase              5.4
ii  ucf                  3.0036

-- Configuration Files:
/etc/default/nfs-common changed [not included]

-- no debconf information
