Trying to set diskless(8) -- hanging in "RPC timeout for server"

Stefan Unterweger Mon, 10 May 2010 16:51:58 -0700

Hello!

I'm trying to set up my server for diskless boots, as described
in the diskless(8) manpage (at the moment mostly as an academic
exercise, but I am planning to put my oldish laptops to some use
this way).


I followed the instructions from the manpage, setting up the
various pieces as described. Since I was already running a
limited PXE boot environment to speed up installs, many of the
steps were already done, and I only had to set up rarpd and NFS.

However, when I now try to get the client actually to boot from
this setup, it fails quite miserably when trying to mount the
root filesystem via NFS. The kernel just hangs forever, printing
"RPC timeout for server 172.23.255.255 (0xac17ffff) prog 100000".
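
For reference, the hex value in that message is just the same
address written as a 32-bit number. A minimal Python sketch of
the decoding follows; the /16 prefix is an assumption on my part
(the netmask is not shown anywhere in this mail), picked because
172.23.255.255 is the broadcast address of 172.23.0.0/16.

```python
import ipaddress

# Address exactly as printed in the kernel message.
addr = ipaddress.IPv4Address(0xac17ffff)

# ASSUMPTION: the LAN is 172.23.0.0/16; the real netmask is not
# stated in this post.
net = ipaddress.ip_network("172.23.0.0/16")

print(addr)                           # dotted-quad form of 0xac17ffff
print(addr == net.broadcast_address)  # broadcast address of that network?
```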

After some research, I found an old posting on misc
(http://archives.neohapsis.com/archives/openbsd/2004-01/0603.html),
but it has no solution. The problem described there is quite
similar to the one I'm experiencing here, except that I have
none of the peculiarities of that setup (i.e., I'm using a stock
4.6-release, stock dhcpd, stock everything). In particular, my
client does the same thing as the Soekris in that old posting:
it tries to connect to the NFS server at the broadcast address
172.23.255.255 instead of 172.23.12.2, which would be the "real"
public address of the server. It _does_ connect to 172.23.12.2
during the original PXE bootstrap, but that may simply be because
dhcpd tells it to do so, as far as I understand the process.

Since the server also runs some other services, pf is running,
and I first suspected it to be the culprit. However, even with
"pass quick" for everything coming from that particular client,
nothing changes. tcpdump on the pflog interface shows the sunrpc
packets being passed, so I don't think it is a pf issue.
Disabling pf didn't change anything either.

rpcinfo(8) shows everything up and running:
| % rpcinfo -p
|    program vers proto   port
|     100000    2   tcp    111  portmapper
|     100000    2   udp    111  portmapper
|     100003    2   udp   2049  nfs
|     100003    3   udp   2049  nfs
|     100003    2   tcp   2049  nfs
|     100003    3   tcp   2049  nfs
|     100021    0   udp    759  nlockmgr
|     100021    1   udp    759  nlockmgr
|     100021    3   udp    759  nlockmgr
|     100021    4   udp    759  nlockmgr
|     100021    1   tcp    776  nlockmgr
|     100021    3   tcp    776  nlockmgr
|     100021    4   tcp    776  nlockmgr
|     100024    1   udp    992  status
|     100024    1   tcp    726  status
|     100005    1   udp    994  mountd
|     100005    3   udp    994  mountd
|     100005    1   tcp   1011  mountd
|     100005    3   tcp   1011  mountd

This includes the portmapper itself, which seems to be the
service that the client is unable to reach. At least, that's
how I interpret the "prog 100000" in the error message that
scrolls continuously on the client.
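
To double-check that reading, the rpcinfo listing can be searched
by program number. The following is only a small Python sketch;
the helper names parse_rpcinfo and lookup are made up for
illustration, and the input is a few lines copied from the
output above.

```python
# A few lines copied from the rpcinfo -p output above.
RPCINFO = """\
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100003    2   udp   2049  nfs
    100005    1   udp    994  mountd
"""

def parse_rpcinfo(text):
    """Return a list of (program, version, proto, port, name) tuples."""
    rows = []
    for line in text.splitlines()[1:]:      # skip the header line
        fields = line.split()
        if len(fields) == 5:
            prog, vers, proto, port, name = fields
            rows.append((int(prog), int(vers), proto, int(port), name))
    return rows

def lookup(rows, prog, proto):
    """Names and ports registered for one program number and protocol."""
    return sorted({(name, port) for p, _v, pr, port, name in rows
                   if p == prog and pr == proto})

rows = parse_rpcinfo(RPCINFO)
print(lookup(rows, 100000, "udp"))
```

Program 100000 over UDP maps to the portmapper on port 111,
which matches the destination port seen from the client.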

I have already tried using tcpdump to see what's going on, but
unfortunately its output doesn't tell me much:
| $ tcpdump -n -s 140 -i em0 host 172.23.13.138
| tcpdump: listening on em0, link-type EN10MB
| 01:29:31.853178 172.23.13.138.718 > 172.23.255.255.111: udp 96
| 01:29:36.853392 172.23.13.138.718 > 172.23.255.255.111: udp 96
| 01:29:41.853479 172.23.13.138.718 > 172.23.255.255.111: udp 96
(ad infinitum)

As far as I can see, the client sends a UDP packet to the
portmapper port (111) on the broadcast address, but never gets
a response.
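
One way to test the portmapper over UDP independently of the
booting kernel would be to send it a NULL call (procedure 0) from
another machine and see whether a reply comes back. The following
Python sketch does that under some assumptions: the function
names and the xid value are made up, it only targets a unicast
address, and it is not a reconstruction of the 96-byte packet the
client actually sends.

```python
import socket
import struct

PMAP_PROG = 100000   # portmapper program number, as in the rpcinfo output
PMAP_VERS = 2
PROC_NULL = 0        # the "do nothing" procedure every RPC program has

def build_null_call(xid):
    """ONC RPC call header: xid, CALL (0), RPC version 2, program,
    version, procedure, then an empty AUTH_NONE credential and
    verifier (flavor 0, length 0 each)."""
    return struct.pack(">10I", xid, 0, 2, PMAP_PROG, PMAP_VERS, PROC_NULL,
                       0, 0, 0, 0)

def is_success_reply(data, xid):
    """True if data is an accepted, successful reply to call xid."""
    if len(data) < 24:
        return False
    rxid, mtype, reply_stat, _flavor, vlen = struct.unpack(">5I", data[:20])
    if (rxid, mtype, reply_stat) != (xid, 1, 0):   # REPLY, MSG_ACCEPTED
        return False
    off = 20 + ((vlen + 3) & ~3)    # skip verifier body, padded to 4 bytes
    if len(data) < off + 4:
        return False
    (accept_stat,) = struct.unpack(">I", data[off:off + 4])
    return accept_stat == 0         # SUCCESS

def probe(host, port=111, timeout=2.0, xid=0x12345678):
    """Send one NULL call over UDP; True if a valid reply comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(build_null_call(xid), (host, port))
        try:
            data, _peer = s.recvfrom(1024)
        except OSError:             # includes the receive timeout
            return False
    return is_success_reply(data, xid)
```

Calling probe("172.23.12.2") from another host on the LAN would
then show whether the portmapper answers plain unicast UDP at
all. Sending to the broadcast address would additionally need
SO_BROADCAST set on the socket, which this sketch leaves out.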

Since it looks like an RPC/NFS issue, I tried to see whether
"normal" NFS access would show similar problems, so I booted the
same client from a Linux live CD and mounted the export from
there. This succeeded on the first try, so NFS seems to work, at
least in general. However, that straightforward NFS mount
connected to 172.23.13.2 (i.e., the "real" address of the
server), not the broadcast address. Trying to mount
172.23.255.255:/export/client instead resulted in the error
message "Network is unreachable", and nothing showed up in the
tcpdump above, which was still running at the time, so it may
well be that Linux simply refuses to make NFS connections to
the broadcast address.

The old mailing list posting mentioned above says that
rpc.bootparamd would be needed, but running it or not does not
make any difference (and http://www.netbsd.org/docs/network/netboot/intro.i386.html
kind of implies that rpc.bootparamd is not needed on i386, and
the manpage actively discourages it).


I'm quite at a loss now. I'm sure it's just some small thing
that I'm overlooking, or some interoperability issue between
parts of this setup, but I don't know where to look anymore.

Thanks in advance for any hints, or for just having the patience
to read through to the end. :o)

s//un
