Hi all,

we are seeing rpc.mountd crashes on our Red Hat EL4 systems. We have tracked down the bug, and it still seems to be present in the current nfs-utils source.
We are making extensive use of netgroups for NFS exports. On a large file server with hundreds of home directories we export every directory to a unique netgroup. Member netgroups are used to export to sets of machines. The following example illustrates what we do:

    # cat /etc/exports
    /export/home/jane @jane(async,rw,no_subtree_check,fsid=10000)
    /export/home/joe @joe(async,rw,no_subtree_check,fsid=10001)

    # cat /etc/netgroup
    lab_1 (workstation1,,) (workstation2,,) (workstation3,,)
    offices_1 (workstation4,,) (workstation5,,)
    jane lab_1 offices_1
    joe offices_1 (joeslaptop,,)

We do this on a much larger scale though.

The bug we ran into is in line 96 of utils/mountd/auth.c. The strcpy there can corrupt memory when it copies the string returned by client_compose() to my_client.m_hostname, which has a fixed size of 1024 bytes. For our example above, client_compose() returns "@joe,@jane" for any machine in the offices_1 netgroup. Unfortunately we have a machine that roughly 150 netgroups like @joe or @jane export to. For that machine client_compose() returns a string over 1300 bytes long, and rpc.mountd segfaults.

Preventing the crash is of course trivial: inserting a simple 'if (strlen(n) > 1024) return NULL;' before line 96 does the job. There are however two issues for which we could not find an easy solution:

1. For every client, rpc.mountd and the kernel seem to exchange and use lists of _all_ netgroups used in exports that are relevant for granting that client permission to some share. We could imagine two optimizations here:

   * Resolve netgroups and put only those (member) netgroups in the list that contain the host name that would be used to authorize a mount.
   * Use the list of mounted paths per client and put only the netgroup(s) in the list that are used to export paths actually mounted on that client.

   Exchanging the full list also caused us severe performance problems, because rpc.mountd queries all of these netgroups. We were initially using LDAP, and mounting a directory took up to ten seconds, during which rpc.mountd was busily querying the LDAP server. We got this down to two seconds by using file-based netgroups.

2. Using a fixed size for NFSCLNT_IDMAX does not scale. Mounting shares on a client for which the 'if' clause of the quick fix becomes true will not be possible. We thought about enlarging NFSCLNT_IDMAX and using a custom kernel, but dropped the idea.

Our ultimate goal is to get Red Hat to fix the code in nfs-utils 1.0.6, which is used in RHEL4. A first step would be to get a suitable fix into the current nfs-utils. Is there somebody on the mailing list who could see an easy fix, or who would have an opinion on how best to address the issues we see?

Thanks in advance and best regards,
Stefan
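
P.S. For reference, the length check behind the quick fix can be sketched as a small standalone C function. This is only an illustration under stated assumptions, not the actual nfs-utils code: the names ID_BUF_SIZE, struct client_id and set_client_id() are made up for this example, and the buffer size is simply taken to be 1024 bytes as described above. The point is that the composed netgroup list is rejected when it would not fit, instead of being copied blindly with strcpy.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the fixed-size client name buffer. */
#define ID_BUF_SIZE 1024

struct client_id {
    char hostname[ID_BUF_SIZE];
};

/*
 * Copy the composed netgroup list into the fixed-size buffer.
 * Returns 0 on success, or -1 if the string plus its terminating
 * NUL would not fit. On failure the destination is left untouched.
 */
static int set_client_id(struct client_id *dst, const char *composed)
{
    size_t len;

    assert(dst != NULL && composed != NULL);

    len = strlen(composed);
    if (len >= sizeof(dst->hostname))
        return -1;                      /* reject instead of overflowing */

    memcpy(dst->hostname, composed, len + 1);   /* len + 1 copies the NUL */
    return 0;
}
```

A caller would treat a return value of -1 the same way the quick fix treats an oversized string, i.e. refuse the request. As noted under issue 2, this only avoids the crash; a client whose netgroup list exceeds the buffer still cannot mount anything.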