Peter Eriksson wrote:
> I once reported that we had a server with many thousands (typically 23000
> or so per server) of ZFS filesystems (and 300+ snapshots per filesystem)
> where mountd was 100% busy reading and updating the kernel (and, while
> doing that, holding the NFS lock for a very long time) every hour, when we
> took snapshots of all the filesystems - the code in the zfs commands sends
> a lot of SIGHUPs to mountd, it seems....
>
> (Causing NFS users to complain quite a bit.)
>
> I have also seen the effect that, when there are a lot of updates to
> filesystems, some exports can get "missed" if mountd is bombarded with
> multiple SIGHUPs - but with the new incremental update code in mountd this
> window (for SIGHUPs to get lost) is much smaller (and I now also have a
> Nagios check that verifies that all exports in /etc/zfs/exports are also
> visible in the kernel).

I just put a patch up in PR#246597, which you might want to try.
> But while we had this problem I also investigated going to a DB-based
> exports "file" in order to make the code in the "zfs" commands that read
> and update /etc/zfs/exports a lot faster too. As Rick says, there is room
> for _huge_ improvements there.
>
> For every change of "sharenfs" per filesystem it would open, read and
> parse, line-by-line, /etc/zfs/exports *two* times and then rewrite the
> whole file. Now imagine doing that recursively for 23000 filesystems...
> My change to the zfs code simply opened a DB file and just did a "put" of
> a record for the filesystem (and then sent mountd a SIGHUP).

Just to clarify: if someone else can put Peter's patch in ZFS, I am willing
to put the required changes in mountd.

> (And even worse, when doing the boot-time "zfs share -a": for each
> filesystem it would open /etc/zfs/exports, read it line by line and check
> to make sure the filesystem isn't already in the file, then open a tmp
> file, write out all the old filesystems plus the new one, rename it to
> /etc/zfs/exports, send a SIGHUP, and then go on to the next one... Repeat.
> Pretty fast for 1-10 filesystems, not so fast for 20000+ ones... And it
> tests the boot disk I/O a bit :-)
>
> I have seen that the (ZFS-on-Linux) OpenZFS code has changed a bit
> regarding this, and I think for Linux they are going the route of directly
> updating the kernel instead of going via some external updater (like
> mountd).

The problem here is NFSv3, where something (currently mountd) needs to know
about this stuff, so it can do the Mount protocol (used for NFSv3 mounting
and done with Mount RPCs, not NFS ones).

> That probably would be an even better way (for ZFS), but a DB database
> might be useful anyway. It's a very simple change (especially in mountd -
> it just opens the DB file and reads the records sequentially instead of
> the text file).

I think what you have, which puts the info in a db file and then SIGHUPs
mountd, is a good start.
Again, if someone else can get this into ZFS, I can put the bits in mountd.

Thanks for posting this, rick

ps: Do you happen to know how long a reload of exports in mountd is
currently taking, with the patches done to it last year?

- Peter

On 2 Jun 2020, at 06:30, Rick Macklem <rmack...@uoguelph.ca> wrote:

Rodney Grimes wrote:
>> Hi, I'm posting this one to freebsd-net@ since it seems vaguely similar
>> to a network congestion problem and thought that network types might
>> have some ideas w.r.t. fixing it?
>>
>> PR#246597 reports a problem (which, if I understand it, is) where a
>> sighup is posted to mountd and then another sighup is posted to mountd
>> while it is reloading exports, and the exports are not reloaded again.
>> --> The simple patch in the PR fixes the above problem, but I think it
>>     will aggravate another one.
>> For some NFS servers, it can take minutes to reload the exports file(s).
>> (I believe Peter Eriksson has a server with 80000+ file systems
>> exported.) r348590 reduced the time taken, but it is still minutes, if I
>> recall correctly.

Actually, my recollection w.r.t. the times was way off. I just looked at
the old PR#237860 and, without r348590, it was 16 seconds (aka seconds, not
minutes), and with r348590 that went down to a fraction of a second (there
was no exact number in the PR, but I noted milliseconds in the commit log
entry). I still think there is a risk of doing the reloads repeatedly.

>> --> If you apply the patch in the PR and sighups are posted to mountd as
>>     often as it takes to reload the exports file(s), it will simply
>>     reload the exports file(s) over and over and over again, instead of
>>     processing Mount RPC requests.
>>
>> So, finally, to the interesting part...
>> - It seems that the code needs to be changed so that it won't "forget"
>>   sighup(s) posted to it, but it should not reload the exports file(s)
>>   too frequently.
>> --> My thoughts are something like:
>>     - Note that sighup(s) were posted while reloading the exports
>>       file(s) and do the reload again, after some minimum delay.
>>       --> The minimum delay might only need to be 1 second, to allow
>>           some RPCs to be processed before the reload happens again.
>>       Or
>>       --> The minimum delay could be some fraction of how long a reload
>>           takes. (The code could time the reload and use that to
>>           calculate how long to delay before doing the reload again.)
>>
>> Any ideas or suggestions? rick
>>
>> ps: I've actually known about this for some time, but since I didn't
>> have a good solution...

> Build a system that allows adding and removing entries from the
> in-mountd exports data, so that you do not have to do a full reload
> every time one is added or removed?
>
> Build a system that used 2 exports tables, the active one and the one
> that was being loaded, so that you can process RPCs and reloads at the
> same time.

Well, r348590 modified mountd so that it built a new set of linked-list
structures from the modified exports file(s) and then compared them with
the old ones, only doing updates to the kernel exports for changes. It
still processes the entire exports file each time, to produce the
in-mountd-memory linked lists (using hash tables and a binary tree).

Peter did send me a patch to use a db frontend, but he felt the only
performance improvements would be related to ZFS. Since ZFS is something I
avoid like the plague, I never pursued it. (If anyone willing to do ZFS
stuff wants to pursue this, just email me and I can send you the patch.)
Here's a snippet of what he said about it:

> It looks like a very simple patch to create, and even though it wouldn't
> really improve the speed of the work that mountd does, it would make
> possible really drastic speed improvements in the zfs commands. They (zfs
> commands) currently read the text-based exports file multiple times when
> you do work with zfs filesystems (mounting/sharing/changing share options
> etc).
> Using a db-based exports file for the zfs exports (b-tree based,
> probably) would allow the zfs code to be much faster.

At this point, I am just interested in fixing the problem in the PR, rick

_______________________________________________
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"