I'm not going to address the remote execution topic since it sounds like you already have the solution and are not looking for help. However, I do have fairly extensive experience with the NFS/retry area, so I will try to contribute there.
First, I don't think what Paul says:

> As for your NFS issue, another option would be to enable the .ONESHELL
> feature available in newer versions of GNU make: that will ensure that
> all lines in a recipe are invoked in a single shell, which means that
> they should all be invoked on the same remote host.

is sufficient. Consider the typical case of compiling foo.c to foo.o and linking it into foo.exe. Typically, and correctly, those actions would be in two separate recipes, which in a distributed-build scenario could run on different hosts, so the linker may not find the .o file from a previous recipe. Here .ONESHELL cannot help, since they're different recipes.

In my day job we use a product from IBM called LSF (Load Sharing F-something, https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_welcome.html) which exists to distribute jobs over a server farm (typically using NFS) according to various metrics like load and free memory and so on. Part of the LSF package is a program called lsmake (https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_command_ref/lsmake.1.html), which under the covers is a version of GNU make with enhancements to enable remote/distributed recipes, and which also adds the retry-with-delay feature Nikhil requested. Since GNU make is GPL, IBM is required to make its package of enhancements available under the GPL as well. Much of it is not of direct interest to the open-source community because it's all about communicating with IBM's proprietary daemons, but the retry logic could probably be taken directly from the patch. At the very least, if retries were to be added to GNU make per se, it would be nice if the flags were compatible with lsmake's.

However, my personal belief is that retries are God's way of telling us to think harder and better. Retrying (and worse, delay-and-retry) is a form of defeatism which I call "sleep and hope". Computers are deterministic; there's always a root cause which can usually be found and addressed with sufficient analysis. Granted, there are cases where you understand the problem but can't address it for administrative/permissions/business reasons, but that can't be known until the problem is understood. NFS caching is the root cause of unreliable distributed builds, as you've already described, but most or all of these issues can be addressed with a less blunt instrument than sleep-and-retry.

Even the LSF engineers threw up their hands and did retries, but what we did here was take their patch, which at last check was still targeted at 3.81, and, while porting it to 4.1, add some of the cache-flushing strategies detailed below. This has solved most if not all of our NFS sync problems. Caveat: most of our people still use the LSF retry logic in addition, because they're not as absolutist as I am and just want to get their jobs done (go figure), which makes it harder to determine what percentage of problems is solved by cache flushing vs. retries, but I'm pretty sure flushing has resolved the great majority of them. One problem with sleep-and-hope is that there's no amount of time guaranteed to be enough, so you're just driving the incidence rate down, not fixing it.

Since we were already working with a hacked version of GNU make we found it most convenient to implement flushing directly in the make program, but it can also be done within recipes. In fact we have 3 different implementations of the same NFS cache-flushing logic:

1. Directly within our enhanced version of lsmake.
2. In a standalone binary called "nfsflush".
3. In a Python script called nfsflush.py.

The Python script is a lab for trying out new strategies but it's too slow for production use. The binary is a faster version of the same techniques for direct use in recipes, and that same C code is linked directly into lsmake as well. Here's the usage message of our Python script:

    $ nfsflush.py --help
    usage: nfsflush.py [-h] [-f] [-l] [-r] [-t] [-u] [-V] path [path ...]

    positional arguments:
      path             directory paths to flush

    optional arguments:
      -h, --help       show this help message and exit
      -f, --fsync      Use fsync() on named files
      -l, --lock       Lock and unlock the named files
      -r, --recursive  flush recursively
      -t, --touch      additional flush action - touch and remove a temp file
      -u, --upflush    flush parent directories back to the root
      -V, --verbose    increment verbosity level

    Flush the NFS filehandle caches of NFS directories.

    Newly created files are sometimes unavailable via NFS for a period
    of time due to filehandle caching, leading to apparent race problems.
    See http://tss.iki.fi/nfs-coding-howto.html#fhcache for details.
    This script forces a flush using techniques mentioned in the URL. It
    can optionally do so recursively.

    This always does an opendir/closedir sequence on each directory
    visited, as described in the URL, because that's cheap and safe and
    often sufficient. Other strategies, such as creating and removing a
    temp file, are optional.

    EXAMPLES:
        nfsflush.py /nfs/path/...
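For recipes that can't call an in-house tool, those two cheap actions (the always-done directory read and the optional touch-and-remove) can be approximated with ordinary shell commands. The following is only a rough sketch of that idea, not our nfsflush code; the function name and the use of ls as a stand-in for opendir/closedir are mine:

    # flush_dir: refresh this client's cached view of an NFS directory.
    # Rough stand-in for the script's default action and its --touch option.
    flush_dir() {
        dir=${1:-.}
        # Reading the directory forces an opendir/readdir on this client;
        # cheap, safe, and often sufficient.
        ls -la "$dir" > /dev/null 2>&1
        # Optionally force a directory write (the -t/--touch idea) by
        # creating and removing a temp file; needs write permission.
        tmp="$dir/.nfsflush.$$"
        touch "$tmp" 2> /dev/null && rm -f "$tmp"
    }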
The most important thing is to read the URL given above and/or to google for similar resources, of which there are many. While I'm not an NFS guru myself, the summary of my understanding is that NFS caches all sorts of things (metadata like atime/mtime, directory updates, etc.) with varying degrees of aggression according to the NFS vendor and internal configuration. We've seen substantial variation between NAS providers such as NetApp, EMC, etc., so much depends on whose NFS server you're using. However, the NFS spec _requires_ that caches be flushed on a write operation, so all implementations will do that.

Bottom line, the most common failure case is as mentioned above: foo.o is compiled on host A and immediately linked on host B. The close() system call following the final write() of foo.o on host A will cause its data to be flushed. Similarly, I *believe* the directory write (assuming foo.o is newly created and not just updated) will cause the filehandle cache to be flushed. Thus, after these two write operations (directory and file) the NFS server will know about the new foo.o as soon as it's created. The problem typically arises on host B: no write operation has taken place there after foo.o was created on A, so nothing has told host B to update its caches; as a result it doesn't know foo.o exists and the link fails with ENOENT. All the flushing techniques in the script above are attempts to address this.

One takeaway from all this is that even if you do retries, a "dumb" retry is immeasurably enhanced by adding a flush. In other words, the most efficient retry formula in a distributed-build scenario would be:

    <recipe> || flush || <recipe>

This never flushes a cache unless the first attempt fails. It presumes that NFS implementors and admins know what they're doing and that caching helps with performance, so it isn't done unless needed. This is what we built into our variant of lsmake. However, the same can also be done in the shell.
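In a recipe, that would look roughly like the sketch below (flush_dir is the helper sketched above, or substitute the nfsflush binary; the link command is just the earlier foo example). The flush and the single retry happen only when the first attempt fails, and the overall exit status is that of the retry, so make still sees a genuine failure:

    # The "<recipe> || flush || <recipe>" idea spelled out: flush and
    # retry once, but only if the first attempt fails.
    cc -o foo.exe foo.o || { flush_dir . ; cc -o foo.exe foo.o ; }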
Details about the implemented cache-flushing techniques: the filehandle cache is the biggest source of problems in distributed builds, and the simplest solution for it seems to be opening and reading the directory entry. Thus our script and its parallel C implementation always do that. We've also seen cases where forcing a directory write operation is required, which the -t, --touch option does. Sometimes you can't easily enumerate all the directories involved (vpath etc.), so the recurse-downward (-r) and recurse-upward (-u) flags may be helpful, though they (especially -u) may also be overkill. The -f and -l options were added based on advice found on the net but have not been shown to be helpful in our environment.

Some techniques may be of limited utility because they require write and/or ownership privileges. For instance, I've seen statements that umounts, even failed umounts, will force flushes. Thus a command like "cd <dir> && umount $(pwd)" would have to fail, since the mount is busy, but would flush as a side effect. However, I believe this requires root privileges, so it's not helpful in the normal case.

In summary: although I don't believe in retries, if they're going to be used I think they should be implemented in a shell wrapper program which could be passed to make as SHELL=<wrapper>, and the wrapper should use flushing in addition to, or instead of, retries. We didn't do it that way, but I think our nfsflush program could just as well have been implemented as, say, "nfsshell", such that "nfsshell [other-options] -c <recipe>" would run the recipe along with added flushing and retrying options.
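To be clear, no such nfsshell exists; the following is only a hypothetical sketch of what a minimal wrapper along those lines might look like. It assumes the default .SHELLFLAGS of -c and uses a bare directory read where a real wrapper would presumably call nfsflush:

    #!/bin/sh
    # Hypothetical retry-with-flush wrapper, used as: make SHELL=/path/to/nfsshell
    # make runs each recipe line as `$SHELL -c '<line>'`, so "$@" below is the
    # -c flag plus the recipe text; hand it straight to the real shell.
    /bin/sh "$@" && exit 0
    # First attempt failed: flush the filehandle cache for the current
    # directory (a real wrapper might call nfsflush here), then retry once.
    ls -la . > /dev/null 2>&1
    exec /bin/sh "$@"

Since the wrapper exits with the status of the retry, genuine failures still stop the build as usual.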
I agree with Paul that I see no reason to implement any of these features, retry and/or flush, directly in make.

David

On Mon, Sep 2, 2019 at 6:05 AM Paul Smith <psm...@gnu.org> wrote:
> On Sun, 2019-09-01 at 23:23 -0700, Kaz Kylheku (gmake) wrote:
> > If your R&D team would allow you to add just one line to the
> > legacy GNU Makefile to assign the SHELL variable, you can assign that
> > to a shell wrapper program which performs command re-trying.
>
> You don't have to add any lines to the makefile. You can reset SHELL
> on the command line, just like any other make variable:
>
> make SHELL=/my/special/sh
>
> You can even override it only for specific targets using the --eval
> command line option:
>
> make --eval 'somerule: SHELL := /my/special/sh'
>
> Or, you can add '-f mymakefile.mk -f Makefile' options to the command
> line to force reading of a personal makefile before the standard
> makefile.
>
> Clearly you can modify the command line, otherwise adding new options
> to control a putative retry on error option would not be possible.
>
> As for your NFS issue, another option would be to enable the .ONESHELL
> feature available in newer versions of GNU make: that will ensure that
> all lines in a recipe are invoked in a single shell, which means that
> they should all be invoked on the same remote host. This can also be
> done from the command line, as above. If your recipes are written well
> it should Just Work. If they aren't, and you can't fix them, then
> obviously this solution won't work for you.
>
> Regarding changes to set re-invocation on failure, at this time I don't
> believe it's something I'd be willing to add to GNU make directly,
> especially not an option that simply retries every failed job. This is
> almost never useful (why would you want to retry a compile, or link, or
> similar? It will always just fail again, take longer, and generate
> confusing duplicate output--at best).
>
> The right answer for this problem is to modify the makefile to properly
> retry those specific rules which need it.
>
> I commiserate with you that your environment is static and you're not
> permitted to modify it, however adding new specialized capabilities to
> GNU make so that makefiles don't have to be modified isn't a design
> philosophy I want to adopt.