haha OK. If I were you, I would have built the lsmake functionality into GNU make and not paid IBM lol.
Anyways, have a good day. :)

On Mon, Sep 2, 2019 at 11:45 PM David Boyce <david.s.bo...@gmail.com> wrote:

> I did not suggest using lsmake, I simply mentioned that we use it.
>
> On Mon, Sep 2, 2019 at 11:04 AM nikhil jain <jainnikhi...@gmail.com> wrote:
>
>> Thanks for the detailed information.
>>
>> I will see if I can use a shell wrapper program as you mentioned.
>>
>> I have used LSF a lot, for about 5 years, and I still use it: bsub, bjobs, bkill, lim, sbatchd, mbatchd, etc. It is easy to understand and use.
>>
>> lsmake - I do not want to use IBM's proprietary stuff.
>>
>> Thanks for your suggestions.
>>
>> Nikhil
>>
>> On Mon, Sep 2, 2019 at 11:10 PM David Boyce <david.s.bo...@gmail.com> wrote:
>>
>>> I'm not going to address the remote execution topic since it sounds like you already have a solution and are not looking for help. However, I do have fairly extensive experience with the NFS/retry area, so I will try to contribute there.
>>>
>>> First, I don't think what Paul says here is sufficient:
>>>
>>> > As for your NFS issue, another option would be to enable the .ONESHELL
>>> > feature available in newer versions of GNU make: that will ensure that
>>> > all lines in a recipe are invoked in a single shell, which means that
>>> > they should all be invoked on the same remote host.
>>>
>>> Consider the typical case of compiling foo.c to foo.o and linking it into foo.exe. Typically, and correctly, those actions would be in two separate recipes, which in a distributed-build scenario could run on different hosts, so the linker may not find the .o file produced by the previous recipe. .ONESHELL cannot help here since they are different recipes.
>>>
>>> In my day job we use a product from IBM called LSF (Load Sharing Facility, https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_welcome.html) which exists to distribute jobs over a server farm (typically using NFS) according to various metrics like load, free memory, and so on. Part of the LSF package is a program called lsmake (https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_command_ref/lsmake.1.html), which under the covers is a version of GNU make with enhancements that enable remote/distributed recipes (and which also add the retry-with-delay feature Nikhil requested). Since GNU make is GPL, IBM is required to make its package of enhancements available under the GPL as well. Much of it is not of direct interest to the open source community because it is all about communicating with IBM's proprietary daemons, but the retry logic could probably be taken directly from the patch. At the very least, if retries were ever added to GNU make per se, it would be nice if the flags were compatible with lsmake's.
>>>
>>> However, my personal belief is that retries are God's way of telling us to think harder and better. Retrying (and worse, delay-and-retry) is a form of defeatism which I call "sleep and hope". Computers are deterministic; there is always a root cause, which can usually be found and addressed with sufficient analysis. Granted, there are cases where you understand the problem but cannot address it for administrative/permissions/business reasons, but even that cannot be known until the problem is understood.
>>>
>>> NFS caching is the root cause of unreliable distributed builds, as you have already described, but most or all of these issues can be addressed with a less blunt instrument than sleep-and-retry.
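A rough sh-level sketch of that failure mode, following David's foo.c/foo.o/foo.exe example (the cc command lines are illustrative, not taken from the thread):

    # recipe for foo.o -- may be dispatched to host A; the final close()
    # flushes foo.o's data to the NFS server
    cc -c -o foo.o foo.c

    # recipe for foo.exe -- may be dispatched to host B moments later; B's NFS
    # caches may not yet know foo.o exists, so the link can fail with ENOENT
    cc -o foo.exe foo.o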
>>> Even the LSF engineers threw up their hands and did retries, but what we did here was take their patch, which at last check was still targeted at 3.81, and while porting it to 4.1 added some of the cache-flushing strategies detailed below. This has solved most if not all of our NFS sync problems. Caveat: most of our people still use the LSF retry logic in addition, because they are not as absolutist as I am and just want to get their jobs done (go figure), which makes it harder to determine what percentage of problems is solved by cache flushing vs. retries, but I'm pretty sure flushing has resolved the great majority of them.
>>>
>>> One problem with sleep-and-hope is that no amount of time is guaranteed to be enough, so you are only driving the incidence rate down, not fixing it.
>>>
>>> Since we were already working with a hacked version of GNU make, we found it most convenient to implement flushing directly in the make program, but it can also be done within recipes. In fact we have three different implementations of the same NFS cache-flushing logic:
>>>
>>> 1. Directly within our enhanced version of lsmake.
>>> 2. In a standalone binary called "nfsflush".
>>> 3. In a Python script called nfsflush.py.
>>>
>>> The Python script is a lab for trying out new strategies but is too slow for production use. The binary is a faster version of the same techniques for direct use in recipes, and that same C code is linked directly into lsmake as well. Here's the usage message of our Python script:
>>>
>>> $ nfsflush.py --help
>>> usage: nfsflush.py [-h] [-f] [-l] [-r] [-t] [-u] [-V] path [path ...]
>>>
>>> positional arguments:
>>>   path             directory paths to flush
>>>
>>> optional arguments:
>>>   -h, --help       show this help message and exit
>>>   -f, --fsync      Use fsync() on named files
>>>   -l, --lock       Lock and unlock the named files
>>>   -r, --recursive  flush recursively
>>>   -t, --touch      additional flush action - touch and remove a temp file
>>>   -u, --upflush    flush parent directories back to the root
>>>   -V, --verbose    increment verbosity level
>>>
>>> Flush the NFS filehandle caches of NFS directories.
>>>
>>> Newly created files are sometimes unavailable via NFS for a period
>>> of time due to filehandle caching, leading to apparent race problems.
>>> See http://tss.iki.fi/nfs-coding-howto.html#fhcache for details.
>>>
>>> This script forces a flush using techniques mentioned in the URL. It
>>> can optionally do so recursively.
>>>
>>> This always does an opendir/closedir sequence on each directory
>>> visited, as described in the URL, because that's cheap and safe and
>>> often sufficient. Other strategies, such as creating and removing a
>>> temp file, are optional.
>>>
>>> EXAMPLES:
>>>     nfsflush.py /nfs/path/...
>>>
>>> The most important thing is to read the URL given above and/or to google for similar resources, of which there are many. While I'm not an NFS guru myself, the summary of my understanding is that NFS caches all sorts of things (metadata like atime/mtime, directory updates, etc.) with varying degrees of aggressiveness according to NFS vendor and internal configuration. We've seen substantial variation between NAS providers such as NetApp, EMC, etc., so much depends on whose NFS server you're using.
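A minimal sh approximation of the two cheapest strategies that usage message describes, the always-on directory read plus the optional touch-and-remove of a temp file (the function name, the use of ls to force the opendir/readdir/closedir sequence, and the temp-file naming are illustrative assumptions, not code from nfsflush.py):

    # flush_dir: poke an NFS directory so stale filehandle/attribute caches refresh.
    flush_dir() {
        dir=${1:-.}
        # Directory read: cheap, safe, and often sufficient.
        ls -a "$dir" > /dev/null 2>&1
        # Directory write (the -t/--touch idea): forces a cache flush per the NFS spec.
        tmp="$dir/.nfsflush.$$"
        touch "$tmp" 2> /dev/null && rm -f "$tmp"
        return 0    # flushing is best-effort; don't let it mask the real error
    }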
>>> However, the NFS spec _requires_ that caches be flushed on a write operation, so all implementations will do that.
>>>
>>> Bottom line, the most common failure case is the one mentioned above: foo.o is compiled on host A and immediately linked on host B. The close() system call following the final write() of foo.o on host A will cause its data to be flushed. Similarly, I *believe* the directory write (assuming foo.o is newly created and not just updated) will cause the filehandle cache to be flushed. Thus, after these two write operations (directory and file), the NFS server knows about the new foo.o as soon as it's created.
>>>
>>> The problem typically arises on host B: because no write operation has taken place there after foo.o was created on A, nothing has told B to update its caches, so it doesn't know foo.o exists and the link fails with ENOENT. All the flushing techniques in the script above are attempts to address this. One takeaway is that even if you do retries, a "dumb" retry is immeasurably enhanced by adding a flush. In other words, the most efficient retry formula in a distributed-build scenario would be:
>>>
>>>     <recipe> || flush || <recipe>
>>>
>>> This never flushes a cache unless the first attempt fails. It presumes that NFS implementors and admins know what they're doing and that caching helps performance, so it isn't disturbed unless needed. This is what we built into our variant of lsmake, but the same can also be done in the shell.
>>>
>>> Details about the implemented cache-flushing techniques: the filehandle cache is the biggest source of problems in distributed builds, and the simplest remedy seems to be opening and reading the directory entry, so our script and its parallel C implementation always do that. We've also seen cases where forcing a directory write operation is required, which the -t, --touch option does. Sometimes you can't easily enumerate all the directories involved (vpath etc.), so the recurse-downward (-r) and recurse-upward (-u) flags may be helpful, though they (especially -u) may also be overkill. The -f and -l options were added based on advice found on the net but have not been shown to be helpful in our environment.
>>>
>>> Some techniques may be of limited utility because they require write and/or ownership privileges. For instance, I've seen statements that umounts, even failed umounts, will force flushes. Thus a command like "cd <dir> && umount $(pwd)" would have to fail, since the mount is busy, but would flush as a side effect. However, I believe this requires root privileges, so it's not helpful in the normal case.
>>>
>>> In summary: although I don't believe in retries, if they're going to be used I think they should be implemented in a shell wrapper program which could be passed to make as SHELL=<wrapper>, and the wrapper should use flushing in addition to, or instead of, retries. We didn't do it that way, but I think our nfsflush program could just as well have been implemented as, say, "nfsshell", such that "nfsshell [other-options] -c <recipe>" would run the recipe with added flushing and retrying options. I agree with Paul: I see no reason to implement any of these features, retry and/or flush, directly in make.
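A minimal sketch of such an "nfsshell" wrapper, under a couple of assumptions that are not spelled out in the mail: make invokes it as "nfsshell -c '<recipe line>'" (the default .SHELLFLAGS), and the standalone nfsflush binary is on PATH and accepts directory paths the way its Python counterpart does. This is not David's actual code.

    #!/bin/sh
    # nfsshell: drop-in SHELL replacement for make.
    # Run the recipe once; only if it fails, flush the NFS caches for the
    # current directory and retry once.
    if [ "$1" != "-c" ]; then
        exec /bin/sh "$@"       # not a recipe invocation; behave like plain sh
    fi
    recipe=$2

    /bin/sh -c "$recipe" && exit 0

    nfsflush .                  # flush only after a failure, per the formula above
    exec /bin/sh -c "$recipe"   # single retry; its exit status is what make sees

Note that it never flushes on the happy path, matching the <recipe> || flush || <recipe> formula rather than flushing unconditionally.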
>>>
>>> David
>>>
>>> On Mon, Sep 2, 2019 at 6:05 AM Paul Smith <psm...@gnu.org> wrote:
>>>
>>>> On Sun, 2019-09-01 at 23:23 -0700, Kaz Kylheku (gmake) wrote:
>>>> > If your R&D team would allow you to add just one line to the
>>>> > legacy GNU Makefile to assign the SHELL variable, you can assign that
>>>> > to a shell wrapper program which performs command re-trying.
>>>>
>>>> You don't have to add any lines to the makefile. You can reset SHELL
>>>> on the command line, just like any other make variable:
>>>>
>>>>     make SHELL=/my/special/sh
>>>>
>>>> You can even override it only for specific targets using the --eval
>>>> command line option:
>>>>
>>>>     make --eval 'somerule: SHELL := /my/special/sh'
>>>>
>>>> Or, you can add '-f mymakefile.mk -f Makefile' options to the command
>>>> line to force reading of a personal makefile before the standard
>>>> makefile.
>>>>
>>>> Clearly you can modify the command line, otherwise adding new options
>>>> to control a putative retry-on-error feature would not be possible.
>>>>
>>>> As for your NFS issue, another option would be to enable the .ONESHELL
>>>> feature available in newer versions of GNU make: that will ensure that
>>>> all lines in a recipe are invoked in a single shell, which means that
>>>> they should all be invoked on the same remote host. This can also be
>>>> done from the command line, as above. If your recipes are written well
>>>> it should Just Work. If they aren't, and you can't fix them, then
>>>> obviously this solution won't work for you.
>>>>
>>>> Regarding changes to add re-invocation on failure: at this time I don't
>>>> believe it's something I'd be willing to add to GNU make directly,
>>>> especially not an option that simply retries every failed job. That is
>>>> almost never useful (why would you want to retry a compile, or link, or
>>>> similar? It will always just fail again, take longer, and generate
>>>> confusing duplicate output--at best).
>>>>
>>>> The right answer for this problem is to modify the makefile to properly
>>>> retry those specific rules which need it.
>>>>
>>>> I commiserate with you that your environment is static and you're not
>>>> permitted to modify it; however, adding new specialized capabilities to
>>>> GNU make so that makefiles don't have to be modified isn't a design
>>>> philosophy I want to adopt.
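Tying Paul's mechanisms to the wrapper sketched earlier (the wrapper path is hypothetical; 'somerule' is Paul's placeholder target name), the flush-and-retry behaviour can be attached without touching the legacy makefile at all:

    make SHELL=/path/to/nfsshell                          # route every recipe through the wrapper
    make --eval 'somerule: SHELL := /path/to/nfsshell'    # or only the rules that actually need it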