Hi, I have a query. Sorry to bother you again.
Can you please let me know which function is called last when I send SIGINT to a running make (i.e. press Ctrl+C)? I want to add some logic there. Please help. This is urgent. Thanks in advance.

Nikhil

On Mon, 2 Sep 2019, 23:49 nikhil jain, <jainnikhi...@gmail.com> wrote:

> haha OK.
>
> If I were you, I would have built the lsmake functionality into GNU make and not paid IBM, lol.
>
> Anyways, have a good day. :)
>
> On Mon, Sep 2, 2019 at 11:45 PM David Boyce <david.s.bo...@gmail.com> wrote:
>
>> I did not suggest using lsmake, I simply mentioned that we use it.
>>
>> On Mon, Sep 2, 2019 at 11:04 AM nikhil jain <jainnikhi...@gmail.com> wrote:
>>
>>> Thanks for the detailed information.
>>>
>>> I will see if I can use a shell wrapper program as you suggested.
>>>
>>> I used LSF heavily for about 5 years and I still use it: bsub, bjobs, bkill, lim, sbatchd, mbatchd, etc. It is easy to understand and use.
>>>
>>> lsmake - I do not want to use IBM's proprietary stuff.
>>>
>>> Thanks for your suggestions.
>>>
>>> Nikhil
>>>
>>> On Mon, Sep 2, 2019 at 11:10 PM David Boyce <david.s.bo...@gmail.com> wrote:
>>>
>>>> I'm not going to address the remote execution topic, since it sounds like you already have a solution and are not looking for help there. However, I do have fairly extensive experience with the NFS/retry area, so I will try to contribute on that.
>>>>
>>>> First, I don't think what Paul says:
>>>>
>>>> > As for your NFS issue, another option would be to enable the .ONESHELL feature available in newer versions of GNU make: that will ensure that all lines in a recipe are invoked in a single shell, which means that they should all be invoked on the same remote host.
>>>>
>>>> is sufficient. Consider the typical case of compiling foo.c to foo.o and linking it into foo.exe. Typically, and correctly, those actions would be in two separate recipes, which in a distributed-build scenario could run on different hosts, so the linker may not find the .o file produced by the previous recipe. .ONESHELL cannot help here since they are different recipes.
>>>>
>>>> In my day job we use a product from IBM called LSF (Load Sharing F-something, https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_welcome.html), which exists to distribute jobs over a server farm (typically using NFS) according to various metrics such as load, free memory, and so on. Part of the LSF package is a program called lsmake (https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_command_ref/lsmake.1.html), which under the covers is a version of GNU make with enhancements to enable remote/distributed recipes; it also adds the retry-with-delay feature Nikhil requested. Since GNU make is GPL, IBM is required to make its package of enhancements available under the GPL as well. Much of it is not of direct interest to the open-source community because it's all about communicating with IBM's proprietary daemons, but their retry logic could probably be taken directly from the patch. At the very least, if retries were to be added to GNU make per se, it would be nice if the flags were compatible with lsmake.
>>>>
>>>> However, my personal belief is that retries are God's way of telling us to think harder and better. Retrying (and worse, delay-and-retry) is a form of defeatism which I call "sleep and hope".
>>>> Computers are deterministic; there's always a root cause, and it can usually be found and addressed with sufficient analysis. Granted, there are cases where you understand the problem but can't address it for administrative/permissions/business reasons, but you can't know that until the problem is understood.
>>>>
>>>> NFS caching is the root cause of unreliable distributed builds, as you've already described, but most or all of these issues can be addressed with a less blunt instrument than sleep-and-retry. Even the LSF engineers threw up their hands and did retries, but what we did here was take their patch, which at last check was still targeted at 3.81, and while porting it to 4.1 we added some of the cache-flushing strategies detailed below. This has solved most if not all of our NFS sync problems. Caveat: most of our people still use the LSF retry logic in addition, because they're not as absolutist as I am and just want to get their jobs done (go figure), which makes it harder to determine what percentage of problems is solved by cache flushing vs. retries, but I'm pretty sure flushing has resolved the great majority of them.
>>>>
>>>> One problem with sleep-and-hope is that no amount of time is guaranteed to be enough, so you're just driving the incidence rate down, not fixing it.
>>>>
>>>> Since we were already working with a hacked version of GNU make, we found it most convenient to implement flushing directly in the make program, but it can also be done within recipes. In fact we have three different implementations of the same NFS cache-flushing logic:
>>>>
>>>> 1. Directly within our enhanced version of lsmake.
>>>> 2. In a standalone binary called "nfsflush".
>>>> 3. In a Python script called nfsflush.py.
>>>>
>>>> The Python script is a lab for trying out new strategies, but it's too slow for production use. The binary is a faster version of the same techniques for direct use in recipes, and that same C code is linked directly into lsmake as well. Here's the usage message of our Python script:
>>>>
>>>> $ nfsflush.py --help
>>>> usage: nfsflush.py [-h] [-f] [-l] [-r] [-t] [-u] [-V] path [path ...]
>>>>
>>>> positional arguments:
>>>>   path             directory paths to flush
>>>>
>>>> optional arguments:
>>>>   -h, --help       show this help message and exit
>>>>   -f, --fsync      Use fsync() on named files
>>>>   -l, --lock       Lock and unlock the named files
>>>>   -r, --recursive  flush recursively
>>>>   -t, --touch      additional flush action - touch and remove a temp file
>>>>   -u, --upflush    flush parent directories back to the root
>>>>   -V, --verbose    increment verbosity level
>>>>
>>>> Flush the NFS filehandle caches of NFS directories. Newly created files are sometimes unavailable via NFS for a period of time due to filehandle caching, leading to apparent race problems. See http://tss.iki.fi/nfs-coding-howto.html#fhcache for details. This script forces a flush using techniques mentioned in the URL. It can optionally do so recursively. This always does an opendir/closedir sequence on each directory visited, as described in the URL, because that's cheap and safe and often sufficient. Other strategies, such as creating and removing a temp file, are optional.
>>>>
>>>> EXAMPLES: nfsflush.py /nfs/path/...
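
(For illustration only: a minimal Python sketch of the two flush actions the usage message above singles out, the always-performed opendir/closedir pass and the optional --touch directory write. This is not the actual nfsflush.py; the function name, temp-file prefix, and command-line handling here are assumptions.)

#!/usr/bin/env python3
import os
import sys
import tempfile

def flush_dir(path, touch=False):
    # Always: open and read the directory (an opendir/readdir/closedir pass),
    # which is usually enough to refresh the NFS filehandle cache for it.
    with os.scandir(path) as entries:
        for _ in entries:
            pass
    if touch:
        # Optional, like -t/--touch above: create and remove a temp file so
        # the client performs a directory write, which forces a cache flush.
        fd, tmp = tempfile.mkstemp(dir=path, prefix=".nfsflush-")
        os.close(fd)
        os.unlink(tmp)

if __name__ == "__main__":
    for p in sys.argv[1:]:
        flush_dir(p)

Saved as, say, flushdir.py (a made-up name), a recipe could run it on a target's directory before retrying a failed step, e.g. "python3 flushdir.py /nfs/path/to/objdir".
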
>>>> The most important thing is to read the URL given above and/or to google for similar resources, of which there are many. While I'm not an NFS guru myself, the summary of my understanding is that NFS caches all sorts of things (metadata like atime/mtime, directory updates, etc.) with various degrees of aggression according to NFS vendor and internal configuration. We've seen substantial variation between NAS providers such as NetApp, EMC, etc., so much depends on whose NFS server you're using. However, the NFS spec _requires_ that caches be flushed on a write operation, so all implementations will do this.
>>>>
>>>> Bottom line, the most common failure case is as mentioned above: foo.o is compiled on host A and immediately linked on host B. The close() system call following the final write() of foo.o on host A will cause its data to be flushed. Similarly, I *believe* the directory write (assuming foo.o is newly created and not just updated) will cause the filehandle cache to be flushed. Thus, after these two write ops (directory and file) the NFS server will know about the new foo.o as soon as it's created.
>>>>
>>>> The problem typically arises on host B: because no write operation has taken place there after foo.o was created on A, nothing has told host B to update its caches, so it doesn't know foo.o exists and the link fails with ENOENT. All the flushing techniques in the script above are attempts to address this. One takeaway from all this is that even if you do retries, a "dumb" retry is immeasurably enhanced by adding a flush. In other words, the most efficient retry formula in a distributed-build scenario would be:
>>>>
>>>> <recipe> || flush || <recipe>
>>>>
>>>> This never flushes a cache unless the first attempt fails. It presumes that NFS implementors and admins know what they're doing, and thus that caching helps performance, so flushing isn't done unless needed. This is what we built into our variant of lsmake. However, the same can also be done in the shell (see the sketch further down).
>>>>
>>>> Details about the implemented cache-flushing techniques: the filehandle cache is the biggest source of problems in distributed builds, and the simplest solution for it seems to be opening and reading the directory entry, so our script and its parallel C implementation always do that. We've also seen cases where forcing a directory write operation is required, which is what the -t, --touch option does. Sometimes you can't easily enumerate all the directories involved (vpath etc.), so the recurse-downward (-r) and recurse-upward (-u) flags may be helpful, though they (especially -u) may also be overkill. The -f and -l options were added based on advice found on the net but have not been shown to be helpful in our environment.
>>>>
>>>> Some techniques may be of limited utility because they require write and/or ownership privileges. For instance, I've seen statements that umounts, even failed umounts, will force flushes. Thus a command like "cd <dir> && umount $(pwd)" would have to fail, since the mount is busy, but would flush as a side effect. However, I believe this requires root privileges, so it is not helpful in the normal case.
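
(Again, for illustration only: a minimal Python sketch of the "<recipe> || flush || <recipe>" formula packaged as a shell-replacement wrapper of the SHELL=<wrapper> kind discussed in this thread. The script name, the single-retry policy, and flushing only the current working directory are assumptions, not the poster's nfsshell or lsmake behavior.)

#!/usr/bin/env python3
import os
import subprocess
import sys

def flush(path):
    # Cheap, safe flush: an opendir/readdir/closedir pass over the directory.
    with os.scandir(path) as entries:
        for _ in entries:
            pass

def main():
    argv = sys.argv[1:]
    # make normally runs "$SHELL -c 'recipe line'"; drop the -c if present.
    if argv and argv[0] == "-c":
        argv = argv[1:]
    recipe = " ".join(argv)

    status = subprocess.call(recipe, shell=True)       # <recipe>
    if status != 0:
        flush(os.getcwd())                             # || flush
        status = subprocess.call(recipe, shell=True)   # || <recipe>
    sys.exit(status)

if __name__ == "__main__":
    main()

If saved as an executable script (say nfsshell.py), it could be used much like Paul's SHELL=/my/special/sh example quoted below, e.g. "make SHELL=/path/to/nfsshell.py"; make passes each recipe line to such a wrapper as a single argument after -c.
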
>>>> In summary: although I don't believe in retries, if they're going to be used I think they should be implemented in a shell wrapper program which could be passed to make as SHELL=<wrapper>, and the wrapper should use flushing in addition to, or instead of, retries. We didn't do it that way, but I think our nfsflush program could just as well have been implemented as, say, "nfsshell", such that "nfsshell [other-options] -c <recipe>" would run the recipe along with the added flushing and retrying options. I agree with Paul that I see no reason to implement any of these features, retry and/or flush, directly in make.
>>>>
>>>> David
>>>>
>>>> On Mon, Sep 2, 2019 at 6:05 AM Paul Smith <psm...@gnu.org> wrote:
>>>>
>>>>> On Sun, 2019-09-01 at 23:23 -0700, Kaz Kylheku (gmake) wrote:
>>>>> > If your R&D team would allow you to add just one line to the legacy GNU Makefile to assign the SHELL variable, you can assign that to a shell wrapper program which performs command re-trying.
>>>>>
>>>>> You don't have to add any lines to the makefile. You can reset SHELL on the command line, just like any other make variable:
>>>>>
>>>>> make SHELL=/my/special/sh
>>>>>
>>>>> You can even override it only for specific targets using the --eval command line option:
>>>>>
>>>>> make --eval 'somerule: SHELL := /my/special/sh'
>>>>>
>>>>> Or, you can add '-f mymakefile.mk -f Makefile' options to the command line to force reading of a personal makefile before the standard makefile.
>>>>>
>>>>> Clearly you can modify the command line, otherwise adding new options to control a putative retry-on-error option would not be possible.
>>>>>
>>>>> As for your NFS issue, another option would be to enable the .ONESHELL feature available in newer versions of GNU make: that will ensure that all lines in a recipe are invoked in a single shell, which means that they should all be invoked on the same remote host. This can also be done from the command line, as above. If your recipes are written well it should Just Work. If they aren't, and you can't fix them, then obviously this solution won't work for you.
>>>>>
>>>>> Regarding changes to set re-invocation on failure, at this time I don't believe it's something I'd be willing to add to GNU make directly, especially not an option that simply retries every failed job. This is almost never useful (why would you want to retry a compile, or link, or similar? It will always just fail again, take longer, and generate confusing duplicate output, at best).
>>>>>
>>>>> The right answer for this problem is to modify the makefile to properly retry those specific rules which need it.
>>>>>
>>>>> I commiserate with you that your environment is static and you're not permitted to modify it; however, adding new specialized capabilities to GNU make so that makefiles don't have to be modified isn't a design philosophy I want to adopt.

_______________________________________________
Help-make mailing list
Help-make@gnu.org
https://lists.gnu.org/mailman/listinfo/help-make