Hi,

We have solved the problem by rewriting starter.sh... The script stayed the same except for the very last part, where the command is executed... Instead of a plain exec "$@", we replaced it with:

==========
# with execfail set, a failed exec on a non-script job does not exit the shell
shopt -s execfail

# start the job in this shell
exec "$@"

# if the job is not a shell script but a bash command, try to evaluate it
eval "$@"
==========

The error message still appears in the log file, but otherwise all seems ok...

Thanx
Ondrej
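PS: for completeness, the whole starter method now looks roughly like the sketch below. It is only a sketch: the setup at the top of our real script is site-specific, the paths are placeholders, and shopt/execfail require that the script is run by bash.

==========
#!/bin/bash
# SGE starter method (sketch only; placeholder paths)

# re-export what the jobs need, since the starter runs in a fresh shell
export PATH=/path/to/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/path/to/openmpi/lib:$LD_LIBRARY_PATH

# with execfail set, a failed exec does not terminate this non-interactive
# shell, so control can fall through to the eval below
shopt -s execfail

# plain job scripts and binaries are started directly
exec "$@"

# reached only when exec failed, e.g. for the compound command
# "( test ! -r ./.profile || . ./.profile; ... orted ... )" that the
# Open MPI rsh launcher builds for its qrsh -inherit calls; let the
# shell parse and run it instead
eval "$@"
==========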
Reuti wrote:
> On 01.12.2009 at 10:32, Ondrej Glembek wrote:
>
>> Just to add more info:
>>
>> Reuti wrote:
>>> On 30.11.2009 at 20:07, Ondrej Glembek wrote:
>>>
>>> But I think the real problem is that Open MPI assumes you are outside
>>> of SGE and so uses a different startup. Are you resetting any of SGE's
>>> environment variables in your custom starter method (like $JOB_ID)?
>>
>> Also, one of the reasons that makes me think that Open MPI knows it is
>> inside of SGE is the dump of mpiexec (below).
>>
>> The first four lines show that starter.sh is called from mpiexec,
>> having trouble with the (...) command...
>>
>> The last four lines show that mpiexec knows the machines it is
>> supposed to run on...
>>
>> Thanx
>>
>> /usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
>> /usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
>> /usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
>> /usr/local/share/SGE/util/starter.sh: line 9: exec: (: not found
>
> You are right. So the question remains: why is Open MPI building such a
> line at all.
>
> As you found the place in the source, it's done only for certain shells.
> And I would assume only in case of an rsh/ssh startup. When you put a
> `sleep 60` in your starter script: 1) it will of course delay the start
> of the program, but when it gets to 2) mpiexec, you should see some
> "qrsh -inherit ..." on the master node of the parallel job. Are these
> present?
>
> -- Reuti
>
>> --------------------------------------------------------------------------
>> A daemon (pid 30616) died unexpectedly with status 127 while attempting
>> to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>> the location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>>
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>>
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --------------------------------------------------------------------------
>>
>> blade57.fit.vutbr.cz - daemon did not report back when launched
>> blade39.fit.vutbr.cz - daemon did not report back when launched
>> blade41.fit.vutbr.cz - daemon did not report back when launched
>> blade61.fit.vutbr.cz - daemon did not report back when launched
>>
>>> -- Reuti
>>>
>>>> Thanx
>>>> Ondrej
>>>>
>>>> Reuti wrote:
>>>>> On 30.11.2009 at 18:46, Ondrej Glembek wrote:
>>>>>> Hi, thanx for the reply...
>>>>>>
>>>>>> I tried to dump the $@ before calling the exec and here it is:
>>>>>>
>>>>>> ( test ! -r ./.profile || . ./.profile;
>>>>>> PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/bin:$PATH ; export PATH ;
>>>>>> LD_LIBRARY_PATH=/homes/kazi/glembek/share/openmpi-1.3.3-64/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ;
>>>>>> /homes/kazi/glembek/share/openmpi-1.3.3-64/bin/orted -mca ess env
>>>>>> -mca orte_ess_jobid 3870359552 -mca orte_ess_vpid 1 -mca
>>>>>> orte_ess_num_procs 2 --hnp-uri "3870359552.0;tcp://147.229.8.134:53727"
>>>>>> --mca pls_gridengine_verbose 1 --output-filename mpi.log )
>>>>>>
>>>>>> It looks like the line gets constructed in
>>>>>> orte/mca/plm/rsh/plm_rsh_module.c and depends on the shell...
>>>>>>
>>>>>> Still I wonder why mpiexec calls the starter.sh... I thought the
>>>>>> starter was supposed to call the script which wraps a call to
>>>>>> mpiexec...
>>>>> Correct. This will happen for the master node of this job, i.e. where
>>>>> the jobscript is executed. But it will also be used for the qrsh
>>>>> -inherit calls. I wonder about one thing: I see only a call to
>>>>> "orted" and not the above sub-shell on my machines. Did you compile
>>>>> Open MPI with --with-sge?
>>>>> The original call above would be "ssh node_xy ( test ! ....)", which
>>>>> seems to work for ssh and rsh.
>>>>> Just one note: with the starter script you will lose the set PATH and
>>>>> LD_LIBRARY_PATH, as a new shell is created. It might be necessary to
>>>>> set them again in your starter method.
>>>>> -- Reuti
>>>>>>
>>>>>> Am I not right???
>>>>>> Ondrej
>>>>>>
>>>>>> Reuti wrote:
>>>>>>> Hi,
>>>>>>> On 30.11.2009 at 16:33, Ondrej Glembek wrote:
>>>>>>>> we are using a custom starter method in our SGE to launch our
>>>>>>>> jobs... It looks something like this:
>>>>>>>>
>>>>>>>> #!/bin/sh
>>>>>>>>
>>>>>>>> # ... we do a whole bunch of stuff here
>>>>>>>>
>>>>>>>> # start the job in this shell
>>>>>>>> exec "$@"
>>>>>>> the "$@" should be replaced by the path to the jobscript (qsub) or
>>>>>>> command (qrsh) plus the given options.
>>>>>>> For the tasks spread to other nodes I get as argument: " orted -mca
>>>>>>> ess env -mca orte_ess_jobid ...". Also no . ./.profile.
>>>>>>> So I wonder where the . ./.profile is coming from. Can you put a
>>>>>>> `sleep 60` or alike before the `exec ...` and grep the built line
>>>>>>> from `ps -e f` before it crashes?
>>>>>>> -- Reuti
>>>>>>>> The trouble is that mpiexec passes a command which looks like this:
>>>>>>>>
>>>>>>>> ( . ./.profile ..... )
>>>>>>>>
>>>>>>>> which, however, is not a valid exec argument...
>>>>>>>>
>>>>>>>> Is there any way to tell mpiexec to run it in a separate script???
>>>>>>>> Any idea how to solve this???
>>>>>>>>
>>>>>>>> Thanx
>>>>>>>> Ondrej Glembek

--

Ondrej Glembek, PhD student     E-mail: glem...@fit.vutbr.cz
UPGM FIT VUT Brno, L226         Web:    http://www.fit.vutbr.cz/~glembek
Bozetechova 2, 612 66           Phone:  +420 54114-1292
Brno, Czech Republic            Fax:    +420 54114-1290

ICQ: 93233896
GPG: C050 A6DC 7291 6776 9B69 BB11 C033 D756 6F33 DE3C