Hi,
On 19.02.2008 at 12:49, Neeraj Chourasia wrote:
I am facing a problem when calling mpirun in a loop under SGE.
My SGE version is SGE6.1AR_snapshot3. The script I am submitting
via SGE is:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
let i=0
while [ $i -lt 100 ]
do
echo "############################################################################################"
echo "Iteration :$i"
/usr/local/openmpi-1.2.4/bin/mpirun -np $NP -hostfile $TMP/machines send
let "i+=1"
echo "############################################################################################"
done
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The above script runs well for 15-20 iterations and then fails with
the following message:
In a Tight Integration in SGE, the qrsh on the slave node has to
perform some housekeeping, e.g. removing the $TMPDIR. (It would be an
RFE to do this only once at the end of the job: a) it saves time, and
b) some parallel jobs need a persistent $TMPDIR on the slave nodes,
which for now must be implemented by hand.)
Could you put a wait of some seconds after the mpirun before the next
iteration? Maybe it will help.
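
A minimal sketch of that suggestion, assuming a fixed pause is enough for the housekeeping on the slave nodes to finish. The function name run_iterations and the 5-second pause in the usage line are illustrative and not taken from the original script; the right pause length would have to be found by trial.

```shell
# run_iterations COUNT PAUSE COMMAND [ARGS...]
# Runs COMMAND COUNT times and sleeps PAUSE seconds after each run,
# so the next task is not started immediately after the previous one ends.
# (Hypothetical helper; the names are assumptions for illustration.)
run_iterations() {
    count=$1
    pause=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]
    do
        echo "Iteration :$i"
        "$@"                # e.g. the mpirun command line from the job script
        sleep "$pause"      # give the execd time to clean up before the next run
        i=$((i + 1))
    done
}
```

In the job script this could be called as: run_iterations 100 5 /usr/local/openmpi-1.2.4/bin/mpirun -np $NP -hostfile $TMP/machines send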
-- Reuti
-------------------------Error Message-------------------------
error: executing task of job 3869 failed: execution daemon on host "n101" didn't accept task
[n199:11989] ERROR: A daemon on node n101 failed to start as expected.
[n199:11989] ERROR: There may be more information available from
[n199:11989] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[n199:11989] ERROR: If the problem persists, please restart the
[n199:11989] ERROR: Grid Engine PE job
[n199:11989] ERROR: The daemon exited unexpectedly with status 1.
----------------------------------------------------------------
When I ssh to n101, there is no orted or qrsh_starter process running.
While checking its spool file, I came across the following message:
-------------------------Execd spool Error Message-------------------------
|execd|n101|E|no free queue for job 3869 of user neeraj@n199 (localhost = n101)
----------------------------------------------------------------------------
What could be the reason for this?
While searching the mailing list, I came across the following link:
http://www.open-mpi.org/community/lists/users/2007/03/2771.php
but I don't think it is the same problem. Any help is appreciated.
Regards
Neeraj