Hi Prakashan,

I believe this may be caused by a PE (parallel environment) setting. Could you try this:

Change this parameter in the 'orte' parallel environment from:
> job_is_first_task TRUE
to:
> job_is_first_task FALSE

If this is set to TRUE, SGE counts the job script itself as one of the parallel tasks, which takes away one of the slots available to your job. That might prevent an SGE 'task' from launching on one of your SGE nodes.
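One non-interactive way to make that change is sketched below, assuming you have SGE manager privileges and that the PE is named 'orte' as in your output. The helper names flip_first_task and apply_fix are made up for illustration; qconf -sp (show PE) and qconf -Mp (modify PE from a file) are the standard qconf options. Running 'qconf -mp orte' and editing the line by hand should work just as well.

```shell
#!/bin/sh
# Hypothetical helper: read a PE definition on stdin and write it back
# with job_is_first_task switched from TRUE to FALSE. Other lines pass through.
flip_first_task() {
    sed 's/^\(job_is_first_task[[:space:]]*\)TRUE[[:space:]]*$/\1FALSE/'
}

# Hypothetical wrapper: dump the named PE, flip the flag, load it back.
# Not called here; it needs a working SGE installation and manager rights.
apply_fix() {
    pe="$1"
    tmp=$(mktemp) || return 1
    qconf -sp "$pe" | flip_first_task > "$tmp" && qconf -Mp "$tmp"
    status=$?
    rm -f "$tmp"
    return $status
}

# On the cluster:  apply_fix orte
# Then confirm:    qconf -sp orte | grep job_is_first_task
```

Going through a file rather than the interactive editor makes it easy to review the new definition before loading it.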

Korambath, Prakashan wrote:
Hi,

I just compiled Open MPI version 1.2.5 with the following options:


./configure --prefix=/u/local/mpi/openmpi/1.2.5 --with-openib=/usr/local --enable-static --disable-shared CC=icc CXX=icpc F77=ifort FC=ifort --with-sge

on an x86_64 machine with an InfiniBand interconnect, OFED software, and CentOS 5.

Everything works fine when I launch jobs from the command line, but when I submit through SGE 6.1U3 I get the following error:

error: executing task of job 23081 failed:
[n99:01442] ERROR: A daemon on node n99 failed to start as expected.
[n99:01442] ERROR: There may be more information available from
[n99:01442] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[n99:01442] ERROR: If the problem persists, please restart the
[n99:01442] ERROR: Grid Engine PE job
[n99:01442] ERROR: The daemon exited unexpectedly with status 1.


In my SGE job script I have:
#$ -pe orte 2


/u/local/mpi/openmpi/1.2.5/bin/mpiexec -n 2 -machinefile $TMPDIR/nodefile  \
         /u/home2/ppk/MPI/C/executablename  >& output



n99:/work/23081.1.campus.q {1002}$ cat nodefile
n99  slots=1
n15  slots=1


n99:/work/23081.1.campus.q {1003}$ qconf -sp orte
pe_name           orte
slots             360
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task TRUE
urgency_slots     min


I am combing through the archives for similar errors. I have seen a few reports, but no satisfactory answer. Does anyone know why this happens?



i02:/u/local/mpi/openmpi/1.2.5/bin {1049}$ ./ompi_info | grep tm
              MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.5)

I also tried the pre-release 1.2.6rc3, with the same results.


Prakashan





------------------------------------------------------------------------

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--


- Pak Lui
pak....@sun.com
