Hi,
I am having trouble running a batch job under SGE using Open MPI. I have read
the FAQ, which says that Open MPI will automatically do the right thing, but
something seems to be wrong.

Previously I used MPICH1 under SGE without any problems. I'm avoiding MPICH2
because it doesn't seem to support static linking, whereas I was able to
build Open MPI with Open64 and link my program statically.

The problem is launching the job. According to the documentation, I should
be able to use a script file, qsub.sh, like this:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -q all.q
#$ -pe orte 18
MPI_DIR=/home/jason/openmpi-1.4.3-install/bin
"$MPI_DIR"/mpirun -np "$NSLOTS" myprog

Then I submit it with:
        $ qsub qsub.sh
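
In case it is useful for diagnosing this, here is a minimal sketch of a check
that could go at the top of qsub.sh to show what the job actually sees. The
helper name report_job_env is made up for illustration; NSLOTS and PE_HOSTFILE
are the variables SGE sets inside a parallel job.

```shell
#!/bin/bash
# Sketch: print what the job sees before mpirun is called.
# report_job_env is a hypothetical helper, not part of SGE or Open MPI.
report_job_env() {
    # SGE sets these inside a parallel job; outside of one they are
    # empty, so print a placeholder instead.
    echo "NSLOTS=${NSLOTS:-unset}"
    echo "PE_HOSTFILE=${PE_HOSTFILE:-unset}"
    # Show which mpirun, if any, would be found through PATH.
    echo "mpirun=$(command -v mpirun || echo 'not found in PATH')"
}
```

Calling report_job_env just before the mpirun line would write the three
values to the job output file.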

Previously with MPICH1 I would have

        -machinefile $TMP/machines

in the mpirun arguments, with the rest of the script the same except for
"-pe mpich 18", and it would work. The -machinefile argument doesn't seem to
work with orte. The error in the job output file, qsub.sh.o7514, is:

[jason@juggling ~/amica_open64]$ cat qsub.sh.o7514
[compute-0-0.local:17792] *** An error occurred in MPI_Comm_rank
[compute-0-0.local:17792] *** on communicator MPI_COMM_WORLD
[compute-0-0.local:17792] *** MPI_ERR_COMM: invalid communicator
[compute-0-0.local:17792] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 17792 on
node compute-0-0.local exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[compute-0-0.local:17788] 8 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[compute-0-0.local:17788] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages


I ran qconf and got the same output as in the documentation:

[jason@juggling ~/amica_open64]$ qconf -sp orte
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
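
To compare a PE definition against the documentation without reading it by
eye, something like the following sketch could be used. It only looks at
control_slaves and job_is_first_task, which are the two settings I understand
to matter for Open MPI's SGE integration; that choice is my assumption, and
check_pe is a hypothetical helper name, not an SGE command.

```shell
#!/bin/bash
# Sketch: read "qconf -sp <pe>" output on stdin and check two settings.
# check_pe is a hypothetical helper, not an SGE command.
check_pe() {
    awk '
        $1 == "control_slaves"    { cs = $2 }
        $1 == "job_is_first_task" { ft = $2 }
        END {
            printf "control_slaves=%s job_is_first_task=%s\n", cs, ft
            # Exit status 0 only if both have the values shown above.
            exit !(cs == "TRUE" && ft == "FALSE")
        }
    '
}
```

It would be run as "qconf -sp orte | check_pe"; the exit status is zero only
when control_slaves is TRUE and job_is_first_task is FALSE.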

The qconf output for the mpich PE is:

[jason@juggling ~/amica_open64]$ qconf -sp mpich
pe_name            mpich
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

Unlike orte, the mpich PE has specific scripts for start_proc_args and
stop_proc_args.
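
As far as I understand, the start script is what creates the $TMP/machines
file that MPICH1's -machinefile option reads, by expanding the PE hostfile
into one host name per slot. The following is only a sketch of that idea,
assuming each hostfile line has the form "host slots queue processors";
hostfile_to_machines is a hypothetical name and this is not the actual code
in startmpi.sh.

```shell
#!/bin/bash
# Sketch: expand a PE hostfile into an MPICH1-style machines list.
# Assumes each input line looks like "host slots queue processors".
# hostfile_to_machines is a hypothetical name, not taken from startmpi.sh.
hostfile_to_machines() {
    # "rest" absorbs the queue and processor fields, which are not needed.
    while read -r host slots rest; do
        [ -n "$host" ] || continue      # skip blank lines
        i=0
        # Write the host name once for every slot granted on it.
        while [ "$i" -lt "$slots" ]; do
            echo "$host"
            i=$((i + 1))
        done
    done < "$1"
}
```

Inside an mpich PE job this would be called with the PE hostfile path as its
only argument and its output redirected to the machines file.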

Am I missing something necessary to run Open MPI under SGE?

Thanks very much,
Jason
