Hi,
I am having trouble running a batch job under SGE using Open MPI. I have read
the FAQ, which says that Open MPI will automatically do the right thing, but
something seems to be wrong.
Previously I used MPICH1 under SGE without any problems. I'm avoiding MPICH2
because it doesn't seem to support static compilation, whereas I was able to
build Open MPI with Open64 and compile my program statically.
However, I am having problems launching the job. According to the
documentation, I should be able to use a script file, qsub.sh, like this:
#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -q all.q
#$ -pe orte 18
MPI_DIR=/home/jason/openmpi-1.4.3-install/bin
"$MPI_DIR"/mpirun -np "$NSLOTS" myprog
Then I submit it:
$ qsub qsub.sh
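To rule out a slot-allocation problem, a check like the following could be
added to qsub.sh before the mpirun line. This is only a sketch: it assumes
SGE's usual $PE_HOSTFILE format (one line per host: hostname, slot count,
queue, processor range), and count_pe_slots is a hypothetical helper name,
not part of SGE or Open MPI.

```shell
#!/bin/bash
# Hypothetical helper: sum the slot counts (column 2) of an SGE PE hostfile.
# Assumes each line looks like: <hostname> <slots> <queue> <processor range>
count_pe_slots() {
    awk '{ total += $2 } END { print total + 0 }' "$1"
}

# Inside a job, SGE sets PE_HOSTFILE and NSLOTS; outside a job this is skipped.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    echo "NSLOTS=$NSLOTS, slots in hostfile=$(count_pe_slots "$PE_HOSTFILE")"
    cat "$PE_HOSTFILE"
fi
```

If the two numbers agree with the 18 slots requested, the allocation itself is
probably fine and the problem lies elsewhere.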
Previously, with MPICH1, I would have
-machinefile $TMP/machines
in the mpirun arguments, with the rest of the script the same except for
"-pe mpich 18", and it would work. The -machinefile argument doesn't seem to
work with the orte PE. The error in the job output file is:
[jason@juggling ~/amica_open64]$ cat qsub.sh.o7514
[compute-0-0.local:17792] *** An error occurred in MPI_Comm_rank
[compute-0-0.local:17792] *** on communicator MPI_COMM_WORLD
[compute-0-0.local:17792] *** MPI_ERR_COMM: invalid communicator
[compute-0-0.local:17792] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 17792 on
node compute-0-0.local exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[compute-0-0.local:17788] 8 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[compute-0-0.local:17788] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
I ran qconf and got the same output as in the documentation:
[jason@juggling ~/amica_open64]$ qconf -sp orte
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
The qconf mpich output is:
[jason@juggling ~/amica_open64]$ qconf -sp mpich
pe_name            mpich
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
Unlike the orte PE, the mpich PE uses specific scripts for start_proc_args and
stop_proc_args.
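Since the orte PE has no start/stop scripts, it presumably relies on Open MPI
having been built with SGE support. The Open MPI FAQ suggests verifying this
by looking for gridengine components in the ompi_info output. Here is a sketch
of that check, using the install prefix from qsub.sh above; has_gridengine is
a hypothetical helper name, and --with-sge is the configure option I
understand enables this support.

```shell
#!/bin/bash
# Hypothetical helper: succeed if ompi_info-style output on stdin lists a
# gridengine component (an indication that Open MPI was built with SGE support).
has_gridengine() {
    grep -q 'gridengine'
}

# Install prefix taken from qsub.sh; the check is skipped if it is not present.
OMPI_INFO=/home/jason/openmpi-1.4.3-install/bin/ompi_info
if [ -x "$OMPI_INFO" ]; then
    if "$OMPI_INFO" | has_gridengine; then
        echo "gridengine support found"
    else
        echo "no gridengine component; Open MPI may need rebuilding with --with-sge"
    fi
fi
```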
Am I missing something necessary to run Open MPI under SGE?
Thanks very much,
Jason