Hi, I am having trouble running a batch job in SGE using openmpi. I have read the FAQ, which says that openmpi will automatically do the right thing, but something seems to be wrong.
Previously I used MPICH1 under SGE without any problems. I'm avoiding MPICH2 because it doesn't seem to support static compilation, whereas I was able to get openmpi to compile with open64 and compile my program statically. But I am having problems launching.

According to the documentation, I should be able to have a script file, qsub.sh:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -q all.q
#$ -pe orte 18
MPI_DIR=/home/jason/openmpi-1.4.3-install/bin
/home/jason/openmpi-1.4.3-install/bin/mpirun -np $NSLOTS myprog

Then,

$ qsub qsub.sh

Previously with MPICH1 I would have -machinefile $TMP/machines in the mpirun arguments, and the rest of the script the same except -pe mpich 18, and it would work. The -machinefile argument doesn't seem to work in orte.

The error in qsub.sh.o is:

[jason@juggling ~/amica_open64]$ cat qsub.sh.o7514
[compute-0-0.local:17792] *** An error occurred in MPI_Comm_rank
[compute-0-0.local:17792] *** on communicator MPI_COMM_WORLD
[compute-0-0.local:17792] *** MPI_ERR_COMM: invalid communicator
[compute-0-0.local:17792] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 17792 on
node compute-0-0.local exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[compute-0-0.local:17788] 8 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[compute-0-0.local:17788] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

I ran qconf, and I get the same output as in the documentation:

[jason@juggling ~/amica_open64]$ qconf -sp orte
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

The qconf mpich output is:

[jason@juggling ~/amica_open64]$ qconf -sp mpich
pe_name            mpich
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

with specific scripts for start_proc_args and stop_proc_args ...

Am I missing something necessary to run openmpi under SGE?

Thanks very much,
Jason