Are you able to run non-MPI programs like "hostname"? I ask because that error message indicates that everything started just fine, but there is an error in your application.
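A minimal sketch of that check. It reuses the SGE directives and install path from the qsub.sh quoted below, and launches `hostname` in place of `myprog`. The `MPIRUN` variable and the existence check are illustrative additions, not part of the original script.

```shell
#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -q all.q
#$ -pe orte 18

# Install path copied from the original qsub.sh; override by exporting MPIRUN.
MPIRUN=${MPIRUN:-/home/jason/openmpi-1.4.3-install/bin/mpirun}

if [ -x "$MPIRUN" ]; then
    # Launch a non-MPI program on every slot. Each process prints the node
    # it runs on, so the output should list the hosts SGE allocated.
    # NSLOTS is set by SGE; 1 is only a fallback for runs outside a job.
    "$MPIRUN" -np "${NSLOTS:-1}" hostname
else
    echo "mpirun not found at $MPIRUN" >&2
fi
```

If this prints one hostname per slot from the allocated nodes, the launch and the SGE integration are working, and the MPI_ERR_COMM abort is coming from the application itself.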
On Apr 6, 2011, at 6:01 PM, Jason Palmer wrote:

> Btw, I did compile openmpi with the --with-sge flag.
>
> I am able to compile a test program using openf90 with no errors or
> warnings. But when I try to run a test program that just calls
> MPI_INIT(ierr), then MPI_COMM_RANK(ierr), I get the following, whether
> static or linked, and whether run with mpirun or directly:
>
> [juggling.ucsd.edu:20218] *** An error occurred in MPI_Comm_rank
> [juggling.ucsd.edu:20218] *** on communicator MPI_COMM_WORLD
> [juggling.ucsd.edu:20218] *** MPI_ERR_COMM: invalid communicator
> [juggling.ucsd.edu:20218] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>
> Is there something missing in the Linux or parallel environment settings?
> Thanks.
>
> -----Original Message-----
> From: Jason Palmer [mailto:japalme...@gmail.com]
> Sent: Wednesday, April 06, 2011 4:09 PM
> To: 'Open MPI Users'
> Subject: SGE and openmpi
>
> Hi,
> I am having trouble running a batch job in SGE using openmpi. I have read
> the FAQ, which says that openmpi will automatically do the right thing, but
> something seems to be wrong.
>
> Previously I used MPICH1 under SGE without any problems. I'm avoiding MPICH2
> because it doesn't seem to support static compilation, whereas I was able to
> get openmpi to compile with open64 and compile my program statically.
>
> But I am having problems launching. According to the documentation, I should
> be able to have a script file, qsub.sh:
>
> #!/bin/bash
> #$ -cwd
> #$ -j y
> #$ -S /bin/bash
> #$ -q all.q
> #$ -pe orte 18
> MPI_DIR=/home/jason/openmpi-1.4.3-install/bin
> /home/jason/openmpi-1.4.3-install/bin/mpirun -np $NSLOTS myprog
>
> Then,
> $ qsub qsub.sh
>
> Previously with MPICH1 I would have
>
> -machinefile $TMP/machines
>
> in the mpirun arguments, and the rest of the script the same except
> -pe mpich 18, and it would work. The -machinefile argument doesn't seem
> to work in orte.
> The error in qsub.sh.o is:
>
> [jason@juggling ~/amica_open64]$ cat qsub.sh.o7514
> [compute-0-0.local:17792] *** An error occurred in MPI_Comm_rank
> [compute-0-0.local:17792] *** on communicator MPI_COMM_WORLD
> [compute-0-0.local:17792] *** MPI_ERR_COMM: invalid communicator
> [compute-0-0.local:17792] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 17792 on
> node compute-0-0.local exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [compute-0-0.local:17788] 8 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
> [compute-0-0.local:17788] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>
>
> I ran qconf, and I get the same output as in the documentation:
>
> [jason@juggling ~/amica_open64]$ qconf -sp orte
> pe_name            orte
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary TRUE
>
> The qconf mpich output is:
>
> [jason@juggling ~/amica_open64]$ qconf -sp mpich
> pe_name            mpich
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
> allocation_rule    $fill_up
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary TRUE
>
> with specific scripts for start_proc_args and stop_proc_args ...
>
> Am I missing something necessary to run openmpi under SGE?
>
> Thanks very much,
> Jason
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
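On the -machinefile question: with MPICH1, the machines file is written by the `mpich` PE's `start_proc_args` script (`/opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile`), while the `orte` PE runs `/bin/true`, so under `-pe orte` that file is never created. An Open MPI build with SGE support is expected to read the allocation by itself, so normally no machinefile is needed. If an explicit hostfile is still wanted, the following is a sketch, assuming the usual `$PE_HOSTFILE` layout (host, slot count, queue, processor range on each line). The function name `pe_to_hostfile` and the output path `$TMPDIR/ompi_hosts` are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: turn SGE's $PE_HOSTFILE into an Open MPI hostfile.
# Assumed input layout, one line per host:
#   <hostname> <slots> <queue> <processor-range>
# Output layout, one line per host:
#   <hostname> slots=<slots>
pe_to_hostfile() {
    awk 'NF >= 2 { print $1 " slots=" $2 }' "$1"
}

# Inside a job, SGE sets PE_HOSTFILE and TMPDIR; outside a job this is skipped.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    pe_to_hostfile "$PE_HOSTFILE" > "${TMPDIR:-/tmp}/ompi_hosts"
fi
```

The job script would then pass `-machinefile "$TMPDIR/ompi_hosts"` to mpirun alongside `-np $NSLOTS`. Because mpirun already sees the same allocation through its SGE support, this mainly helps when comparing against the old MPICH1 workflow.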