Hi all,

I am trying to set up a small SGE cluster with OpenMPI integrated, but I am completely stuck when trying to submit an OpenMPI job to an SGE parallel environment (PE).

I mainly followed the guide sge-snow.pdf from Revolutions Computing and
http://idolinux.blogspot.com/2010/04/quick-install-of-open-mpi-with-grid.html

The cluster is entirely Ubuntu 10.10 based. Both SGE 6.2u5 and OpenMPI 1.3 come from apt-get, except that OpenMPI was rebuilt from source with the --with-sge flag.

Note: OpenMPI has been installed on all execution hosts, but not on the queue master or the submission host.

I submitted a job with

    qsub -pe orte 8 ./ompi_job.sh

The error I got looks like this:

=============================================================================================================================
[sgeqexec01:06612] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line 161
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[sgeqexec01:06612] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../orte/runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[sgeqexec01:06612] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../../../orte/tools/orterun/orterun.c at line 541
==============================================================================================================================

For troubleshooting, I have done the following:

1) Passwordless SSH has been configured properly between the execution hosts and the queue master.

    pwbcad@sgeqmast01:~$ ssh sgeqexec01 uptime
     14:35:54 up 2:47, 1 user, load average: 0.10, 0.08, 0.02

2) I can run an OpenMPI job outside SGE successfully.

    mpirun -host n1,n2 -np 8 ./ompi_job

3) I submitted the job directly to a queue instead of a PE, and it ran and completed successfully.

    qsub -q dev.q ./ompi_job.sh

4) Although I don't think PATH and LD_LIBRARY_PATH would cause issues on Ubuntu, I still added the OpenMPI binaries and libraries to both. It didn't help.

I would very much appreciate it if anyone could share their experience!

Derrick