On 31.01.2009, at 08:49, Sangamesh B wrote:
On Fri, Jan 30, 2009 at 10:20 PM, Reuti <re...@staff.uni-marburg.de> wrote:

> On 30.01.2009, at 15:02, Sangamesh B wrote:
>
>> Dear Open MPI,
>>
>> Do you have a solution for the following problem with Open MPI (1.3) when run through Grid Engine? I changed the global execd params with H_MEMORYLOCKED=infinity and restarted sgeexecd on all nodes, but the problem still persists:
>>
>> $ cat err.77.CPMD-OMPI
>> ssh_exchange_identification: Connection closed by remote host
>
> I think this might already be the reason why it's not working. An MPI hello-world program runs fine through SGE?

No. Any Open MPI parallel job through SGE runs only if it stays on a single node (i.e. 8 processes on the 8 cores of one node). If the number of processes is more than 8, SGE will schedule it on 2 nodes, and the job fails with the above error.

Now I did a loose integration of Open MPI 1.3 with SGE. The job runs, but all 16 processes run on a single node.
What are the entries in `qconf -sconf` for:

rsh_command
rsh_daemon

What is your mpirun command in the job script - are you using the mpirun from Open MPI there? According to the output below, it's not a loose integration: you already prepare a machinefile, which is superfluous for Open MPI.
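For comparison: on an ssh-based installation those two entries typically look like the following (the paths are assumptions for illustration; a default installation uses SGE's built-in rsh mechanism instead):

```shell
$ qconf -sconf | grep rsh
rsh_command                  /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd -i
```

If these entries point at plain ssh but passwordless host-based login between the compute nodes is not set up, the launch of the remote daemons can fail with exactly the "ssh_exchange_identification: Connection closed by remote host" error seen above.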
$ cat out.83.Hello-OMPI
/opt/gridengine/default/spool/node-0-17/active_jobs/83.1/pe_hostfile
ibc17
ibc17
ibc17
ibc17
ibc17
ibc17
ibc17
ibc17
ibc12
ibc12
ibc12
ibc12
ibc12
ibc12
ibc12
ibc12
Greetings: 1 of 16 from the node node-0-17.local
Greetings: 10 of 16 from the node node-0-17.local
Greetings: 15 of 16 from the node node-0-17.local
Greetings: 9 of 16 from the node node-0-17.local
Greetings: 14 of 16 from the node node-0-17.local
Greetings: 8 of 16 from the node node-0-17.local
Greetings: 11 of 16 from the node node-0-17.local
Greetings: 12 of 16 from the node node-0-17.local
Greetings: 6 of 16 from the node node-0-17.local
Greetings: 0 of 16 from the node node-0-17.local
Greetings: 5 of 16 from the node node-0-17.local
Greetings: 3 of 16 from the node node-0-17.local
Greetings: 13 of 16 from the node node-0-17.local
Greetings: 4 of 16 from the node node-0-17.local
Greetings: 7 of 16 from the node node-0-17.local
Greetings: 2 of 16 from the node node-0-17.local

But `qhost -u <user name>` shows that it is scheduled/running on two nodes.

Anybody successful in running Open MPI 1.3 tightly integrated with SGE?
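For what it's worth, with a working tight integration the job script needs neither a machinefile nor an explicit host list. A minimal sketch (the PE name `orte`, slot count, and the mpirun path are assumptions based on this thread):

```shell
#!/bin/sh
#$ -pe orte 16
#$ -cwd -j y
# No -machinefile/-hostfile here: when Open MPI is built --with-sge,
# mpirun detects the SGE environment and takes the host/slot list
# from $PE_HOSTFILE itself.
/opt/mpi/openmpi/1.3/intel/bin/mpirun -np $NSLOTS ./hello
```

Passing a hand-built machinefile as well can cause mpirun to bypass the SGE allocation, which would explain all 16 processes landing on one node.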
For a tight integration there's a FAQ entry:

http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge

-- Reuti
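The core of the tight-integration recipe in that FAQ is a parallel environment with `control_slaves` enabled. A sketch of such a PE (adapt the slot count to your cluster):

```shell
$ qconf -sp orte
pe_name            orte
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
```

`control_slaves TRUE` is what allows Open MPI to start its daemons on the slave nodes under SGE's control (via `qrsh -inherit`); the PE must also be added to the `pe_list` of the queue (`qconf -mq <queue>`).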
Thanks, Sangamesh

--------------------------------------------------------------------------
A daemon (pid 31947) died unexpectedly with status 129 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
ssh_exchange_identification: Connection closed by remote host
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to the
"orte-clean" tool for assistance.
--------------------------------------------------------------------------
node-0-19.local - daemon did not report back when launched
node-0-20.local - daemon did not report back when launched
node-0-21.local - daemon did not report back when launched
node-0-22.local - daemon did not report back when launched

The hostnames for the InfiniBand interfaces are ibc0, ibc1, ibc2, ..., ibc23. Maybe Open MPI is not able to identify the hosts, as it is using node-0-... Is this causing Open MPI to fail?

Thanks, Sangamesh

On Mon, Jan 26, 2009 at 5:09 PM, mihlon <vacl...@fel.cvut.cz> wrote:

> Hi,
>
>> Hello SGE users,
>>
>> The cluster is installed with Rocks-4.3, SGE 6.0 & Open MPI 1.3. Open MPI is configured with "--with-sge".
>> ompi_info shows only one component:
>>
>> # /opt/mpi/openmpi/1.3/intel/bin/ompi_info | grep gridengine
>>                  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>>
>> Is this acceptable?
>
> Maybe yes, see: http://www.open-mpi.org/faq/?category=building#build-rte-sge
>
> shell$ ompi_info | grep gridengine
>                  MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.3)
>                  MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.3)
>
> (Specific frameworks and version numbers may vary, depending on your version of Open MPI.)

>> The Open MPI parallel jobs run successfully from the command line, but fail when run through SGE (with -pe orte <slots>). The error is:
>>
>> $ cat err.26.Helloworld-PRL
>> ssh_exchange_identification: Connection closed by remote host
>> --------------------------------------------------------------------------
>> A daemon (pid 8462) died unexpectedly with status 129 while attempting
>> to launch so we are aborting.
>> There may be more information reported by the environment (see above).
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> mpirun: clean termination accomplished
>>
>> But the same job runs well if it runs on a single node, though with an error:
>>
>> $ cat err.23.Helloworld-PRL
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. This will severely limit memory registrations.
>> --------------------------------------------------------------------------
>> WARNING: There was an error initializing an OpenFabrics device.
>>
>>   Local host:   node-0-4.local
>>   Local device: mthca0
>> --------------------------------------------------------------------------
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. This will severely limit memory registrations.
>> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes. This will severely limit memory registrations.
>> [node-0-4.local:07869] 7 more processes have sent help message help-mpi-btl-openib.txt / error in device init
>> [node-0-4.local:07869] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>
>> The following link explains the same problem:
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=72398
>>
>> With this reference, I put 'ulimit -l unlimited' into /etc/init.d/sgeexecd on all nodes and restarted the services.
>
> Do not set 'ulimit -l unlimited' in /etc/init.d/sgeexecd, but set it in SGE itself: run qconf -mconf and set execd_params:
>
> frontend$ qconf -sconf
> ...
> execd_params                 H_MEMORYLOCKED=infinity
> ...
>
> Then restart all your sgeexecd hosts.
>
> Milan

But still the problem persists. What could be the way out for this?

Thanks, Sangamesh

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=99133
To unsubscribe from this discussion, e-mail: [users-unsubscr...@gridengine.sunsource.net].
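One way to verify that the execd_params change actually reached the execution hosts is to query the limit from inside a test job (the output file name is arbitrary):

```shell
# Submit a one-line job; with H_MEMORYLOCKED=infinity active, the job's
# output should read "unlimited" rather than a small number of kbytes.
echo 'ulimit -l' | qsub -cwd -o memlock.out -j y
# ...wait for the job to finish, then:
cat memlock.out
```

Note that the limit must be raised where the job actually runs (under sgeexecd), which is why setting it only in interactive login shells does not help.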
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users