On 15.04.2011, at 23:02, Derrick LIN wrote:

> - what is your SGE configuration `qconf -sconf`?
>
> <snip>
> rlogin_daemon    /usr/sbin/sshd -i
> rlogin_command   /usr/bin/ssh
> qlogin_daemon    /usr/sbin/sshd -i
> qlogin_command   /usr/share/gridengine/qlogin-wrapper
> rsh_daemon       /usr/sbin/sshd -i
> rsh_command      /usr/bin/ssh
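(For comparison, these `ssh`-based entries replace SGE's builtin startup methods. A hedged sketch of reverting them to the builtin mechanism, run by an SGE admin user, would be the following; the `builtin` values are standard Grid Engine keywords, everything else here is illustrative:)

```shell
# Hypothetical sketch: open the global cluster configuration in an editor
# and switch the remote-startup methods back to SGE's builtin mechanism,
# so no sshd is involved in starting interactive or parallel tasks.
qconf -mconf
# then set in the editor:
#   rlogin_daemon   builtin
#   rlogin_command  builtin
#   qlogin_daemon   builtin
#   qlogin_command  builtin
#   rsh_daemon      builtin
#   rsh_command     builtin
```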
So you route the SGE startup mechanism to use `ssh`; nevertheless it should work, of course. A small difference from a conventional `ssh` is that SGE will start a private daemon for each job on the nodes, listening on a random port. When you use only one host, forks will be created but no `ssh` call. Does your test use more than one node? Did you copy your SGE-aware version to all nodes at the same location? Are you getting the correct `mpiexec` and shared libraries in your job script? Does the output of:

#!/bin/sh
which mpiexec
echo $LD_LIBRARY_PATH
ldd ompi_job

show the expected ones (ompi_job is the binary and ompi_job.sh the script) when submitted with a PE request?

-- Reuti

> jsv_url                      none
> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>
> # my queue setting is:
>
> qname                 dev.q
> hostlist              sgeqexec01.domain.com.au
> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               make orte
> rerun                 FALSE
> slots                 8
> tmpdir                /tmp
> shell                 /bin/bash
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            NONE
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
>
> # my PE setting is:
>
> pe_name             orte
> slots               4
> user_lists          NONE
> xuser_lists         NONE
> start_proc_args     /bin/true
> stop_proc_args      /bin/true
> allocation_rule     $round_robin
> control_slaves      TRUE
> job_is_first_task   FALSE
> urgency_slots       min
> accounting_summary  FALSE
>
> a) you are testing from master to a node, but jobs are running between nodes.
>
> b) unless you need X11 forwarding, using SGE's -builtin- communication works
> fine; this way you can have a cluster without `rsh` or `ssh` (or limited to
> admin staff) and can still run parallel jobs.
>
> Sorry for the misleading snip. All the hosts (both master and execution
> hosts) in the cluster can reach each other passwordlessly without an issue.
> As my 2) states, I could run a generic Open MPI job without SGE
> successfully. So I do not think it is a communication issue.
>
> Then you are bypassing SGE's slot allocation and will have wrong accounting
> and no job control of the slave tasks.
>
> I know it's not a proper submission as a PE job. I simply ran out of ideas
> about what to do next. Even though it's not the proper way, that Open MPI
> error didn't happen and the job completed. I am wondering why.
>
> The correct version of my Open MPI is 1.4.1, not 1.3 as in my first post.
>
> I installed Open MPI on the submission host and the master later, but it
> didn't help. So I guess Open MPI is needed on the execution hosts only.
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
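(To illustrate the proper PE submission discussed in this thread, a minimal job script might look like the sketch below. The `orte` PE and the `ompi_job`/`ompi_job.sh` names come from the thread itself; the specific `qsub` options are assumptions, not the poster's actual script:)

```shell
#!/bin/sh
# ompi_job.sh -- hypothetical sketch of a tightly integrated Open MPI job.
#$ -S /bin/sh
#$ -N ompi_test
#$ -cwd
#$ -pe orte 4        # request 4 slots from the "orte" PE shown above
# With SGE support compiled into Open MPI (--with-sge), mpiexec picks up
# the granted slot count and host list from SGE's environment
# ($NSLOTS, $PE_HOSTFILE), so no -np or hostfile argument is needed.
mpiexec ompi_job
```

Submitted with `qsub ompi_job.sh`; because the PE has `control_slaves TRUE`, Open MPI starts its remote daemons under SGE's control (via `qrsh -inherit`), so accounting and job control cover the slave tasks as well.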