Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed) (Reuti)

Derrick LIN Fri, 15 Apr 2011 17:02:25 -0400

>
> - what is your SGE configuration `qconf -sconf`?


#global:
execd_spool_dir              /var/spool/gridengine/execd
mailer                       /usr/bin/mail
xterm                        /usr/bin/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 bash,sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           root
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=false \
                             flush_time=00:00:15 joblog=false \
                             sharelog=00:00:00
finished_jobs                100
gid_range                    65400-65500
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 false
rlogin_daemon                /usr/sbin/sshd -i
rlogin_command               /usr/bin/ssh
qlogin_daemon                /usr/sbin/sshd -i
qlogin_command               /usr/share/gridengine/qlogin-wrapper
rsh_daemon                   /usr/sbin/sshd -i
rsh_command                  /usr/bin/ssh
jsv_url                      none
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w

# my queue setting is:

qname                 dev.q
hostlist              sgeqexec01.domain.com.au
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make orte
rerun                 FALSE
slots                 8
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

# my PE setting is:

pe_name            orte
slots              4
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
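
For reference, the two PE fields that tight integration with SGE is usually said to depend on are `control_slaves` and `job_is_first_task`. Below is a minimal sketch of a check for them. The function name `check_pe` is made up for illustration, and the expected values (TRUE and FALSE) are an assumption based on the commonly recommended Open MPI setup, not something confirmed for this cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE or Open MPI): reads a PE definition,
# as printed by `qconf -sp <pe_name>`, on stdin and checks two fields.
# Assumption: tight integration wants control_slaves TRUE and
# job_is_first_task FALSE.
check_pe() {
    awk '
        $1 == "control_slaves"    { cs = $2 }
        $1 == "job_is_first_task" { jf = $2 }
        END {
            ok = 1
            if (cs != "TRUE")  { print "control_slaves should be TRUE, found: " cs; ok = 0 }
            if (jf != "FALSE") { print "job_is_first_task should be FALSE, found: " jf; ok = 0 }
            if (ok) print "PE looks suitable for tight integration"
            exit !ok
        }
    '
}
```

With the function defined, `qconf -sp orte | check_pe` would run the check against the PE shown above.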


> a) you are testing from master to a node, but jobs are running between
> nodes.


> b) unless you need X11 forwarding, using SGE’s -builtin- communication
> works fine, this way you can have a cluster without `rsh` or `ssh` (or
> limited to admin staff) and can still run parallel jobs.
>

Sorry for the misleading snip. All the hosts (both the master and the execution
host) in the cluster can log in to each other without a password. As my 2)
states, I could run a generic OpenMPI job successfully without SGE. So I
do not think it is a communication issue.


> Then you are bypassing SGE’s slot allocation and will have wrong accounting
> and no job control of the slave tasks.
>

I know it's not a proper submission as a PE job. I simply ran out of ideas
about what to do next. Even though it's not the proper way, the OpenMPI error
didn't happen and the job completed. I am wondering why.


The correct version of my OpenMPI is 1.4.1, not 1.3 as stated in my first post.

I have since installed OpenMPI on the submission host and the master as well,
but it didn't help. So I guess OpenMPI is needed on the execution hosts only.
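
One thing that could be checked on each execution host is whether the OpenMPI build there lists any gridengine component at all. Below is a minimal sketch of that check. It assumes `ompi_info` prints its MCA components one per line with the component name in the text, and `has_gridengine` is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads `ompi_info` output on stdin and reports
# whether any line mentions a gridengine component.
# Assumption: a build with SGE support lists "gridengine" in that output.
has_gridengine() {
    if grep -q 'gridengine'; then
        echo "gridengine support found"
    else
        echo "no gridengine component listed"
        return 1
    fi
}
```

It would be run as `ompi_info | has_gridengine` on the execution host.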
