This is a good sign, as Open MPI already tries to use `qrsh -inherit ...`. Can you
confirm the following settings:
$ qconf -sp orte
...
control_slaves TRUE
$ qconf -sq all.q
...
shell_start_mode unix_behavior
-- Reuti
qconf -sp orte
pe_name            orte
slots              448
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
and
qconf -sq all.q | grep start_
shell_start_mode posix_compliant
I've edited the parallel environment configuration using `qconf -mp orte`,
changing `control_slaves` to TRUE:
# qconf -sp orte
pe_name            orte
slots              448
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
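For scripted setups the same edit can also be made non-interactively; a minimal
sketch, assuming the standard qconf file options:

qconf -sp orte > /tmp/orte.pe                               # dump the current PE definition
sed -i 's/^control_slaves.*/control_slaves     TRUE/' /tmp/orte.pe
qconf -Mp /tmp/orte.pe                                      # load the modified definition back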
and I've changed `shell_start_mode` from `posix_compliant` to `unix_behavior`
using `qconf -mconf`. (However, `shell_start_mode` is still listed as
`posix_compliant`.)
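Note that `shell_start_mode` is an attribute of the queue (it shows up in
`qconf -sq all.q`), not of the global configuration that `qconf -mconf` edits,
which would explain why it is still listed as posix_compliant. Editing the
queue itself should pick it up; a sketch using the standard qconf options:

qconf -mq all.q     # opens the queue definition in $EDITOR;
                    # change: shell_start_mode    unix_behavior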
Now, `qsh -pe orte 4` works:
qsh -pe orte 4
Your job 84581 ("INTERACTIVE") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 84581 has been successfully scheduled.
(Should I run that command before running any new mpirun command?)
When invoking:
qsub -cwd -pe orte 7 with-a-shell.sh
or
qrsh -cwd -pe orte 100 /commun/data/packages/openmpi/bin/mpirun /path/to/a.out arg1 arg2 arg3 ....
that works too! Thank you! :-)
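For reference, a minimal sketch of what a wrapper script like with-a-shell.sh
could contain (hypothetical contents; with a tight SGE integration, mpirun
picks the slot count and host list up from the parallel environment, so it
needs no -np or hostfile):

#!/bin/sh
#$ -cwd
#$ -pe orte 7
# mpirun detects the SGE allocation and starts one rank per granted slot
/commun/data/packages/openmpi/bin/mpirun /path/to/a.out arg1 arg2 arg3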
queuename                      qtype resv/used/tot. load_avg arch       states
---------------------------------------------------------------------------------
all.q@node01                   BIP   0/15/64        2.76     lx24-amd64
  84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    15
---------------------------------------------------------------------------------
all.q@node02                   BIP   0/14/64        3.89     lx24-amd64
  84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
---------------------------------------------------------------------------------
all.q@node03                   BIP   0/14/64        3.23     lx24-amd64
  84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
---------------------------------------------------------------------------------
all.q@node04                   BIP   0/14/64        3.68     lx24-amd64
  84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
---------------------------------------------------------------------------------
all.q@node05                   BIP   0/15/64        2.91     lx24-amd64
  84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    15
---------------------------------------------------------------------------------
all.q@node06                   BIP   0/14/64        3.91     lx24-amd64
  84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
---------------------------------------------------------------------------------
all.q@node07                   BIP   0/14/64        3.79     lx24-amd64
  84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
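To see exactly where the individual ranks were started, `qstat -g t` (a
standard SGE option) lists the MASTER task and the SLAVE tasks that
`qrsh -inherit` launched on each queue instance:

qstat -g t     # one line per master/slave task instead of per-queue slot counts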
OK, my first Open MPI program works. But as far as I can see, it is
faster when invoked directly on the master node (~3 min 32 s wall clock) than
when launched through SGE (~7 min 45 s):
time /commun/data/packages/openmpi/bin/mpirun -np 7 /path/to/a.out
arg1 arg2 arg3 ....
670.985u 64.929s 3:32.36 346.5% 0+0k 16322112+6560io 32pf+0w
time qrsh -cwd -pe orte 7 /commun/data/packages/openmpi/bin/mpirun
/path/to/a.out arg1 arg2 arg3 ....
0.023u 0.036s 7:45.05 0.0% 0+0k 1496+0io 1pf+0w
I'm going to investigate this... :-) (Two effects seem worth checking: `time`
on the qrsh line measures the local qrsh client rather than the job itself,
which is why its CPU figures are near zero; and with allocation_rule
$round_robin the seven ranks are likely spread over seven nodes, so MPI
traffic goes over the network instead of node-local shared memory.)
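One way to test the placement hypothesis is to clone the PE with
allocation_rule $fill_up, so all seven ranks land on a single node, and time
the same run again; a sketch, assuming standard qconf options:

qconf -sp orte > /tmp/orte_fill.pe
sed -i -e 's/^pe_name.*/pe_name            orte_fill/' \
       -e 's/^allocation_rule.*/allocation_rule    $fill_up/' /tmp/orte_fill.pe
qconf -Ap /tmp/orte_fill.pe                     # register the new PE
qconf -aattr queue pe_list orte_fill all.q      # attach it to the queue
time qrsh -cwd -pe orte_fill 7 /commun/data/packages/openmpi/bin/mpirun /path/to/a.out arg1 arg2 arg3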
Thank you again
Pierre