This is a good sign, as it tries to use `qrsh -inherit ...` already. Can you 
confirm the following settings:

$ qconf -sp orte
...
control_slaves     TRUE

$ qconf -sq all.q
...
shell_start_mode      unix_behavior

-- Reuti

   qconf -sp orte

   pe_name            orte
   slots              448
   user_lists         NONE
   xuser_lists        NONE
   start_proc_args    /bin/true
   stop_proc_args     /bin/true
   allocation_rule    $round_robin
   control_slaves     FALSE
   job_is_first_task  TRUE
   urgency_slots      min
   accounting_summary FALSE


and

   qconf -sq all.q | grep start_
   shell_start_mode      posix_compliant



I've edited the parallel environment configuration using `qconf -mp orte`, changing `control_slaves` to TRUE:


   # qconf -sp orte
   pe_name            orte
   slots              448
   user_lists         NONE
   xuser_lists        NONE
   start_proc_args    /bin/true
   stop_proc_args     /bin/true
   allocation_rule    $round_robin
   control_slaves     TRUE
   job_is_first_task  TRUE
   urgency_slots      min
   accounting_summary FALSE

and I've changed `shell_start_mode` from `posix_compliant` to `unix_behavior` using `qconf -mconf`. (However, `shell_start_mode` is still listed as `posix_compliant`.)
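Maybe I should edit it at the queue level instead, since `shell_start_mode` appears in the `qconf -sq all.q` output above. A sketch of what I have in mind, assuming all.q is the only queue that needs the change (not tried yet):

   # open the queue configuration of all.q in an editor and set
   #    shell_start_mode      unix_behavior
   qconf -mq all.q

   # then check that the change took effect
   qconf -sq all.q | grep shell_start_mode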

Now, `qsh -pe orte 4` works:

   qsh -pe orte 4
   Your job 84581 ("INTERACTIVE") has been submitted
   waiting for interactive job to be scheduled ...
   Your interactive job 84581 has been successfully scheduled.


(Should I run that command before running any new mpirun command?)

When invoking:

   qsub -cwd -pe orte 7 with-a-shell.sh

or

   qrsh -cwd -pe orte 100 /commun/data/packages/openmpi/bin/mpirun /path/to/a.out arg1 arg2 arg3 ....

that works too! Thank you! :-)


   queuename                      qtype resv/used/tot. load_avg arch          states
   ---------------------------------------------------------------------------------
   all.q@node01                   BIP   0/15/64        2.76     lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    15
   ---------------------------------------------------------------------------------
   all.q@node02                   BIP   0/14/64        3.89     lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node03                   BIP   0/14/64        3.23     lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node04                   BIP   0/14/64        3.68     lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node05                   BIP   0/15/64        2.91     lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    15
   ---------------------------------------------------------------------------------
   all.q@node06                   BIP   0/14/64        3.91     lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
   ---------------------------------------------------------------------------------
   all.q@node07                   BIP   0/14/64        3.79     lx24-amd64
      84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
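I haven't included with-a-shell.sh above; a minimal sketch of what such a job script could look like, assuming Open MPI was built with SGE support so that mpirun picks up the slot allocation (NSLOTS / PE_HOSTFILE) on its own:

   #!/bin/sh
   # sketch of a job script like with-a-shell.sh (the real one isn't shown here)
   #$ -cwd
   #$ -pe orte 7
   # with control_slaves TRUE, Open MPI reads the SGE allocation itself, so no
   # -np or hostfile is needed; remote ranks are started via qrsh -inherit
   /commun/data/packages/openmpi/bin/mpirun /path/to/a.out arg1 arg2 arg3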



OK, my first Open MPI program works. But as far as I can see, it is much faster when invoked directly on the master node (about 3 minutes 32 seconds) than when invoked by means of SGE (about 7 minutes 45 seconds):


   time /commun/data/packages/openmpi/bin/mpirun -np 7 /path/to/a.out arg1 arg2 arg3 ....
   670.985u 64.929s 3:32.36 346.5%    0+0k 16322112+6560io 32pf+0w

   time qrsh -cwd -pe orte 7 /commun/data/packages/openmpi/bin/mpirun /path/to/a.out arg1 arg2 arg3 ....
   0.023u 0.036s 7:45.05 0.0%    0+0k 1496+0io 1pf+0w
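(I realize that timing qrsh on the submit host measures the qrsh client itself, including the wait for the job to be scheduled, which is why its CPU figures are near zero; only the elapsed times are really comparable. As a first step I may time the mpirun on the execution side instead, with a small wrapper; a sketch, using a hypothetical name run-timed.sh:)

   #!/bin/sh
   # run-timed.sh -- hypothetical wrapper: time the mpirun on the execution
   # host, so SGE queueing/scheduling delay is excluded from the measurement
   #$ -cwd
   #$ -pe orte 7
   time /commun/data/packages/openmpi/bin/mpirun /path/to/a.out arg1 arg2 arg3

(submitted with `qsub run-timed.sh`; the timing then ends up in the job's stderr file.)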



I'm going to investigate this... :-)

Thank you again

Pierre

