Re: [OMPI users] I have still a problem with rankfiles in openmpi-1.6.4rc3

2013-02-08 Thread Siegmar Gross
Hi

Today I tried a different rankfile and ran into a problem once more. :-((

> > thank you very much for your patch. I have applied the patch to
> > openmpi-1.6.4rc4.
> > 
> > Open MPI: 1.6.4rc4r28022
> > : [B .][. .] (slot list 0:0)
> > : [. B][. .] (slot list 0:1)
> > : [B B][. .] (slot list 0:0-1)
> > : [. .][B .] (slot list 1:0)
> > : [. .][. B] (slot list 1:1)
> > : [. .][B B] (slot list 1:0-1)
> > : [B B][B B] (slot list 0:0-1,1:0-1)
> 
> That looks great.  I'll file a CMR to get this patch into 1.6.
> Unless you indicate otherwise, I'll assume this issue is understood 
> for 1.6.

Rankfile rf_6 is the same as last time. I have added one more
line in rf_7 and I switched the sequence of the hosts in rf_8.
Everything is still fine with rf_6. I don't get any output for
rank 1 with rf_7 and I get an error for rf_8. Both machines
use the same hardware.


sunpc1 rankfiles 106 cat rf_6
# mpiexec -report-bindings -rf rf_6 hostname
rank 0=sunpc1 slot=0:0-1,1:0-1

sunpc1 rankfiles 107 cat rf_7
# mpiexec -report-bindings -rf rf_7 hostname
rank 0=sunpc1 slot=0:0-1,1:0-1
rank 1=sunpc0 slot=0:0-1

sunpc1 rankfiles 108 cat rf_8
# mpiexec -report-bindings -rf rf_8 hostname
rank 0=sunpc0 slot=0:0-1,1:0-1
rank 1=sunpc1 slot=0:0-1
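
For reference, what each of the rankfiles above asks for can be spelled out with a few lines of Python. This is only a minimal sketch, not Open MPI's own parser: it handles just the `rank N=host slot=socket:core-range` form used in this thread, and the helper names `expand_slot_list` and `parse_rankfile` are made up for illustration.

```python
import re

# Matches lines such as "rank 0=sunpc1 slot=0:0-1,1:0-1".
RANK_LINE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)$")


def expand_slot_list(slot_list):
    """Expand "0:0-1,1:0-1" into a sorted list of (socket, core) pairs."""
    pairs = set()
    for item in slot_list.split(","):
        socket, cores = item.split(":")
        first, _, last = cores.partition("-")
        last = last or first  # "0:1" names a single core
        for core in range(int(first), int(last) + 1):
            pairs.add((int(socket), core))
    return sorted(pairs)


def parse_rankfile(text):
    """Return {rank: (host, [(socket, core), ...])}, skipping comment lines."""
    ranks = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = RANK_LINE.match(line)
        if match is None:
            raise ValueError("unrecognised rankfile line: " + line)
        rank, host, slots = match.groups()
        ranks[int(rank)] = (host, expand_slot_list(slots))
    return ranks


# Contents of rf_7 as shown above.
RF_7 = """\
# mpiexec -report-bindings -rf rf_7 hostname
rank 0=sunpc1 slot=0:0-1,1:0-1
rank 1=sunpc0 slot=0:0-1
"""

for rank, (host, cores) in sorted(parse_rankfile(RF_7).items()):
    print(rank, host, cores)
```

Listing the hosts this way makes it easy to see that rf_7 and rf_8 differ only in which host carries rank 0.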


sunpc1 rankfiles 109 mpiexec -report-bindings -rf rf_6 hostname
[sunpc1:09779] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
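
The mask printed by -report-bindings can be reproduced from the slot list, which helps when checking a rankfile by hand. The sketch below assumes two sockets with two cores each, as the masks in this thread suggest; `binding_mask` is a hypothetical helper, not part of Open MPI.

```python
def binding_mask(slot_list, sockets=2, cores_per_socket=2):
    """Render a slot list such as "0:0-1" as a mask like "[B B][. .]"."""
    bound = set()
    for item in slot_list.split(","):
        socket, cores = item.split(":")
        first, _, last = cores.partition("-")
        for core in range(int(first), int(last or first) + 1):
            bound.add((int(socket), core))

    groups = []
    for socket in range(sockets):
        # "B" marks a core the rank is bound to, "." an unbound core.
        cells = ["B" if (socket, core) in bound else "."
                 for core in range(cores_per_socket)]
        groups.append("[" + " ".join(cells) + "]")
    return "".join(groups)


for slots in ("0:0", "0:1", "0:0-1", "1:0", "1:1", "1:0-1", "0:0-1,1:0-1"):
    print(binding_mask(slots), "(slot list " + slots + ")")
```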

sunpc1 rankfiles 110 mpiexec -report-bindings -rf rf_7 hostname
[sunpc1:09782] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)

sunpc1 rankfiles 111 mpiexec -report-bindings -rf rf_8 hostname
--
The rankfile that was used claimed that a host was either not
allocated or oversubscribed its slots.  Please review your rank-slot
assignments and your host allocation to ensure a proper match.  Also,
some systems may require using full hostnames, such as
"host1.example.com" (instead of just plain "host1").

  Host: sunpc0
--



I get the following output if I use sunpc0 as the local host.

sunpc0 rankfiles 102 mpiexec -report-bindings -rf rf_6 hostname
--
All nodes which are allocated for this job are already filled.
--

sunpc0 rankfiles 103 mpiexec -report-bindings -rf rf_7 hostname
--
The rankfile that was used claimed that a host was either not
allocated or oversubscribed its slots.  Please review your rank-slot
assignments and your host allocation to ensure a proper match.  Also,
some systems may require using full hostnames, such as
"host1.example.com" (instead of just plain "host1").

  Host: sunpc1
--

sunpc0 rankfiles 104 mpiexec -report-bindings -rf rf_8 hostname
[sunpc0:19027] MCW rank 0 bound to socket 0[core 0-1]
  socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)


I get the following output if I use tyr as the local host.

tyr rankfiles 218 mpiexec -report-bindings -rf rf_6 hostname
--
All nodes which are allocated for this job are already filled.
--

tyr rankfiles 219 mpiexec -report-bindings -rf rf_7 hostname
--
All nodes which are allocated for this job are already filled.
--

tyr rankfiles 220 mpiexec -report-bindings -rf rf_8 hostname
--
All nodes which are allocated for this job are already filled.
--



Do you have any idea why this happens? Thank you very much in
advance for any help.


Kind regards

Siegmar



[OMPI users] newbie: Submitting Open MPI jobs to SGE ( `qsh -pe orte 4` fails)

2013-02-08 Thread Pierre LINDENBAUM
( cross-posted on SO: http://stackoverflow.com/questions/14775451 )

Hi,
I'm very new to Open MPI and I'm trying to submit Open MPI jobs to SGE:


I've installed Open MPI, not in
  /usr/...
but in
   /commun/data/packages/openmpi/

it was compiled with --with-sge.

I've added a new PE in SGE with qconf as described in
http://docs.oracle.com/cd/E19080-01/n1.grid.eng6/817-5677/6ml49n2c0/index.html

  # /commun/data/packages/openmpi/bin/ompi_info | grep gridengine
  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.3)

  # qconf -sq all.q | grep pe_
  pe_list   make orte

Without SGE, the program runs without any problem, using several processors.

   /commun/data/packages/openmpi/bin/orterun -np 20 ./a.out args

Now I want to submit my program to SGE

In the Open MPI FAQ, I read:

  # Allocate a SGE interactive job with 4 slots
  # from a parallel environment (PE) named 'orte'
  shell$ qsh -pe orte 4

but my output is:

   qsh -pe orte 4
   Your job 84550 ("INTERACTIVE") has been submitted
   waiting for interactive job to be scheduled ...
   Could not start interactive job.

I've also tried the mpirun command embedded in a script:

   $ cat ompi.sh
   #!/bin/sh
   /commun/data/packages/openmpi/bin/mpirun  \
 /path/to/a.out args

but it fails

  $ cat ompi.sh.e84552
  error: executing task of job 84552 failed: execution daemon on host "node02" didn't accept task
   --
  A daemon (pid 18327) died unexpectedly with status 1 while attempting
  to launch so we are aborting.

  There may be more information reported by the environment (see above).

  This may be because the daemon was unable to find all the needed shared
  libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
  location of the shared libraries on the remote nodes and this will
  automatically be forwarded to the remote nodes.
  --
  error: executing task of job 84552 failed: execution daemon on host "node01" didn't accept task
  --
  mpirun noticed that the job aborted, but has no info as to the process
  that caused that situation.

How can I fix this?

Many thanks



Re: [OMPI users] newbie: Submitting Open MPI jobs to SGE ( `qsh -pe orte 4` fails)

2013-02-08 Thread Reuti
Hi,

On 08.02.2013 at 19:36, Pierre LINDENBAUM wrote:

> ( cross-posted on SO: http://stackoverflow.com/questions/14775451 )
> I'm very new to Open MPI and I'm trying to submit Open MPI jobs to SGE:
> 
> 
> I've installed Open MPI, not in
>  /usr/...
> but in
>   /commun/data/packages/openmpi/
> 
> it was compiled with --with-sge.
> 
> I've added a new PE in SGE with qconf as described in
> http://docs.oracle.com/cd/E19080-01/n1.grid.eng6/817-5677/6ml49n2c0/index.html
> 
>  # /commun/data/packages/openmpi/bin/ompi_info | grep gridengine
>  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.3)
> 
>  # qconf -sq all.q | grep pe_
>  pe_list   make orte
> 
> Without SGE, the program runs without any problem, using several processors.
> 
>   /commun/data/packages/openmpi/bin/orterun -np 20 ./a.out args
> 
> Now I want to submit my program to SGE
> 
> In the Open MPI FAQ, I read:
> 
>  # Allocate a SGE interactive job with 4 slots
>  # from a parallel environment (PE) named 'orte'
>  shell$ qsh -pe orte 4
> 
> but my output is:
> 
>   qsh -pe orte 4
>   Your job 84550 ("INTERACTIVE") has been submitted
>   waiting for interactive job to be scheduled ...
>   Could not start interactive job.

An INTERACTIVE job is more like an immediate job, i.e. "-now y". Do you have
an interactive queue configured, and is the cluster empty right now?


> I've also tried the mpirun command embedded in a script:
> 
>   $ cat ompi.sh
>   #!/bin/sh
>   /commun/data/packages/openmpi/bin/mpirun  \
> /path/to/a.out args
> 
> but it fails
> 
>  $ cat ompi.sh.e84552
>  error: executing task of job 84552 failed: execution daemon on host "node02" didn't accept task

This is a good sign, as it is already trying to use `qrsh -inherit ...`. Can you
confirm the following settings?

$ qconf -sp orte
...
control_slaves TRUE

$ qconf -sq all.q
...
shell_start_mode  unix_behavior
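
A quick way to compare saved `qconf` output against the two settings above is sketched below. It only parses text that has already been captured (it does not call `qconf` itself), the sample output is hypothetical, and the helper names `parse_qconf` and `mismatches` are made up for illustration. Continuation lines ending in a backslash are not handled.

```python
def parse_qconf(text):
    """Turn "key   value" lines of qconf output into a dictionary."""
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


def mismatches(settings, expected):
    """Return {key: actual value} for every setting that differs from expected."""
    return {key: settings.get(key)
            for key, wanted in expected.items()
            if settings.get(key) != wanted}


# Hypothetical output of "qconf -sp orte", for illustration only.
SAMPLE_PE = """\
pe_name            orte
slots              4
control_slaves     FALSE
"""

# Hypothetical excerpt of "qconf -sq all.q", for illustration only.
SAMPLE_QUEUE = """\
qname              all.q
shell_start_mode   unix_behavior
"""

print(mismatches(parse_qconf(SAMPLE_PE), {"control_slaves": "TRUE"}))
print(mismatches(parse_qconf(SAMPLE_QUEUE), {"shell_start_mode": "unix_behavior"}))
```

An empty dictionary means the captured output already matches the expected value.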

-- Reuti


>   --
>  A daemon (pid 18327) died unexpectedly with status 1 while attempting
>  to launch so we are aborting.
> 
>  There may be more information reported by the environment (see above).
> 
>  This may be because the daemon was unable to find all the needed shared
>  libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>  location of the shared libraries on the remote nodes and this will
>  automatically be forwarded to the remote nodes.
>  --
>  error: executing task of job 84552 failed: execution daemon on host "node01" didn't accept task
>  --
>  mpirun noticed that the job aborted, but has no info as to the process
>  that caused that situation.
> 
> How can I fix this?
> 
> Many thanks
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] Hi, I am working on topic "Topology aware mapping of processes in intra node environment". I need to find the binding of each rank on the local machine. How do I do this? I am using OPENM

2013-02-08 Thread Kranthi Kumar
Hi,
I am working on the topic "Topology aware mapping of processes in intra node
environment". I need to find the binding of each rank on the local machine.
How do I do this? I am using Open MPI version 1.4.1.

Thank You

-- 
Kranthi


Re: [OMPI users] Hi, I am working on topic "Topology aware mapping of processes in intra node environment". I need to find the binding of each rank on the local machine. How do I do this? I am using O

2013-02-08 Thread Ralph Castain
Ummm... you might want to look at the developer's trunk, as we already do
topology-aware mapping there. It will be released soon in 1.7.0.
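
Independently of the Open MPI version, a process can ask the operating system which CPUs it is currently allowed to run on. The sketch below is one way to do that from Python and is not an Open MPI API: it assumes Linux (`os.sched_getaffinity` is not available everywhere) and assumes the launcher exports the rank in the `OMPI_COMM_WORLD_RANK` environment variable, which should be verified for the installed version. The function names are made up for illustration.

```python
import os
import socket


def current_binding():
    """Return the sorted CPU ids this process may run on, or None if unknown."""
    getaffinity = getattr(os, "sched_getaffinity", None)
    if getaffinity is None:  # platform without sched_getaffinity
        return None
    return sorted(getaffinity(0))  # 0 means "the calling process"


def describe():
    """Build a one-line report of rank, host and allowed CPUs."""
    # Assumption: the launcher sets OMPI_COMM_WORLD_RANK; fall back to "?".
    rank = os.environ.get("OMPI_COMM_WORLD_RANK", "?")
    return "rank {} on {}: cpus {}".format(
        rank, socket.gethostname(), current_binding())


if __name__ == "__main__":
    print(describe())
```

Started once per rank through mpiexec, each copy would print the CPU set it inherited from the launcher, which can then be compared with the intended mapping.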

On Feb 8, 2013, at 5:48 PM, Kranthi Kumar wrote:

> Hi,
> I am working on the topic "Topology aware mapping of processes in intra node
> environment". I need to find the binding of each rank on the local machine.
> How do I do this? I am using Open MPI version 1.4.1.
> 
> Thank You
> 
> -- 
> Kranthi