[OMPI users] Fwd: Open MPI can't launch remote nodes via SSH
-- Forwarded message --
From: 周文平
Date: 2013/12/10
Subject: Open MPI can't launch remote nodes via SSH
To: devel

I am trying to set up Open MPI between a few machines on our network. Open MPI works fine locally, but I just can't get it to work on a remote node. I can ssh into the remote machine (without a password) just fine, but if I try something like

mpiexec -n 2 -H node1,node2 /usr/local/mpihouse/MPIDemo

then the ssh connection just waits forever, with no result and no error reports. Why can't I launch the remote process?

(Attachment: process.tar.gz, GNU zip compressed data)
Re: [OMPI users] Fwd: Open MPI can't launch remote nodes via SSH
Please send all the information listed here: http://www.open-mpi.org/community/help/

On Dec 10, 2013, at 5:49 AM, 周文平 wrote:

> I am trying to set up Open MPI between a few machines on our network. Open MPI works fine locally, but I just can't get it to work on a remote node. I can ssh into the remote machine (without a password) just fine, but if I try something like
>
> mpiexec -n 2 -H node1,node2 /usr/local/mpihouse/MPIDemo
>
> then the ssh connection just waits forever, with no result and no error reports.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
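For anyone who hits the same silent hang, one rough way to make the ssh launch chatty enough to debug is sketched below. This is only an illustration, not something from this thread; the MCA parameter and mpirun option names are assumptions based on 1.6/1.7-era Open MPI.

# Run the same job with the ssh launcher made verbose, and keep the ssh
# sessions attached so errors from the remote orted daemon reach the terminal.
mpiexec --mca plm_base_verbose 5 --leave-session-attached \
        -n 2 -H node1,node2 /usr/local/mpihouse/MPIDemo

# Also confirm that a non-interactive ssh session can find the Open MPI
# binaries on the remote node.
ssh node2 which orted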
Re: [OMPI users] Prototypes for Fortran MPI_ commands using 64-bit indexing
George Bosilca writes:

>> No. The Fortran status must __always__ be 6, because we need enough room to correctly convert the 3 useful variables to Fortran, plus copy the rest of the hidden things. These 6 entries will be INTEGER (which will then be different from the C int). The C<->F conversion does not do a memcpy but copies all elements while casting to the correct Fortran type (ompi_fortran_integer_t).
>>
>> The fact that we are talking about 3 integers in the Fortran status might explain the segfault. This number should never be 3; it should ALWAYS be 6, or the function MPI_Status_c2f will clearly overwrite the memory.
>
> I did manage to try this idea on an -i8-enabled Open MPI version. The test application provided in one of the earlier emails completes successfully, without the segfault and with correct output.
>
> node= 0 and size= 2 : Hello
> node= 1 and size= 2 : Hello
> Iam = 1 and my temp value is = 1
> Iam = 0 and my temp value is = 1
>
> So the fix is trivial: make MPI_STATUS_SIZE always equal to sizeof(MPI_Status)/sizeof(int). Everything else is already taken care of by the current Fortran <-> C infrastructure.

This doesn't seem to have been fixed, and I think it's going to bite here. Is this the right change?

--- openmpi-1.6.5/ompi/config/ompi_setup_mpi_fortran.m4~	2012-04-03 15:30:24.0 +0100
+++ openmpi-1.6.5/ompi/config/ompi_setup_mpi_fortran.m4	2013-12-10 12:23:54.232854527 +
@@ -127,8 +127,8 @@
     AC_MSG_RESULT([skipped (no Fortran bindings)])
 else
     bytes=`expr 4 \* $ac_cv_sizeof_int + $ac_cv_sizeof_size_t`
-    num_integers=`expr $bytes / $OMPI_SIZEOF_FORTRAN_INTEGER`
-    sanity=`expr $num_integers \* $OMPI_SIZEOF_FORTRAN_INTEGER`
+    num_integers=`expr $bytes / $ac_cv_sizeof_int`
+    sanity=`expr $num_integers \* $ac_cv_sizeof_int`
     AS_IF([test "$sanity" != "$bytes"],
         [AC_MSG_RESULT([unknown!])
          AC_MSG_WARN([WARNING: Size of C int: $ac_cv_sizeof_int])
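To spell out the arithmetic behind that hunk, here is a small stand-alone sketch of my own (not part of the patch); it assumes a typical 64-bit platform with a 4-byte C int and an 8-byte size_t, and an -i8 build with an 8-byte Fortran INTEGER:

# C MPI_Status is laid out as 4 C ints plus one size_t (see the hunk above).
bytes=`expr 4 \* 4 + 8`              # 24 bytes on a typical LP64 platform

# Patched formula: divide by the C int size, so MPI_STATUS_SIZE stays 6 and
# MPI_Status_c2f has room to cast-copy every element.
num_integers=`expr $bytes / 4`       # 6

# Old formula under -i8: divide by the 8-byte Fortran INTEGER, giving 3,
# which is exactly the overrun George describes.
old_num_integers=`expr $bytes / 8`   # 3

echo "MPI_STATUS_SIZE patched=$num_integers old(-i8)=$old_num_integers"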
Re: [OMPI users] configure: error: Could not run a simple Fortran program. Aborting.
Thanks Jeff,

It turns out this was an issue with Homebrew (a package manager for the Mac) and not related to Open MPI. If any Homebrew users hit this issue in the future when installing Open MPI, here's what happened: there were some non-Homebrewed 32-bit gfortran libraries floating around in the lib directory Homebrew uses, which were being picked up instead of the correct Homebrewed 64-bit libraries.

Best,
Raiden

On Mon, Dec 2, 2013 at 4:02 PM, Jeff Squyres (jsquyres) wrote:
> I did notice that you have an oddity:
>
> - I see /usr/local/opt/gfortran/bin in your PATH (line 41 in config.log)
> - I see that configure is invoking /usr/local/bin/gfortran (line 7630 and elsewhere in config.log)
>
> That implies that you have 2 different gfortrans installed on your machine, one of which may be faulty, or may accidentally be referring to the libraries of the other (therefore resulting in Badness).
>
> On Dec 2, 2013, at 3:52 PM, Raiden Hasegawa wrote:
>
> > Yes, what I meant is that when running:
> >
> > /usr/local/bin/gfortran -o conftest conftest.f
> >
> > outside of configure it does work.
> >
> > I don't think I have DYLD_LIBRARY_PATH set, but I will check when I get back to my home computer.
> >
> > On Mon, Dec 2, 2013 at 3:47 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> > On Dec 2, 2013, at 3:00 PM, Raiden Hasegawa wrote:
> >
> > > Thanks, Jeff. The compiler does in fact work when running the troublesome line in ./configure.
> >
> > Errr... I'm not sure how to parse that. The config.log you cited shows that the compiler does *not* work in configure:
> >
> > -----
> > configure:29606: checking if Fortran compiler works
> > configure:29635: /usr/local/bin/gfortran -o conftest conftest.f >&5
> > Undefined symbols for architecture x86_64:
> >   "__gfortran_set_options", referenced from:
> >       _main in cccSAmNO.o
> > ld: symbol(s) not found for architecture x86_64
> > collect2: error: ld returned 1 exit status
> > configure:29635: $? = 1
> > configure: program exited with status 1
> > configure: failed program was:
> > | program main
> > |
> > | end
> > configure:29651: result: no
> > configure:29665: error: Could not run a simple Fortran program. Aborting.
> > -----
> >
> > Did you typo and mean that the compiler does work when outside of configure, and fails when it is inside configure?
> >
> > > I haven't set either FC, FCFLAGS nor do I have LD_LIBRARY_PATH set in my .bashrc. Do you have any thoughts on what environment variable may trip this up?
> >
> > Do you have DYLD_LIBRARY_PATH set?
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
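For future readers chasing the same symptom, here is a rough way to spot a duplicate or mismatched gfortran/libgfortran pair on a Mac. This is only a sketch; the library name and the Homebrew lib path are illustrative assumptions, not details taken from Raiden's machine.

# Which gfortran binaries are on the PATH, and in what order?
type -a gfortran

# Which libgfortran would the compiler link against, and is it the right
# architecture (x86_64 vs i386)?
file `gfortran -print-file-name=libgfortran.dylib`

# Look for stray non-Homebrew copies that could shadow the correct library.
ls -l /usr/local/lib/libgfortran*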
Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
Hi Ralph,

I had time to try your patch yesterday using openmpi-1.7.4a1r29646. It stopped the error, but unfortunately "mapping by socket" itself didn't work well, as shown below:

[mishima@manage demos]$ qsub -I -l nodes=1:ppn=32
qsub: waiting for job 8260.manage.cluster to start
qsub: job 8260.manage.cluster ready

[mishima@node04 ~]$ cd ~/Desktop/openmpi-1.7/demos/
[mishima@node04 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
[node04.cluster:27489] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
[node04.cluster:27489] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
[node04.cluster:27489] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
[node04.cluster:27489] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
[node04.cluster:27489] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
[node04.cluster:27489] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
[node04.cluster:27489] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
[node04.cluster:27489] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
Hello world from process 2 of 8
Hello world from process 1 of 8
Hello world from process 3 of 8
Hello world from process 0 of 8
Hello world from process 6 of 8
Hello world from process 5 of 8
Hello world from process 4 of 8
Hello world from process 7 of 8

I think this should be like this:

rank 00 [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
rank 01 [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
rank 02 [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
...

Regards,
Tetsuya Mishima

> I fixed this under the trunk (was an issue regardless of RM) and have scheduled it for 1.7.4.
>
> Thanks!
> Ralph
>
> On Nov 25, 2013, at 4:22 PM, tmish...@jcity.maeda.co.jp wrote:
>
> > Hi Ralph,
> >
> > Thank you very much for your quick response.
> >
> > I'm afraid to say that I found one more issue...
> >
> > It's not so serious. Please check it when you have a lot of time.
> >
> > The problem is cpus-per-proc with the -map-by option under the Torque manager.
> > It doesn't work as shown below. I guess you can get the same behaviour under the Slurm manager.
> >
> > Of course, if I remove the -map-by option, it works quite well.
> > > > [mishima@manage testbed2]$ qsub -I -l nodes=1:ppn=32 > > qsub: waiting for job 8116.manage.cluster to start > > qsub: job 8116.manage.cluster ready > > > > [mishima@node03 ~]$ cd ~/Ducom/testbed2 > > [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 > > -map-by socket mPre > > -- > > A request was made to bind to that would result in binding more > > processes than cpus on a resource: > > > > Bind to: CORE > > Node:node03 > > #processes: 2 > > #cpus: 1 > > > > You can override this protection by adding the "overload-allowed" > > option to your binding directive. > > -- > > > > > > [mishima@node03 testbed2]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 > > mPre > > [node03.cluster:18128] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket > > 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], s > > ocket 1[core 11[hwt 0]]: > > [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.] > > [node03.cluster:18128] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket > > 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], > > socket 1[core 15[hwt 0]]: > > [./././././././.][././././B/B/B/B][./././././././.][./././././././.] > > [node03.cluster:18128] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket > > 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], > > socket 2[core 19[hwt 0]]: > > [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.] > > [node03.cluster:18128] MCW r
Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
No, that is actually correct. We map a socket until it is full, then move to the next. What you want is --map-by socket:span. (A short sketch of the difference follows this message.)

On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph,
>
> I had time to try your patch yesterday using openmpi-1.7.4a1r29646.
> It stopped the error, but unfortunately "mapping by socket" itself didn't work well, as shown below:
>
> [mishima@node04 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
> [...]
>
> I think this should be like this:
>
> rank 00 [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> rank 01 [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
> rank 02 [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
> ...
>
> Regards,
> Tetsuya Mishima
>
> [...]
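To make the distinction concrete, here is a rough sketch (my own example, not part of the exchange); it assumes the 4-socket, 8-cores-per-socket node visible in the binding reports above, and matches the -cpus-per-proc 2 runs Tetsuya posts later in the thread:

# Default "socket" mapping: fill socket 0 completely before moving on, so with
# 2 cores per rank the first four ranks all land on socket 0.
mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket      myprog

# "socket:span" mapping: round-robin over all sockets of the node, so the
# eight ranks are spread two per socket across the four sockets.
mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket:span myprog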
Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
Hi Ralph,

Thanks. I didn't know the meaning of "socket:span". But it still causes the problem; it seems socket:span doesn't work.

[mishima@manage demos]$ qsub -I -l nodes=node03:ppn=32
qsub: waiting for job 8265.manage.cluster to start
qsub: job 8265.manage.cluster ready

[mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
[mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
[node03.cluster:10262] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
[node03.cluster:10262] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
[node03.cluster:10262] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
[node03.cluster:10262] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
[node03.cluster:10262] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
[node03.cluster:10262] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
[node03.cluster:10262] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
[node03.cluster:10262] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
Hello world from process 0 of 8
Hello world from process 3 of 8
Hello world from process 1 of 8
Hello world from process 4 of 8
Hello world from process 6 of 8
Hello world from process 5 of 8
Hello world from process 2 of 8
Hello world from process 7 of 8

Regards,
Tetsuya Mishima

> No, that is actually correct. We map a socket until full, then move to the next. What you want is --map-by socket:span
>
> On Dec 10, 2013, at 3:42 PM, tmish...@jcity.maeda.co.jp wrote:
>
> > Hi Ralph,
> >
> > I had time to try your patch yesterday using openmpi-1.7.4a1r29646.
> > It stopped the error, but unfortunately "mapping by socket" itself didn't work well, as shown below:
> >
> > [...]
Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
Hmmm... that's strange. I only have 2 sockets on my system, but let me poke around a bit and see what might be happening.

On Dec 10, 2013, at 4:47 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph,
>
> Thanks. I didn't know the meaning of "socket:span".
>
> But it still causes the problem; it seems socket:span doesn't work.
>
> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
> [...]
Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
Hi Ralph,

I tried again with -cpus-per-proc 2 as shown below. Here, I found that "-map-by socket:span" worked well.

[mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket:span myprog
[node03.cluster:10879] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.][./././././././.][./././././././.]
[node03.cluster:10879] MCW rank 3 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.][./././././././.][./././././././.]
[node03.cluster:10879] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]]: [./././././././.][./././././././.][B/B/./././././.][./././././././.]
[node03.cluster:10879] MCW rank 5 bound to socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][././B/B/./././.][./././././././.]
[node03.cluster:10879] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/./././././.]
[node03.cluster:10879] MCW rank 7 bound to socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][././B/B/./././.]
[node03.cluster:10879] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.]
[node03.cluster:10879] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.]
Hello world from process 1 of 8
Hello world from process 0 of 8
Hello world from process 4 of 8
Hello world from process 2 of 8
Hello world from process 7 of 8
Hello world from process 6 of 8
Hello world from process 5 of 8
Hello world from process 3 of 8

[mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 2 -map-by socket myprog
[node03.cluster:10921] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./.][./././././././.][./././././././.][./././././././.]
[node03.cluster:10921] MCW rank 3 bound to socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././././B/B][./././././././.][./././././././.][./././././././.]
[node03.cluster:10921] MCW rank 4 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]]: [./././././././.][B/B/./././././.][./././././././.][./././././././.]
[node03.cluster:10921] MCW rank 5 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][././B/B/./././.][./././././././.][./././././././.]
[node03.cluster:10921] MCW rank 6 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]]: [./././././././.][././././B/B/./.][./././././././.][./././././././.]
[node03.cluster:10921] MCW rank 7 bound to socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././././B/B][./././././././.][./././././././.]
[node03.cluster:10921] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././.][./././././././.][./././././././.][./././././././.]
[node03.cluster:10921] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././.][./././././././.][./././././././.][./././././././.]
Hello world from process 5 of 8
Hello world from process 1 of 8
Hello world from process 6 of 8
Hello world from process 4 of 8
Hello world from process 2 of 8
Hello world from process 0 of 8
Hello world from process 7 of 8
Hello world from process 3 of 8

"-np 8" and "-cpus-per-proc 4" just filled all sockets.
In this case, I guess "-map-by socket:span" and "-map-by socket" have the same meaning. Therefore, there's no problem about that. Sorry for disturbing you.

By the way, through this test I found another problem. Without the Torque manager, just using rsh, it causes the same error as below:

[mishima@manage openmpi-1.7]$ rsh node03
Last login: Wed Dec 11 09:42:02 from manage
[mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
[mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog
--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:node03
   #processes: 2
   #cpus: 1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--
[mishima@node03 demos]$
[mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 myprog
[node03.cluster:11036] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
[node03.cluster:11036] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
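For what it's worth, the arithmetic behind that observation, spelled out as a small sketch of my own (the 4-socket, 8-cores-per-socket layout is read off the binding maps above):

expr 8 \* 4   # 32 cores requested = every core on the 4 x 8-core node,
              # so "socket" and "socket:span" give identical bindings
expr 8 \* 2   # 16 cores = half the node, so the two policies now differ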
Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
On Dec 10, 2013, at 6:05 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph,
>
> I tried again with -cpus-per-proc 2 as shown below.
> Here, I found that "-map-by socket:span" worked well.
>
> [...]
>
> "-np 8" and "-cpus-per-proc 4" just filled all sockets.
> In this case, I guess "-map-by socket:span" and "-map-by socket" have the same meaning.
> Therefore, there's no problem about that. Sorry for disturbing you.

No problem - glad you could clear that up :-)

> By the way, through this test I found another problem.
> Without the Torque manager, just using rsh, it causes the same error as below:
>
> [mishima@manage openmpi-1.7]$ rsh node03
> Last login: Wed Dec 11 09:42:02 from manage
> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> [mishima@node03 demos]$ mpirun -np 8 -report-bindings -cpus-per-proc 4 -map-by socket myprog

I don't understand the difference here - you are simply starting it from a different node? It looks like everything is expected to run local to mpirun, yes? So there is no rsh actually involved here. Are you still running in an allocation?

If you run this with "-host node03" on the cmd line, do you see the same problem?
Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
Hi Ralph, sorry for the confusion.

We usually log on to "manage", which is our control node. From manage, we submit jobs or enter a remote node such as node03 through Torque's interactive mode (qsub -I).

This time, instead of Torque, I just did rsh to node03 from manage and ran myprog on that node. I hope that makes clear what I did.

Now, I retried with "-host node03", which still causes the problem (I confirmed that a local run on manage caused the same problem too):

[mishima@manage ~]$ rsh node03
Last login: Wed Dec 11 11:38:57 from manage
[mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/
[mishima@node03 demos]$
[mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket myprog
--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:node03
   #processes: 2
   #cpus: 1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

It's strange, but I have to report that "-map-by socket:span" worked well:

[mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
[node03.cluster:11871] MCW rank 2 bound to socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././.][B/B/B/B/./././.][./././././././.][./././././././.]
[node03.cluster:11871] MCW rank 3 bound to socket 1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 1[core 15[hwt 0]]: [./././././././.][././././B/B/B/B][./././././././.][./././././././.]
[node03.cluster:11871] MCW rank 4 bound to socket 2[core 16[hwt 0]], socket 2[core 17[hwt 0]], socket 2[core 18[hwt 0]], socket 2[core 19[hwt 0]]: [./././././././.][./././././././.][B/B/B/B/./././.][./././././././.]
[node03.cluster:11871] MCW rank 5 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]], socket 2[core 22[hwt 0]], socket 2[core 23[hwt 0]]: [./././././././.][./././././././.][././././B/B/B/B][./././././././.]
[node03.cluster:11871] MCW rank 6 bound to socket 3[core 24[hwt 0]], socket 3[core 25[hwt 0]], socket 3[core 26[hwt 0]], socket 3[core 27[hwt 0]]: [./././././././.][./././././././.][./././././././.][B/B/B/B/./././.]
[node03.cluster:11871] MCW rank 7 bound to socket 3[core 28[hwt 0]], socket 3[core 29[hwt 0]], socket 3[core 30[hwt 0]], socket 3[core 31[hwt 0]]: [./././././././.][./././././././.][./././././././.][././././B/B/B/B]
[node03.cluster:11871] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
[node03.cluster:11871] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 0[core 7[hwt 0]]: [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
Hello world from process 2 of 8
Hello world from process 6 of 8
Hello world from process 3 of 8
Hello world from process 7 of 8
Hello world from process 1 of 8
Hello world from process 5 of 8
Hello world from process 0 of 8
Hello world from process 4 of 8

Regards,
Tetsuya Mishima

> On Dec 10, 2013, at 6:05 PM, tmish...@jcity.maeda.co.jp wrote:
>
> > Hi Ralph,
> >
> > I tried again with -cpus-per-proc 2 as shown below.
> > Here, I found that "-map-by socket:span" worked well.
> >
> > [...]
Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager
Hmmm... okay, I understand the scenario. It must be something in the algorithm when it only has one node, so it shouldn't be too hard to track down. I'm off on travel for a few days, but will return to this when I get back. Sorry for the delay - will try to look at this while I'm gone, but can't promise anything :-(

On Dec 10, 2013, at 6:58 PM, tmish...@jcity.maeda.co.jp wrote:

> Hi Ralph, sorry for the confusion.
>
> We usually log on to "manage", which is our control node. From manage, we submit jobs or enter a remote node such as node03 through Torque's interactive mode (qsub -I).
>
> This time, instead of Torque, I just did rsh to node03 from manage and ran myprog on that node. I hope that makes clear what I did.
>
> Now, I retried with "-host node03", which still causes the problem (I confirmed that a local run on manage caused the same problem too):
>
> [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket myprog
> [...]
>
> It's strange, but I have to report that "-map-by socket:span" worked well:
>
> [mishima@node03 demos]$ mpirun -np 8 -host node03 -report-bindings -cpus-per-proc 4 -map-by socket:span myprog
> [...]
>
> Regards,
> Tetsuya Mishima