Re: [OMPI users] Questions regarding MPI intercommunicators & collectives

2015-02-24 Thread George Bosilca

> On Feb 23, 2015, at 10:20, Harald Servat wrote:
> 
> Hello list,
> 
>  we have several questions regarding calls to collectives using 
> intercommunicators. In man for MPI_Bcast, there is a notice for the 
> inter-communicator case that reads the text below our questions.
> 
>  If I is an intercommunicator for communicators

Not for communicators but for groups (a slight technical difference).

> C1={p1,p2,p3} and C2={p4,p5,p6}, and a process p3 (from C1) wants to 
> broadcast a message to C2. Is it mandatory that p1 and p2 have to call 
> MPI_Bcast? Or can the user skip adding these calls?

The MPI_Bcast is collective over all the processes involved in the 
communicator. As you pointed out in the text you cited, this is indeed required 
by the MPI standard (MPI standard 3.0, page 148, starting from line 43). To be 
clear, this strictly means that all processes (no exceptions) must call 
MPI_Bcast.

> 
>  BTW, what is the behavior for the broadcast for p1 and p2 in this case, 
> simply return?

It is implementation dependent. 

> 
>  Will MPI fail if MPI_PROC_NULL is not given for the parameter root in p1 and 
> p2?

In the best case it will fail. However, since figuring out that multiple roots 
exist in the same group requires communication, I would guess that most MPI 
implementations will simply exhibit some random, unexpected behavior.
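
To make the calling convention concrete, here is a minimal sketch (illustrative 
only, not from the original thread; the names and the way the groups are split 
are arbitrary). Every process of both groups calls MPI_Bcast: the root passes 
MPI_ROOT, the other processes of the root's group pass MPI_PROC_NULL, and the 
remote group passes the root's rank as numbered in the first group.

/* Inter-communicator broadcast sketch: run with at least 6 processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int wrank, wsize;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);

    /* First group: world ranks 0..wsize/2-1; second group: the rest. */
    int in_first_group = (wrank < wsize / 2);
    MPI_Comm local_comm;
    MPI_Comm_split(MPI_COMM_WORLD, in_first_group, wrank, &local_comm);

    /* Create the inter-communicator; the remote leader is the world rank
     * of rank 0 of the other group. */
    int remote_leader = in_first_group ? wsize / 2 : 0;
    MPI_Comm intercomm;
    MPI_Intercomm_create(local_comm, 0, MPI_COMM_WORLD, remote_leader, 0,
                         &intercomm);

    int buf = -1;
    const int root_rank = 2;            /* "p3" of the first group */
    int lrank;
    MPI_Comm_rank(local_comm, &lrank);

    if (in_first_group) {
        if (lrank == root_rank) {
            buf = 42;                   /* the actual root uses MPI_ROOT */
            MPI_Bcast(&buf, 1, MPI_INT, MPI_ROOT, intercomm);
        } else {
            /* every other process of the root's group still calls MPI_Bcast,
             * with MPI_PROC_NULL as the root argument */
            MPI_Bcast(&buf, 1, MPI_INT, MPI_PROC_NULL, intercomm);
        }
    } else {
        /* receivers name the root by its rank in the first group */
        MPI_Bcast(&buf, 1, MPI_INT, root_rank, intercomm);
        printf("world rank %d received %d\n", wrank, buf);
    }

    MPI_Comm_free(&intercomm);
    MPI_Comm_free(&local_comm);
    MPI_Finalize();
    return 0;
}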

  George.


> 
> Thank you very much in advance.
> 
> 
> 
> ** When Communicator is an Inter-communicator
> 
> When the communicator is an inter-communicator, the root process in the first 
> group broadcasts data to all the processes in the second group. The first 
> group defines the root process. That process uses MPI_ROOT as the value of 
> its root argument. The remaining processes use MPI_PROC_NULL as the value of 
> their root argument. All processes in the second group use the rank of that 
> root process in the first group as the value of their root argument. The 
> receive buffer arguments of the processes in the second group must be 
> consistent with the send buffer argument of the root process in the first 
> group.
> 



[OMPI users] Why are the static libs different if compiled with or without dynamic switch?

2015-02-24 Thread twurgl

I am setting up Open MPI 1.8.4.  The first time I compiled, I had the following:

version=1.8.4.I1404211913
./configure \
--disable-vt \
--prefix=/apps/share/openmpi/$version \
--disable-shared \
--enable-static \
--with-verbs \
--enable-mpirun-prefix-by-default \
--with-memory-manager=none \
--with-hwloc \
--with-lsf=/apps/share/LSF/9.1.3/9.1 \
--with-lsf-libdir=/apps/share/LSF/9.1.3/9.1/linux2.6-glibc2.3-x86_64/lib \
--with-wrapper-cflags="-shared-intel" \
--with-wrapper-cxxflags="-shared-intel" \
--with-wrapper-ldflags="-shared-intel" \
--with-wrapper-fcflags="-shared-intel" \
--enable-mpi-ext

And when installed I get (as a sample): 

  -rw-r--r-- 1 tommy 460g3 6881702 Feb 19 14:58 libmpi.a

Now the second time I installed, I used the same configure line as above, but
this time I took out the "--disable-shared" option.

and again, as a sample 

  -rw-r--r-- 1 tommy 460g3 6641598 Feb 24 13:53 libmpi.a

Can someone tell me why the static libs are different sizes depending on
whether or not the dynamic ones are compiled?  It seems to me that the static
ones should be identical.  Is this an issue?

thanks for any info


[OMPI users] MPIIO and OrangeFS

2015-02-24 Thread vithanousek
Hello,

I'm not sure if I have my OrangeFS (2.8.8) and Open MPI (1.8.4) set up correctly. 
One short question:

Is it necessary to have OrangeFS mounted through the kernel module if I want to 
use MPI-IO?
My simple MPI-IO hello world program doesn't work if I haven't mounted OrangeFS; 
when I mount OrangeFS, it works. So I'm not sure whether OMPIO (or ROMIO) is 
using the pvfs2 servers directly or going through the kernel module.

Sorry for the stupid question, but I didn't find any documentation about it.

Thanks for any replies,
Hanousek Vít 


[OMPI users] machinefile binding error

2015-02-24 Thread Galloway, Jack D
I recently upgraded my CentOS kernel and am running 2.6.32-504.8.1.el6.x86_64; 
as part of this upgrade I also decided to upgrade my Intel/Open MPI 
installation.

I upgraded from:

intel version 13.1.2, with openmpi 1.6.5
to:
intel 15.0.2, with openmpi 1.8.4

Previously a command of "mpirun -np NP -machinefile MACH executable" would 
return expected results, particularly in how the machinefile was mapped to mpi 
tasks.  However, now when I try to run a code (which worked in the 13.1.2/1.6.5 
paradigm) things behave anomalously.

For instance, if I have machine file ("mach_burn_24s") that consists of:
tebow
tebow121 slots=24
tebow122 slots=24
tebow123 slots=24
tebow124 slots=24
tebow125 slots=24
tebow126 slots=24
tebow127 slots=24
tebow128 slots=24
tebow129 slots=24
tebow130 slots=24
tebow131 slots=24
tebow132 slots=24
tebow133 slots=24
tebow134 slots=24
tebow135 slots=24

Before, the allocation would follow as expected (i.e. -np 25 with the 
machinefile above would give 1 task on tebow and 24 on tebow121), and if I 
assigned 361 the entire machinefile would be filled up.

However, that's no longer the case.  If I type "mpirun -np 24 -machinefile 
burn_machs/mach_burn_24s hostname", I get the following result:
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow121
tebow
tebow121
tebow121
tebow121
tebow121
tebow121
tebow121
tebow121

Now, there are 16 cores on "tebow", but I only requested one task there in the 
machinefile (or so I assumed).  Furthermore, if I request 361 I get the 
following catastrophic error:

--
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node:  tebow124

This usually is due to not having the required NUMA support installed
on the node. In some Linux distributions, the required support is
contained in the libnumactl and libnumactl-devel packages.
This is a warning only; your job will continue, though performance may be 
degraded.
--
--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: NONE
   Node:tebow125
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

All the compute nodes (tebow121-135) have 24+ cores on them.  I believe some 
configuration change has occurred which has duped the system into trying to go 
off the reported number of cores, but even then it seems to be getting things 
wrong (i.e. not pulling the right number of cores).

The config line used previously (which worked without issue, respecting the 
machinefile specification) was:
  $ ./configure --prefix=/opt/openmpi/openmpi-1.6.5 --with-openib 
--with-openib-libdir=/usr/lib64 CC=icc F77=ifort FC=ifort CXX=icpc

The config line which I now use is:
./configure --prefix=/opt/openmpi/openmpi-1.8.4 --with-verbs 
--with-verbs-libdir=/usr/lib64 CC=icc F77=ifort FC=ifort CXX=icpc

I'm at a loss where to look for the solution, any help is appreciated.

--Jack



Re: [OMPI users] Why are the static libs different if compiled with or without dynamic switch?

2015-02-24 Thread Jeff Squyres (jsquyres)
The --disable-dlopen option actually snips out some code from the Open MPI code 
base: it disables a feature (and the code that goes along with it).

Hence, it makes sense that the resulting library would be a different size: 
there's actually less code compiled in it.


> On Feb 24, 2015, at 2:45 PM, twu...@goodyear.com wrote:
> 
> 
> I am setting up Openmpi 1.8.4.  The first time I compiled, I had the 
> following:
> 
> version=1.8.4.I1404211913
> ./configure \
>--disable-vt \
>--prefix=/apps/share/openmpi/$version \
>--disable-shared \
>--enable-static \
>--with-verbs \
>--enable-mpirun-prefix-by-default \
>--with-memory-manager=none \
>--with-hwloc \
>--with-lsf=/apps/share/LSF/9.1.3/9.1 \
>--with-lsf-libdir=/apps/share/LSF/9.1.3/9.1/linux2.6-glibc2.3-x86_64/lib \
>--with-wrapper-cflags="-shared-intel" \
>--with-wrapper-cxxflags="-shared-intel" \
>--with-wrapper-ldflags="-shared-intel" \
>--with-wrapper-fcflags="-shared-intel" \
>--enable-mpi-ext
> 
> And when installed I get (as a sample): 
> 
>  -rw-r--r-- 1 tommy 460g3 6881702 Feb 19 14:58 libmpi.a
> 
> Now the second time I install, I had the same as above for the configure, but
> this time I took out the "--disable-shared" option.
> 
> and again, as a sample 
> 
>  -rw-r--r-- 1 tommy 460g3 6641598 Feb 24 13:53 libmpi.a
> 
> Can someone tell me why the static libs are different (sizes) when compiling 
> or
> not compiling the dynamic ones?  Seems to me that static ones should be
> identical.  Is this an issue?
> 
> thanks for any info


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] machinefile binding error

2015-02-24 Thread Ralph Castain
Ah, now that’s a “feature” :-)

Seriously, it *is* actually a new feature of the 1.8 series. We now go out and 
actually sense the number of cores on the system and set the number of slots to 
that value unless you tell us otherwise. It was something people continually 
nagged us about, and so we made the change.

In your case, you could just put slots=1 on the first line of your machine file 
and we should respect it.
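
For example, the machine file from your message would then start with (only the 
first line changes; the remaining lines stay as they are):

tebow slots=1
tebow121 slots=24
...
tebow135 slots=24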


> On Feb 24, 2015, at 12:49 PM, Galloway, Jack D  wrote:
> 
> I recently upgraded my CentOS kernel and am running 
> 2.6.32-504.8.1.el6.x86_64, in this upgrade I also decided to upgrade my 
> intel/openmpi codes.
>  
> I upgraded from:
>  
> intel version 13.1.2, with openmpi 1.6.5
> intel 15.0.2, with openmpi 1.8.4
>  
> Previously a command of “mpirun –np NP –machinefile MACH executable” would 
> return expected results, particularly in how the machinefile was mapped to 
> mpi tasks.  However, now when I try to run a code (which worked in the 
> 13.1.2/1.6.5 paradigm) things behave anomalously.  
>  
> For instance, if I have machine file (“mach_burn_24s”) that consists of:
> tebow
> tebow121 slots=24
> tebow122 slots=24
> tebow123 slots=24
> tebow124 slots=24
> tebow125 slots=24
> tebow126 slots=24
> tebow127 slots=24
> tebow128 slots=24
> tebow129 slots=24
> tebow130 slots=24
> tebow131 slots=24
> tebow132 slots=24
> tebow133 slots=24
> tebow134 slots=24
> tebow135 slots=24
>  
> Before the allocation would follow as expected (i.e. –np 25 –machinefile 
> above) would give 1 task on tebow, and 24 on tebow121, and if I assigned 361 
> the entire machinefile would be filled up.
>  
> However now it’s not the case.  If I type “mpirun -np 24 -machinefile 
> burn_machs/mach_burn_24s hostname”, I get the following result:
> tebow
> tebow
> tebow
> tebow
> tebow
> tebow
> tebow
> tebow
> tebow
> tebow
> tebow
> tebow
> tebow
> tebow
> tebow
> tebow121
> tebow
> tebow121
> tebow121
> tebow121
> tebow121
> tebow121
> tebow121
> tebow121
>  
> Now there are 16 cores on “tebow”, but I only requested one task in the 
> machinefile (so I assume).  And furthermore if I request 361 I get the 
> following catastrophic error:
>  
> --
> WARNING: a request was made to bind a process. While the system
> supports binding the process itself, at least one node does NOT
> support binding memory to the process location.
>  
>   Node:  tebow124
>  
> This usually is due to not having the required NUMA support installed
> on the node. In some Linux distributions, the required support is
> contained in the libnumactl and libnumactl-devel packages.
> This is a warning only; your job will continue, though performance may be 
> degraded.
> --
> --
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>  
>Bind to: NONE
>Node:tebow125
>#processes:  2
>#cpus:   1
>  
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --
>  
> All the compute nodes (tebow121-135) have 24+ cores on them.  I believe some 
> configuration change has occurred which has duped the system into trying to 
> go off the reported number of cores, but even then it seems to be getting 
> things wrong (i.e. not pulling the right number of cores).  
>  
> The config line use previously (which worked without issue according to the 
> machinefile specification) was:
>   $ ./configure --prefix=/opt/openmpi/openmpi-1.6.5 --with-openib 
> --with-openib-libdir=/usr/lib64 CC=icc F77=ifort FC=ifort CXX=icpc
>  
> The config line which I now use is:
> ./configure --prefix=/opt/openmpi/openmpi-1.8.4 --with-verbs 
> --with-verbs-libdir=/usr/lib64 CC=icc F77=ifort FC=ifort CXX=icpc
>  
> I’m at a loss where to look for the solution, any help is appreciated.
>  
> --Jack
>  


Re: [OMPI users] MPIIO and OrangeFS

2015-02-24 Thread Rob Latham



On 02/24/2015 02:00 PM, vithanousek wrote:

Hello,

Im not sure if I have my OrangeFS (2.8.8) and OpenMPI (1.8.4) set up corectly. 
One short questin?

Is it needed to have OrangeFS  mounted  through kernel module, if I want use 
MPIIO?


nope!


My simple MPIIO hello world program doesnt work, If i havent mounted OrangeFS. 
When I mount OrangeFS, it works. So I'm not sure if OMPIO (or ROMIO) is using 
pvfs2 servers directly or if it is using kernel module.

Sorry for stupid question, but I didnt find any documentation about it.


http://www.pvfs.org/cvs/pvfs-2-8-branch-docs/doc/pvfs2-quickstart/pvfs2-quickstart.php#sec:romio

It sounds like you have not configured your MPI implementation with 
PVFS2 support (OrangeFS is a re-branding of PVFS2, but as far as MPI-IO 
is concerned, they are the same).


Open MPI passes flags to ROMIO like this at configure time:

 --with-io-romio-flags="--with-file-system=pvfs2+ufs+nfs"

I'm not sure how OMPIO takes flags.

If pvfs2-ping and pvfs2-cp and pvfs2-ls work, then you can bypass the 
kernel.


also, please check return codes:

http://stackoverflow.com/questions/22859269/what-do-mpi-io-error-codes-mean/26373193#26373193
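
As a minimal sketch of what that looks like (the file name is illustrative, and 
the "pvfs2:" prefix, which asks ROMIO to use its PVFS2/OrangeFS driver directly 
without any kernel mount, should be adjusted to your installation):

/* Open a file with MPI-IO and actually inspect the return code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_File fh;
    int rc = MPI_File_open(MPI_COMM_WORLD, "pvfs2:/pvfs2-storage/testfile",
                           MPI_MODE_CREATE | MPI_MODE_WRONLY,
                           MPI_INFO_NULL, &fh);
    if (rc != MPI_SUCCESS) {
        /* File operations return error codes by default (MPI_ERRORS_RETURN),
         * so the code can be translated into a readable message. */
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_File_open failed: %s\n", msg);
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}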

==rob



Thanks for replays
Hanousek Vít



--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


Re: [OMPI users] Why are the static libs different if compiled with or without dynamic switch?

2015-02-24 Thread Tom Wurgler
Did you mean --disable-shared instead of --disable-dlopen?

And I am still confused.  With "--disable-shared" I get a bigger static library 
than without it?

thanks
  


From: users  on behalf of Jeff Squyres (jsquyres) 

Sent: Tuesday, February 24, 2015 3:56 PM
To: Open MPI User's List
Subject: Re: [OMPI users] Why are the static libs different if compiled with or 
without dynamic switch?

The --disable-dlopen option actually snips out some code from the Open MPI code 
base: it disables a feature (and the code that goes along with it).

Hence, it makes sense that the resulting library would be a different size: 
there's actually less code compiled in it.


> On Feb 24, 2015, at 2:45 PM, twu...@goodyear.com wrote:
>
>
> I am setting up Openmpi 1.8.4.  The first time I compiled, I had the 
> following:
>
> version=1.8.4.I1404211913
> ./configure \
>--disable-vt \
>--prefix=/apps/share/openmpi/$version \
>--disable-shared \
>--enable-static \
>--with-verbs \
>--enable-mpirun-prefix-by-default \
>--with-memory-manager=none \
>--with-hwloc \
>--with-lsf=/apps/share/LSF/9.1.3/9.1 \
>--with-lsf-libdir=/apps/share/LSF/9.1.3/9.1/linux2.6-glibc2.3-x86_64/lib \
>--with-wrapper-cflags="-shared-intel" \
>--with-wrapper-cxxflags="-shared-intel" \
>--with-wrapper-ldflags="-shared-intel" \
>--with-wrapper-fcflags="-shared-intel" \
>--enable-mpi-ext
>
> And when installed I get (as a sample):
>
>  -rw-r--r-- 1 tommy 460g3 6881702 Feb 19 14:58 libmpi.a
>
> Now the second time I install, I had the same as above for the configure, but
> this time I took out the "--disable-shared" option.
>
> and again, as a sample
>
>  -rw-r--r-- 1 tommy 460g3 6641598 Feb 24 13:53 libmpi.a
>
> Can someone tell me why the static libs are different (sizes) when compiling 
> or
> not compiling the dynamic ones?  Seems to me that static ones should be
> identical.  Is this an issue?
>
> thanks for any info


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] machinefile binding error

2015-02-24 Thread Galloway, Jack D
Thank you sir, that fixed the first problem, hopefully the second is as easy!

I still get the second error when trying to farm out on a “large” number of 
processors:

machine file (“mach_burn_24s”):
tebow
tebow121 slots=24
tebow122 slots=24
tebow123 slots=24
tebow124 slots=24
tebow125 slots=24
tebow126 slots=24
tebow127 slots=24
tebow128 slots=24
tebow129 slots=24
tebow130 slots=24
tebow131 slots=24
tebow132 slots=24
tebow133 slots=24
tebow134 slots=24
tebow135 slots=24

mpirun -np 361 -machinefile mach_burn_24s hostname

--
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node:  tebow124

This usually is due to not having the required NUMA support installed
on the node. In some Linux distributions, the required support is
contained in the libnumactl and libnumactl-devel packages.
This is a warning only; your job will continue, though performance may be 
degraded.
--
--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: NONE
   Node:tebow125
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

All the compute nodes (tebow121-135) have 24+ cores on them.

Any ideas?  Thanks!

--Jack


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, February 24, 2015 1:57 PM
To: Open MPI Users
Subject: Re: [OMPI users] machinefile binding error

Ah, now that’s a “feature” :-)

Seriously, it *is* actually a new feature of the 1.8 series. We now go out and 
actually sense the number of cores on the system and set the number of slots to 
that value unless you tell us otherwise. It was something people continually 
nagged us about, and so we made the change.

In your case, you could just put slots=1 on the first line of your machine file 
and we should respect it.


On Feb 24, 2015, at 12:49 PM, Galloway, Jack D <ja...@lanl.gov> wrote:

I recently upgraded my CentOS kernel and am running 2.6.32-504.8.1.el6.x86_64, 
in this upgrade I also decided to upgrade my intel/openmpi codes.

I upgraded from:

intel version 13.1.2, with openmpi 1.6.5
intel 15.0.2, with openmpi 1.8.4

Previously a command of “mpirun –np NP –machinefile MACH executable” would 
return expected results, particularly in how the machinefile was mapped to mpi 
tasks.  However, now when I try to run a code (which worked in the 13.1.2/1.6.5 
paradigm) things behave anomalously.

For instance, if I have machine file (“mach_burn_24s”) that consists of:
tebow
tebow121 slots=24
tebow122 slots=24
tebow123 slots=24
tebow124 slots=24
tebow125 slots=24
tebow126 slots=24
tebow127 slots=24
tebow128 slots=24
tebow129 slots=24
tebow130 slots=24
tebow131 slots=24
tebow132 slots=24
tebow133 slots=24
tebow134 slots=24
tebow135 slots=24

Before the allocation would follow as expected (i.e. –np 25 –machinefile above) 
would give 1 task on tebow, and 24 on tebow121, and if I assigned 361 the 
entire machinefile would be filled up.

However now it’s not the case.  If I type “mpirun -np 24 -machinefile 
burn_machs/mach_burn_24s hostname”, I get the following result:
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow
tebow121
tebow
tebow121
tebow121
tebow121
tebow121
tebow121
tebow121
tebow121

Now there are 16 cores on “tebow”, but I only requested one task in the 
machinefile (so I assume).  And furthermore if I request 361 I get the 
following catastrophic error:

--
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node:  tebow124

This usually is due to not having the required NUMA support installed
on the node. In some Linux distributions, the required support is
contained in the libnumactl and libnumactl-devel packages.
This is a warning only; your job will continue, though performance may be 
degraded.
--
--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: NONE
   Node:tebow125
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

All the compute nodes (tebow121-135) have 24+ cores on them.

Re: [OMPI users] Why are the static libs different if compiled with or without dynamic switch?

2015-02-24 Thread Jeff Squyres (jsquyres)
On Feb 24, 2015, at 4:09 PM, Tom Wurgler  wrote:
> 
> Did you mean --disable-shared instead of --disable-dlopen?

Ah, sorry -- my eyes read one thing, and my brain read another.  :-)

> And I am still confused.  With "--disable-shared" I get a bigger static 
> library than without it?

I see that Libtool chooses to give slightly different command line arguments to 
the linker when we build --disable-shared vs. --enable-shared.  I assume 
there's some slight mojo difference in there; I wouldn't worry about it.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] machinefile binding error

2015-02-24 Thread Galloway, Jack D
I think the error may be due to a new architecture change (brought on perhaps 
by the intel compilers?).  Bad wording here, but I’m really stumbling.  As I 
add processors to the mpirun hostname call, at ~100 processors I get the 
following error, which may be informative to more seasoned eyes.  Additionally, 
I’ve attached the config.log in case something stands out, grepping on 
“catastrophic error” gives not too many results, but I don’t know if the error 
may be there or more subtle.

--
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node:  tebow124

This usually is due to not having the required NUMA support installed
on the node. In some Linux distributions, the required support is
contained in the libnumactl and libnumactl-devel packages.
This is a warning only; your job will continue, though performance may be 
degraded.
--
tebow
--
Open MPI tried to bind a new process, but something went wrong.  The
process was killed without launching the target application.  Your job
will now abort.

  Local host:tebow125
  Application name:  /bin/hostname
  Error message: hwloc_set_cpubind returned "Error" for bitmap "8,24"
  Location:  odls_default_module.c:551
--

Thanks,
--Jack




From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Galloway, Jack D
Sent: Tuesday, February 24, 2015 2:31 PM
To: Open MPI Users
Subject: Re: [OMPI users] machinefile binding error

Thank you sir, that fixed the first problem, hopefully the second is as easy!

I still get the second error when trying to farm out on a “large” number of 
processors:

machine file (“mach_burn_24s”):
tebow
tebow121 slots=24
tebow122 slots=24
tebow123 slots=24
tebow124 slots=24
tebow125 slots=24
tebow126 slots=24
tebow127 slots=24
tebow128 slots=24
tebow129 slots=24
tebow130 slots=24
tebow131 slots=24
tebow132 slots=24
tebow133 slots=24
tebow134 slots=24
tebow135 slots=24

mpirun –np 361 –machinefile mach_burn_24s hostname

--
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node:  tebow124

This usually is due to not having the required NUMA support installed
on the node. In some Linux distributions, the required support is
contained in the libnumactl and libnumactl-devel packages.
This is a warning only; your job will continue, though performance may be 
degraded.
--
--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: NONE
   Node:tebow125
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

All the compute nodes (tebow121-135) have 24+ cores on them.

Any ideas?  Thanks!

--Jack


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, February 24, 2015 1:57 PM
To: Open MPI Users
Subject: Re: [OMPI users] machinefile binding error

Ah, now that’s a “feature” :-)

Seriously, it *is* actually a new feature of the 1.8 series. We now go out and 
actually sense the number of cores on the system and set the number of slots to 
that value unless you tell us otherwise. It was something people continually 
nagged us about, and so we made the change.

In your case, you could just put slots=1 on the first line of your machine file 
and we should respect it.


On Feb 24, 2015, at 12:49 PM, Galloway, Jack D <ja...@lanl.gov> wrote:

I recently upgraded my CentOS kernel and am running 2.6.32-504.8.1.el6.x86_64, 
in this upgrade I also decided to upgrade my intel/openmpi codes.

I upgraded from:

intel version 13.1.2, with openmpi 1.6.5
intel 15.0.2, with openmpi 1.8.4

Previously a command of “mpirun –np NP –machinefile MACH executable” would 
return expected results, particularly in how the machinefile was mapped to mpi 
tasks.  However, now when I try to run a code (which worked in the 13.1.2/1.6.5 
paradigm) things behave anomalously.

For instance, if I have machine file (“mach_burn_24s”) that consists of:
tebow
tebow121 slots=24
tebow122 slots=24
tebow123 slots=24
tebow124 slots=24
tebow125 slots=24
tebow126 slots=24
tebow127 slots=24
tebow128 slots=24
tebow129 slots=24
tebow130 slots=24
tebow131 slots=24
te

Re: [OMPI users] Help on getting CMA works

2015-02-24 Thread Nathan Hjelm

I don't know the reasoning for requiring --with-cma to enable CMA but I
am looking at auto-detecting CMA instead of requiring Open MPI to be
configured with --with-cma. This will likely go into the 1.9 release
series and not 1.8.

-Nathan

On Thu, Feb 19, 2015 at 09:31:43PM -0500, Eric Chamberland wrote:
> Maybe it is a stupid question, but... why it is not tested and enabled by
> default at configure time since it is part of the kernel?
> 
> Eric
> 
> 
> On 02/19/2015 03:53 PM, Nathan Hjelm wrote:
> >Great! I will add an MCA variable to force CMA and also enable it if 1)
> >no yama and 2) no PR_SET_PTRACER.
> >
> >You might also look at using xpmem. You can find a version that supports
> >3.x @ https://github.com/hjelmn/xpmem . It is a kernel module +
> >userspace library that can be used by vader as a single-copy mechanism.
> >
> >In benchmarks it performs better than CMA but it may or may not perform
> >better with a real application.
> >
> >See:
> >
> >http://blogs.cisco.com/performance/the-vader-shared-memory-transport-in-open-mpi-now-featuring-3-flavors-of-zero-copy
> >
> >-Nathan
> >
> >On Thu, Feb 19, 2015 at 03:32:43PM -0500, Eric Chamberland wrote:
> >>On 02/19/2015 02:58 PM, Nathan Hjelm wrote:
> >>>On Thu, Feb 19, 2015 at 12:16:49PM -0500, Eric Chamberland wrote:
> On 02/19/2015 11:56 AM, Nathan Hjelm wrote:
> >If you have yama installed you can try:
> Nope, I do not have it installed... is it absolutely necessary? (and would
> it change something when it fails when I am root?)
> 
> Other question: In addition to "--with-cma" configure flag, do we have to
> pass any options to "mpicc" when compiling/linking an mpi application to 
> use
> cma?
> >>>No. CMA should work out of the box. You appear to have a setup I haven't
> >>>yet tested. It doesn't have yama nor does it have the PR_SET_PTRACER
> >>>prctl. Its quite possible there are no restriction on ptrace in this
> >>>setup. Can you try changing the following line at
> >>>opal/mca/btl/vader/btl_vader_component.c:370 from:
> >>>
> >>>bool cma_happy = false;
> >>>
> >>>to
> >>>
> >>>bool cma_happy = true;
> >>>
> >>ok! (as of the officiel release, this is line 386.)
> >>
> >>>and let me know if that works. If it does I will update vader to allow
> >>>CMA in this configuration.
> >>Yep!  It now works perfectly.  Testing with
> >>https://computing.llnl.gov/tutorials/mpi/samples/C/mpi_bandwidth.c, on my
> >>own computer (dual Xeon), I have this:
> >>
> >>Without CMA:
> >>
> >>***Message size:  100 *** best  /  avg  / worst (MB/sec)
> >>task pair:0 -1:8363.52 / 7946.77 / 5391.14
> >>
> >>with CMA:
> >>task pair:0 -1:9137.92 / 8955.98 / 7489.83
> >>
> >>Great!
> >>
> >>Now I have to bench my real application... ;-)
> >>
> >>Thanks!
> >>
> >>Eric
> >>




Re: [OMPI users] machinefile binding error

2015-02-24 Thread Ralph Castain
It looks to me like some of the nodes don’t have the required numactl packages 
installed. Why don’t you try launching the job without binding, just to see if 
everything works?

Just add “--bind-to none” to your cmd line and see if things work.
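
For example, using the command from earlier in the thread:

mpirun -np 361 -machinefile mach_burn_24s --bind-to none hostname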


> On Feb 24, 2015, at 2:21 PM, Galloway, Jack D  wrote:
> 
> I think the error may be due to a new architecture change (brought on perhaps 
> by the intel compilers?).  Bad wording here, but I’m really stumbling.  As I 
> add processors to the mpirun hostname call, at ~100 processors I get the 
> following error, which may be informative to more seasoned eyes.  
> Additionally, I’ve attached the config.log in case something stands out, 
> grepping on “catastrophic error” gives not too many results, but I don’t know 
> if the error may be there or more subtle.
>  
> --
> WARNING: a request was made to bind a process. While the system
> supports binding the process itself, at least one node does NOT
> support binding memory to the process location.
>  
>   Node:  tebow124
>  
> This usually is due to not having the required NUMA support installed
> on the node. In some Linux distributions, the required support is
> contained in the libnumactl and libnumactl-devel packages.
> This is a warning only; your job will continue, though performance may be 
> degraded.
> --
> tebow
> --
> Open MPI tried to bind a new process, but something went wrong.  The
> process was killed without launching the target application.  Your job
> will now abort.
>  
>   Local host:tebow125
>   Application name:  /bin/hostname
>   Error message: hwloc_set_cpubind returned "Error" for bitmap "8,24"
>   Location:  odls_default_module.c:551
> --
>  
> Thanks,
> --Jack
>  
>  
>  
>  
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Galloway, Jack D
> Sent: Tuesday, February 24, 2015 2:31 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] machinefile binding error
>  
> Thank you sir, that fixed the first problem, hopefully the second is as easy!
>  
> I still get the second error when trying to farm out on a “large” number of 
> processors:
>  
> machine file (“mach_burn_24s”):
> tebow
> tebow121 slots=24
> tebow122 slots=24
> tebow123 slots=24
> tebow124 slots=24
> tebow125 slots=24
> tebow126 slots=24
> tebow127 slots=24
> tebow128 slots=24
> tebow129 slots=24
> tebow130 slots=24
> tebow131 slots=24
> tebow132 slots=24
> tebow133 slots=24
> tebow134 slots=24
> tebow135 slots=24
>  
> mpirun –np 361 –machinefile mach_burn_24s hostname
>  
> --
> WARNING: a request was made to bind a process. While the system
> supports binding the process itself, at least one node does NOT
> support binding memory to the process location.
>  
>   Node:  tebow124
>  
> This usually is due to not having the required NUMA support installed
> on the node. In some Linux distributions, the required support is
> contained in the libnumactl and libnumactl-devel packages.
> This is a warning only; your job will continue, though performance may be 
> degraded.
> --
> --
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>  
>Bind to: NONE
>Node:tebow125
>#processes:  2
>#cpus:   1
>  
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --
>  
> All the compute nodes (tebow121-135) have 24+ cores on them. 
>  
> Any ideas?  Thanks!
>  
> --Jack  
>  
>  
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Tuesday, February 24, 2015 1:57 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] machinefile binding error
>  
> Ah, now that’s a “feature” :-)
>  
> Seriously, it *is* actually a new feature of the 1.8 series. We now go out 
> and actually sense the number of cores on the system and set the number of 
> slots to that value unless you tell us otherwise. It was something people 
> continually nagged us about, and so we made the change.
>  
> In your case, you could just put slots=1 on the first line of your machine 
> file and we should respect it.
>  
>  
> On Feb 24, 2015, at 12:49 PM, Galloway, Jack D <ja...@lanl.gov> wrote:
>  
> I recently upgraded my CentOS kernel and am running 
> 2.6.32-504.8.1.el6.x86_64, in this upgrade I also decided to upgrade my 
> intel/