Giles,

Mems_allowed_list has never worked for me:

$ uname -r
3.10.0-514.26.1.el7.x86_64

$ numactl -H | grep available
available: 2 nodes (0-1)

$ grep Mems_allowed_list /proc/self/status
Mems_allowed_list:      0-1

$ numactl -m 0 grep Mems_allowed_list /proc/self/status
Mems_allowed_list:      0-1

It seems that Mems_allowed_list reflects only the cpuset restrictions and not 
the memory policy that numactl installs via set_mempolicy(), so it never 
changes here. One should use "numactl -s" instead:

$ numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 0 1
nodebind: 0 1
membind: 0 1

$ numactl -m 0 numactl -s
policy: bind
preferred node: 0
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 0 1
nodebind: 0 1
membind: 0

$ numactl -i all numactl -s
policy: interleave
preferred node: 0 (interleave next)
interleavemask: 0 1
interleavenode: 0
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 0 1
nodebind: 0 1
membind: 0 1
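
If spawning "numactl -s" in a child process is inconvenient, the same check can 
be done programmatically. The following is a minimal C sketch of my own (not 
something from this thread) that uses get_mempolicy() from libnuma to print the 
calling process's policy and node mask; build with something like 
"gcc policy_check.c -lnuma" (the file name is just an example, and no more than 
64 NUMA nodes are assumed):

/* policy_check.c - print the calling process's NUMA memory policy,
 * roughly the "policy:" / node-mask part of what "numactl -s" reports */
#include <stdio.h>
#include <numa.h>     /* numa_available(), numa_max_node() */
#include <numaif.h>   /* get_mempolicy(), MPOL_* constants */

int main(void)
{
    int mode, node;
    unsigned long nodemask = 0;   /* assumes <= 64 NUMA nodes */
    const char *name;

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    /* addr == NULL and flags == 0: query the calling thread's policy */
    if (get_mempolicy(&mode, &nodemask, 8 * sizeof(nodemask), NULL, 0) != 0) {
        perror("get_mempolicy");
        return 1;
    }

    name = (mode == MPOL_BIND)       ? "bind" :
           (mode == MPOL_INTERLEAVE) ? "interleave" :
           (mode == MPOL_PREFERRED)  ? "preferred" : "default";
    printf("policy: %s, nodes:", name);
    for (node = 0; node <= numa_max_node(); node++)
        if (nodemask & (1UL << node))
            printf(" %d", node);
    printf("\n");
    return 0;
}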

I wouldn't ask Open MPI not to bind the processes, since the memory policy set 
by numactl takes precedence over what orterun/shepherd sets, at least with 
non-MPI programs:

$ orterun -n 2 --bind-to core --map-by socket numactl -i all numactl -s
policy: interleave
preferred node: 1 (interleave next)
interleavemask: 0 1
interleavenode: 1
physcpubind: 0 24
cpubind: 0
nodebind: 0
membind: 0 1
policy: interleave
preferred node: 1 (interleave next)
interleavemask: 0 1
interleavenode: 1
physcpubind: 12 36
cpubind: 1
nodebind: 1
membind: 0 1
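
And regarding Adam's question about libnuma: instead of wrapping the binary in 
"numactl -i all", the application can request interleaving itself, early in 
main() and before the large buffers are first touched. This is a hedged sketch 
of my own, assuming the libnuma v2 API (numa_all_nodes_ptr) is available; build 
with something like "mpicc interleave_self.c -lnuma" (file name illustrative):

/* interleave_self.c - request node-interleaved allocations from inside
 * the application rather than via an external "numactl -i all" wrapper */
#include <stdio.h>
#include <numa.h>   /* numa_available(), numa_set_interleave_mask(),
                       numa_all_nodes_ptr */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;

    /* interleave all future allocations of this process across all allowed
     * NUMA nodes; pages already touched keep their current placement */
    if (numa_available() >= 0)
        numa_set_interleave_mask(numa_all_nodes_ptr);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: interleave requested via libnuma\n", rank);

    /* ... allocate and first-touch the large ping/pong buffers here ... */

    MPI_Finalize();
    return 0;
}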

Cheers,
Hristo

-----Original Message-----
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles 
Gouaillardet
Sent: Monday, July 17, 2017 5:43 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] NUMA interaction with Open MPI

Adam,

Keep in mind that by default, recent Open MPI binds MPI tasks
- to cores if -np 2
- to a NUMA domain otherwise (which is a socket in most cases, unless
you are running on a Xeon Phi)

So unless you specifically asked mpirun to do a binding consistent
with your needs, you might simply try asking for no binding at all:
mpirun --bind-to none ...

I am not sure whether you can directly ask Open MPI to do the memory
binding you expect from the command line.
Anyway, as far as I am concerned,
mpirun --bind-to none numactl --interleave=all ...
should do what you expect.

If you want to be sure, you can simply run
mpirun --bind-to none numactl --interleave=all grep Mems_allowed_list /proc/self/status
and that should give you a hint.

Cheers,

Gilles


On Mon, Jul 17, 2017 at 4:19 AM, Adam Sylvester <op8...@gmail.com> wrote:
> I'll start with my question upfront: Is there a way to do the equivalent of
> telling mpirun to do 'numactl --interleave=all' on the processes that it
> runs?  Or if I want to control the memory placement of my applications run
> through MPI will I need to use libnuma for this?  I tried doing "mpirun
> <Open MPI options> numactl --interleave=all <app name and options>".  I
> don't know how to explicitly verify if this ran the numactl command on each
> host or not but based on the performance I'm seeing, it doesn't seem like it
> did (or something else is causing my poor performance).
>
> More details: For the particular image I'm benchmarking with, I have a
> multi-threaded application which requires 60 GB of RAM to run if it's run on
> one machine.  It allocates one large ping/pong buffer upfront and uses this
> to avoid copies when updating the image at each step.  I'm running in AWS
> and comparing performance on an r3.8xlarge (16 CPUs, 244 GB RAM, 10 Gbps)
> vs. an x1.32xlarge (64 CPUs, 2 TB RAM, 20 Gbps).  Running on a single X1, my
> application runs ~3x faster than the R3; using numactl --interleave=all has
> a significant positive effect on its performance, I assume because the
> various threads that are running are accessing memory spread out across the
> nodes rather than most of them having slow access to it.  So far so good.
>
> My application also supports distributing across machines via MPI.  When
> doing this, the memory requirement scales linearly with the number of
> machines; there are three pinch points that involve large (GBs of data)
> all-to-all communication.  For the slowest of these three, I've pipelined
> this step and use MPI_Ialltoallv() to hide as much of the latency as I can.
> When run on R3 instances, overall runtime scales very well as machines are
> added.  Still so far so good.
>
> My problems start with the X1 instances.  I do get scaling as I add more
> machines, but it is significantly worse than with the R3s.  This isn't just
> a matter of there being more CPUs and the MPI communication time dominating.
> The actual time spent in the MPI all-to-all communication is significantly
> longer than on the R3s for the same number of machines, despite the network
> bandwidth being twice as high (in a post from a few days ago some folks
> helped me with MPI settings to improve the network communication speed -
> from toy benchmark MPI tests I know I'm getting faster communication on the
> X1s than on the R3s), so this feels likely to be an issue with NUMA, though
> I'd be interested in any other thoughts.
>
> I looked at https://www.open-mpi.org/doc/current/man1/mpirun.1.php but this
> didn't seem to have what I was looking for.  I want MPI to let my
> application use all CPUs on the system (I'm the only one running on it)... I
> just want to control the memory placement.
>
> Thanks for the help.
> -Adam
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
