I see... Now it all makes sense. Since Cpus_allowed(_list) shows the effective 
CPU mask, I expected Mems_allowed(_list) would do the same.

Thanks for the clarification.

Cheers,
Hristo

-----Original Message-----
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Brice Goglin
Sent: Thursday, July 20, 2017 9:52 AM
To: users@lists.open-mpi.org
Subject: Re: [OMPI users] NUMA interaction with Open MPI

Hello

Mems_allowed_list is what your current cgroup/cpuset allows. It is different 
from what mbind/numactl/hwloc/... change.
The former is a root-only restriction that cannot be ignored by processes 
placed in that cgroup.
The latter is a user-changeable binding that must be inside the former.
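
(A minimal way to query both sets from inside a process, sketched here as an
illustration only: it assumes a reasonably recent libnuma so that numaif.h
defines MPOL_F_MEMS_ALLOWED, and MAX_NODES is just an arbitrary upper bound.
get_mempolicy() with MPOL_F_MEMS_ALLOWED returns the cpuset/cgroup restriction,
while a plain query returns the policy that numactl/mbind set.)

/* Sketch: compare the cpuset restriction with the current memory policy.
 * Assumes libnuma headers; build with: gcc query_mems.c -o query_mems -lnuma
 * Error checking omitted for brevity. */
#include <numaif.h>
#include <stdio.h>

#define MAX_NODES 1024  /* arbitrary upper bound, enough for this sketch */

int main(void)
{
    unsigned long allowed[MAX_NODES / (8 * sizeof(unsigned long))] = {0};
    unsigned long policy_nodes[MAX_NODES / (8 * sizeof(unsigned long))] = {0};
    int mode = 0;

    /* Root-only cpuset/cgroup restriction (what Mems_allowed_list reports). */
    get_mempolicy(&mode, allowed, MAX_NODES, NULL, MPOL_F_MEMS_ALLOWED);

    /* User-changeable policy of this thread (what numactl/mbind change). */
    get_mempolicy(&mode, policy_nodes, MAX_NODES, NULL, 0);

    printf("cpuset-allowed nodes (first word): 0x%lx\n", allowed[0]);
    printf("policy mode %d, policy nodes (first word): 0x%lx\n",
           mode, policy_nodes[0]);
    return 0;
}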

Brice




Le 19/07/2017 17:29, Iliev, Hristo a écrit :
> Gilles,
>
> Mems_allowed_list has never worked for me:
>
> $ uname -r
> 3.10.0-514.26.1.el7.x86_64
>
> $ numactl -H | grep available
> available: 2 nodes (0-1)
>
> $ grep Mems_allowed_list /proc/self/status
> Mems_allowed_list:      0-1
>
> $ numactl -m 0 grep Mems_allowed_list /proc/self/status
> Mems_allowed_list:      0-1
>
> It seems that whatever structure Mems_allowed_list exposes is outdated. One 
> should use "numactl -s" instead:
>
> $ numactl -s
> policy: default
> preferred node: current
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 
> 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 
> 45 46 47
> cpubind: 0 1
> nodebind: 0 1
> membind: 0 1
>
> $ numactl -m 0 numactl -s
> policy: bind
> preferred node: 0
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 
> 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 
> 45 46 47
> cpubind: 0 1
> nodebind: 0 1
> membind: 0
>
> $ numactl -i all numactl -s
> policy: interleave
> preferred node: 0 (interleave next)
> interleavemask: 0 1
> interleavenode: 0
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 
> 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 
> 45 46 47
> cpubind: 0 1
> nodebind: 0 1
> membind: 0 1
>
> I wouldn't ask Open MPI not to bind the processes, as the policy set by 
> numactl takes precedence over what orterun/the shepherd sets, at least 
> with non-MPI programs:
>
> $ orterun -n 2 --bind-to core --map-by socket numactl -i all numactl 
> -s
> policy: interleave
> preferred node: 1 (interleave next)
> interleavemask: 0 1
> interleavenode: 1
> physcpubind: 0 24
> cpubind: 0
> nodebind: 0
> membind: 0 1
> policy: interleave
> preferred node: 1 (interleave next)
> interleavemask: 0 1
> interleavenode: 1
> physcpubind: 12 36
> cpubind: 1
> nodebind: 1
> membind: 0 1
>
> Cheers,
> Hristo
>
> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
> Gilles Gouaillardet
> Sent: Monday, July 17, 2017 5:43 AM
> To: Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] NUMA interaction with Open MPI
>
> Adam,
>
> keep in mind that by default, recent Open MPI binds MPI tasks
> - to cores if -np 2
> - to a NUMA domain otherwise (which is a socket in most cases, unless 
> you are running on a Xeon Phi)
>
> so unless you specifically asked mpirun to do a binding consistent 
> with your needs, you might simply try no binding at all:
> mpirun --bind-to none ...
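
(One way to see what binding mpirun actually applied, assuming a recent
Open MPI, is to add --report-bindings; "./app" below is a placeholder:

    mpirun --report-bindings --bind-to none -np 4 ./app

which reports each rank's binding, or that it is not bound, at launch.)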
>
> I am not sure whether you can directly ask Open MPI to do the memory 
> binding you expect from the command line.
> Anyway, as far as I am concerned,
> mpirun --bind-to none numactl --interleave=all ...
> should do what you expect.
>
> If you want to be sure, you can simply run
> mpirun --bind-to none numactl --interleave=all grep Mems_allowed_list /proc/self/status
> and that should give you a hint.
>
> Cheers,
>
> Gilles
>
>
> On Mon, Jul 17, 2017 at 4:19 AM, Adam Sylvester <op8...@gmail.com> wrote:
>> I'll start with my question upfront: Is there a way to do the 
>> equivalent of telling mpirun to do 'numactl --interleave=all' on the 
>> processes that it runs?  Or, if I want to control the memory placement 
>> of my applications run through MPI, will I need to use libnuma for 
>> this?  I tried doing "mpirun <Open MPI options> numactl 
>> --interleave=all <app name and options>".  I don't know how to 
>> explicitly verify whether this ran the numactl command on each host, 
>> but based on the performance I'm seeing, it doesn't seem like it did (or 
>> something else is causing my poor performance).
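
(In case libnuma does turn out to be needed: a minimal sketch of the
in-process equivalent of 'numactl --interleave=all', assuming libnuma is
available and that it runs before the large buffers are allocated. The buffer
size and names below are placeholders, not the application's.)

/* Sketch only: interleave subsequent allocations across all allowed nodes.
 * Assumes libnuma (build with -lnuma). */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not available on this system\n");
        return 1;
    }

    /* In-process counterpart of 'numactl --interleave=all'. */
    numa_set_interleave_mask(numa_all_nodes_ptr);

    /* Pages of this placeholder buffer are interleaved across nodes as
     * they are first touched. */
    size_t bytes = (size_t)1 << 30;   /* 1 GiB, arbitrary for the sketch */
    char *buf = malloc(bytes);
    if (buf == NULL)
        return 1;
    for (size_t i = 0; i < bytes; i += 4096)
        buf[i] = 0;                   /* touch to place the pages */

    free(buf);
    return 0;
}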
>>
>> More details: For the particular image I'm benchmarking with, I have 
>> a multi-threaded application which requires 60 GB of RAM to run if 
>> it's run on one machine.  It allocates one large ping/pong buffer 
>> upfront and uses this to avoid copies when updating the image at each 
>> step.  I'm running in AWS and comparing performance on an r3.8xlarge 
>> (16 CPUs, 244 GB RAM, 10 Gbps) vs. an x1.32xlarge (64 CPUs, 2 TB RAM, 
>> 20 Gbps).  Running on a single X1, my application runs ~3x faster 
>> than the R3; using numactl --interleave=all has a significant 
>> positive effect on its performance; I assume this is because the 
>> various threads are accessing memory spread out across the nodes 
>> rather than most of them having slow access to it.  So far so good.
>>
>> My application also supports distributing across machines via MPI.  
>> When doing this, the memory requirement scales linearly with the 
>> number of machines; there are three pinch points that involve large 
>> (GBs of data) all-to-all communication.  For the slowest of these 
>> three, I've pipelined this step and use MPI_Ialltoallv() to hide as much of 
>> the latency as I can.
>> When run on R3 instances, overall runtime scales very well as 
>> machines are added.  Still so far so good.
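
(For reference, a rough sketch of the overlap pattern described above; the
buffer names, counts, and the compute loop are placeholders, not the real
application code.)

/* Rough sketch: overlap an all-to-all exchange with local computation. */
#include <mpi.h>

void pipelined_step(double *sendbuf, const int *sendcounts, const int *sdispls,
                    double *recvbuf, const int *recvcounts, const int *rdispls,
                    double *local_work, int local_n, MPI_Comm comm)
{
    MPI_Request req;

    /* Start exchanging the chunk produced in the previous stage... */
    MPI_Ialltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                   recvbuf, recvcounts, rdispls, MPI_DOUBLE,
                   comm, &req);

    /* ...and keep computing on local data while it is in flight. */
    for (int i = 0; i < local_n; ++i)
        local_work[i] *= 2.0;         /* placeholder compute */

    /* Block only when the exchanged data is actually needed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}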
>>
>> My problems start with the X1 instances.  I do get scaling as I add 
>> more machines, but it is significantly worse than with the R3s.  This 
>> isn't just a matter of there being more CPUs and the MPI communication time 
>> dominating.
>> The actual time spent in the MPI all-to-all communication is 
>> significantly longer than on the R3s for the same number of machines, 
>> despite the network bandwidth being twice as high (in a post from a 
>> few days ago some folks helped me with MPI settings to improve the 
>> network communication speed; from toy benchmark MPI tests I know I'm 
>> getting faster communication on the X1s than on the R3s), so this 
>> feels likely to be an issue with NUMA, though I'd be interested in 
>> any other thoughts.
>>
>> I looked at https://www.open-mpi.org/doc/current/man1/mpirun.1.php 
>> but this didn't seem to have what I was looking for.  I want MPI to 
>> let my application use all CPUs on the system (I'm the only one 
>> running on it)... I just want to control the memory placement.
>>
>> Thanks for the help.
>> -Adam
>>
>>
>>

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users