I see... Now it all makes sense. Since Cpus_allowed(_list) shows the effective CPU mask, I expected Mems_allowed(_list) would do the same.

Thanks for the clarification.

Cheers,
Hristo
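As a side note, the binding that mbind/numactl actually establish can be queried from inside a process with libnuma rather than via Mems_allowed_list. A minimal sketch along these lines should work (illustrative only, no error handling beyond the availability check, link with -lnuma):

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* Nodes the current memory policy binds allocations to; this reflects
       numactl -m / mbind, whereas Mems_allowed_list in /proc/self/status
       only shows the cgroup/cpuset restriction Brice describes below. */
    struct bitmask *mems = numa_get_membind();

    for (int node = 0; node <= numa_max_node(); node++)
        if (numa_bitmask_isbitset(mems, node))
            printf("memory binding includes node %d\n", node);

    numa_bitmask_free(mems);
    return 0;
}

Running it under "numactl -m 0 ./a.out" should report only node 0, while Mems_allowed_list would still show every node the cpuset allows.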
-----Original Message-----
From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Brice Goglin
Sent: Thursday, July 20, 2017 9:52 AM
To: users@lists.open-mpi.org
Subject: Re: [OMPI users] NUMA interaction with Open MPI

Hello

Mems_allowed_list is what your current cgroup/cpuset allows. It is different from what mbind/numactl/hwloc/... change. The former is a root-only restriction that cannot be ignored by processes placed in that cgroup. The latter is a user-changeable binding that must be inside the former.

Brice


Le 19/07/2017 17:29, Iliev, Hristo a écrit :
> Gilles,
>
> Mems_allowed_list has never worked for me:
>
> $ uname -r
> 3.10.0-514.26.1.el7.x86_64
>
> $ numactl -H | grep available
> available: 2 nodes (0-1)
>
> $ grep Mems_allowed_list /proc/self/status
> Mems_allowed_list: 0-1
>
> $ numactl -m 0 grep Mems_allowed_list /proc/self/status
> Mems_allowed_list: 0-1
>
> It seems that whatever structure Mems_allowed_list exposes is outdated. One should use "numactl -s" instead:
>
> $ numactl -s
> policy: default
> preferred node: current
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> cpubind: 0 1
> nodebind: 0 1
> membind: 0 1
>
> $ numactl -m 0 numactl -s
> policy: bind
> preferred node: 0
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> cpubind: 0 1
> nodebind: 0 1
> membind: 0
>
> $ numactl -i all numactl -s
> policy: interleave
> preferred node: 0 (interleave next)
> interleavemask: 0 1
> interleavenode: 0
> physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
> cpubind: 0 1
> nodebind: 0 1
> membind: 0 1
>
> I wouldn't ask Open MPI not to bind the processes, as the policy set by numactl takes precedence over what orterun/the shepherd sets, at least with non-MPI programs:
>
> $ orterun -n 2 --bind-to core --map-by socket numactl -i all numactl -s
> policy: interleave
> preferred node: 1 (interleave next)
> interleavemask: 0 1
> interleavenode: 1
> physcpubind: 0 24
> cpubind: 0
> nodebind: 0
> membind: 0 1
> policy: interleave
> preferred node: 1 (interleave next)
> interleavemask: 0 1
> interleavenode: 1
> physcpubind: 12 36
> cpubind: 1
> nodebind: 1
> membind: 0 1
>
> Cheers,
> Hristo
>
> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles Gouaillardet
> Sent: Monday, July 17, 2017 5:43 AM
> To: Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] NUMA interaction with Open MPI
>
> Adam,
>
> keep in mind that by default, recent Open MPI binds MPI tasks
> - to cores if -np 2
> - to a NUMA domain otherwise (which is a socket in most cases, unless you are running on a Xeon Phi)
>
> so unless you specifically asked mpirun to do a binding consistent with your needs, you might simply try no binding at all:
> mpirun --bind-to none ...
>
> I am not sure whether you can directly ask Open MPI to do the memory binding you expect from the command line. Anyway, as far as I am concerned,
> mpirun --bind-to none numactl --interleave=all ...
> should do what you expect.
>
> If you want to be sure, you can simply run
> mpirun --bind-to none numactl --interleave=all grep Mems_allowed_list /proc/self/status
> and that should give you a hint.
>
> Cheers,
>
> Gilles
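If wrapping the launch in numactl is inconvenient, roughly the same effect should also be obtainable from inside the application with libnuma, as long as it happens before the large buffers are allocated. The sketch below is illustrative only (the MPI calls are just a skeleton, not anyone's actual program) and needs to be linked with -lnuma:

#include <mpi.h>
#include <numa.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Interleave all subsequent allocations across every node this process
       is allowed to use, roughly what numactl --interleave=all does for the
       whole run. This only affects memory faulted in after the call, so it
       must run before the big ping/pong buffers are created. */
    if (numa_available() >= 0)
        numa_set_interleave_mask(numa_all_nodes_ptr);

    /* ... allocate buffers and do the real work here ... */

    MPI_Finalize();
    return 0;
}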
> On Mon, Jul 17, 2017 at 4:19 AM, Adam Sylvester <op8...@gmail.com> wrote:
>> I'll start with my question upfront: Is there a way to do the equivalent of telling mpirun to do 'numactl --interleave=all' on the processes that it runs? Or, if I want to control the memory placement of my applications run through MPI, will I need to use libnuma for this? I tried doing "mpirun <Open MPI options> numactl --interleave=all <app name and options>". I don't know how to explicitly verify whether this ran the numactl command on each host, but based on the performance I'm seeing, it doesn't seem like it did (or something else is causing my poor performance).
>>
>> More details: For the particular image I'm benchmarking with, I have a multi-threaded application which requires 60 GB of RAM to run if it's run on one machine. It allocates one large ping/pong buffer upfront and uses this to avoid copies when updating the image at each step. I'm running in AWS and comparing performance on an r3.8xlarge (16 CPUs, 244 GB RAM, 10 Gbps) vs. an x1.32xlarge (64 CPUs, 2 TB RAM, 20 Gbps). Running on a single X1, my application runs ~3x faster than on the R3; using numactl --interleave=all has a significant positive effect on its performance, I assume because the various threads that are running are accessing memory spread out across the nodes rather than most of them having slow access to it. So far so good.
>>
>> My application also supports distributing across machines via MPI. When doing this, the memory requirement scales linearly with the number of machines; there are three pinch points that involve large (GBs of data) all-to-all communication. For the slowest of these three, I've pipelined this step and use MPI_Ialltoallv() to hide as much of the latency as I can. When run on R3 instances, overall runtime scales very well as machines are added. Still so far so good.
>>
>> My problems start with the X1 instances. I do get scaling as I add more machines, but it is significantly worse than with the R3s. This isn't just a matter of there being more CPUs and the MPI communication time dominating. The actual time spent in the MPI all-to-all communication is significantly longer than on the R3s for the same number of machines, despite the network bandwidth being twice as high (in a post from a few days ago some folks helped me with MPI settings to improve the network communication speed). From toy benchmark MPI tests I know I'm getting faster communication on the X1s than on the R3s, so this feels likely to be an issue with NUMA, though I'd be interested in any other thoughts.
>>
>> I looked at https://www.open-mpi.org/doc/current/man1/mpirun.1.php but this didn't seem to have what I was looking for. I want MPI to let my application use all CPUs on the system (I'm the only one running on it)... I just want to control the memory placement.
>>
>> Thanks for the help.
>> -Adam
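For reference, the overlap pattern described above for the slowest exchange (start MPI_Ialltoallv, compute on data already in hand, then wait) looks roughly like the sketch below; the function name, buffers, counts and the compute step are placeholders, not the actual application code:

#include <mpi.h>

/* Rough shape of hiding all-to-all latency behind independent work.
   Counts/displacements are expressed in bytes purely for illustration. */
void pipelined_exchange(const void *sendbuf, const int *sendcounts, const int *sdispls,
                        void *recvbuf, const int *recvcounts, const int *rdispls,
                        MPI_Comm comm)
{
    MPI_Request req;

    /* Start the exchange without blocking. */
    MPI_Ialltoallv(sendbuf, sendcounts, sdispls, MPI_BYTE,
                   recvbuf, recvcounts, rdispls, MPI_BYTE,
                   comm, &req);

    /* Do work that does not depend on recvbuf while the transfer runs,
       e.g. process the chunk received in the previous pipeline stage. */
    /* compute_on_previous_chunk(); */

    /* Only touch recvbuf once the exchange has completed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}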