Adam, keep in mind that by default, recent Open MPI binds MPI tasks
 - to cores if -np 2
 - to a NUMA domain otherwise (which is a socket in most cases, unless you are running on a Xeon Phi).
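(You can see exactly which binding ends up being applied by adding --report-bindings to your mpirun command line, for example

    mpirun --report-bindings -np 16 ./your_app

each rank's binding is then reported on stderr; -np 16 and ./your_app are of course just placeholders for your own job.)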
So unless you specifically asked mpirun for a binding consistent with your needs, you might simply try asking for no binding at all:

    mpirun --bind-to none ...

I am not sure whether you can directly ask Open MPI to do the memory binding you expect from the command line. Anyway, as far as I am concerned,

    mpirun --bind-to none numactl --interleave=all ...

should do what you expect. If you want to be sure, you can simply run

    mpirun --bind-to none numactl --interleave=all grep Mems_allowed_list /proc/self/status

and that should give you a hint.
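If you do end up having to control placement from inside the application instead, libnuma can do it. A minimal, untested sketch (the buffer size and names below are only placeholders, not your real code) would be something along these lines, run early, before the big ping/pong buffer is allocated:

#include <numa.h>      /* link with -lnuma */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* interleave all allocations made from here on across every NUMA node,
     * roughly what numactl --interleave=all does for the whole process */
    numa_set_interleave_mask(numa_all_nodes_ptr);

    /* alternatively, interleave just the one large buffer explicitly */
    size_t len = 1UL << 30;                  /* placeholder size: 1 GB */
    void *buf = numa_alloc_interleaved(len);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_interleaved failed\n");
        return 1;
    }

    /* ... use buf as the ping/pong buffer ... */

    numa_free(buf, len);
    return 0;
}

Note that numa_set_interleave_mask() only affects allocations made after it is called, so it has to run before the buffer is created.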
Cheers,

Gilles

On Mon, Jul 17, 2017 at 4:19 AM, Adam Sylvester <op8...@gmail.com> wrote:
> I'll start with my question upfront: Is there a way to do the equivalent
> of telling mpirun to do 'numactl --interleave=all' on the processes that
> it runs? Or, if I want to control the memory placement of my applications
> run through MPI, will I need to use libnuma for this? I tried doing
> "mpirun <Open MPI options> numactl --interleave=all <app name and
> options>". I don't know how to explicitly verify whether this ran the
> numactl command on each host, but based on the performance I'm seeing it
> doesn't seem like it did (or something else is causing my poor
> performance).
>
> More details: For the particular image I'm benchmarking with, I have a
> multi-threaded application which requires 60 GB of RAM when it's run on
> one machine. It allocates one large ping/pong buffer upfront and uses
> this to avoid copies when updating the image at each step. I'm running
> in AWS and comparing performance on an r3.8xlarge (16 CPUs, 244 GB RAM,
> 10 Gbps) vs. an x1.32xlarge (64 CPUs, 2 TB RAM, 20 Gbps). Running on a
> single X1, my application runs ~3x faster than on the R3; using numactl
> --interleave=all has a significant positive effect on its performance, I
> assume because the various threads that are running are accessing memory
> spread out across the nodes rather than most of them having slow access
> to it. So far so good.
>
> My application also supports distributing across machines via MPI. When
> doing this, the memory requirement scales linearly with the number of
> machines; there are three pinch points that involve large (GBs of data)
> all-to-all communication. For the slowest of these three, I've pipelined
> this step and use MPI_Ialltoallv() to hide as much of the latency as I
> can. When run on R3 instances, overall runtime scales very well as
> machines are added. Still so far so good.
>
> My problems start with the X1 instances. I do get scaling as I add more
> machines, but it is significantly worse than with the R3s. This isn't
> just a matter of there being more CPUs and the MPI communication time
> dominating: the actual time spent in the MPI all-to-all communication is
> significantly longer than on the R3s for the same number of machines,
> despite the network bandwidth being twice as high (in a post from a few
> days ago some folks helped me with MPI settings to improve the network
> communication speed - from toy benchmark MPI tests I know I'm getting
> faster communication on the X1s than on the R3s), so this feels likely
> to be an issue with NUMA, though I'd be interested in any other thoughts.
>
> I looked at https://www.open-mpi.org/doc/current/man1/mpirun.1.php but
> this didn't seem to have what I was looking for. I want MPI to let my
> application use all CPUs on the system (I'm the only one running on
> it)... I just want to control the memory placement.
>
> Thanks for the help.
> -Adam