I'll put my question upfront: Is there a way to tell mpirun to do the equivalent of 'numactl --interleave=all' on the processes it runs? Or, if I want to control the memory placement of applications run through MPI, will I need to use libnuma for this? I tried "mpirun <Open MPI options> numactl --interleave=all <app name and options>". I don't know how to explicitly verify whether this actually ran the numactl command on each host, but based on the performance I'm seeing it doesn't seem like it did (or something else is causing my poor performance).
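In case the answer is that I do need libnuma, the fallback I have in mind is roughly the following (an untested sketch only; interleave_everything() is just my placeholder name, and the idea is to set the policy early, before the worker threads are created, so they inherit it):

    /* Untested sketch: set an interleave-all policy for the process
     * before any large allocations happen.  Link with -lnuma. */
    #include <numa.h>
    #include <stdio.h>

    static void interleave_everything(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this node\n");
            return;
        }
        /* Roughly equivalent to launching under 'numactl --interleave=all':
         * memory allocated after this call is interleaved across all nodes.
         * The policy is per-thread and inherited by threads created later,
         * so this needs to run before the worker threads start. */
        numa_set_interleave_mask(numa_all_nodes_ptr);
    }

    int main(int argc, char **argv)
    {
        interleave_everything();
        /* ... MPI_Init(), allocate the application's large buffers,
         * spawn worker threads, etc. ... */
        return 0;
    }

I'd rather not add a libnuma dependency if mpirun or numactl can handle this for me, which is why I'm asking.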
More details: For the particular image I'm benchmarking with, I have a multi-threaded application that needs about 60 GB of RAM when run on a single machine. It allocates one large ping/pong buffer upfront and uses it to avoid copies when updating the image at each step. I'm running in AWS and comparing performance on an r3.8xlarge (16 CPUs, 244 GB RAM, 10 Gbps) vs. an x1.32xlarge (64 CPUs, 2 TB RAM, 20 Gbps). Running on a single X1, my application runs ~3x faster than on the R3, and running it under numactl --interleave=all has a significant positive effect on its performance, I assume because the threads end up accessing memory spread across the NUMA nodes rather than most of them having slow access to one node's memory. So far so good.

My application also supports distributing across machines via MPI. When doing this, the memory requirement scales linearly with the number of machines, and there are three pinch points that involve large (GBs of data) all-to-all communication. For the slowest of the three, I've pipelined the step and use MPI_Ialltoallv() to hide as much of the latency as I can (a simplified sketch of this pattern is at the end of this message). When run on R3 instances, overall runtime scales very well as machines are added. Still so far so good.

My problems start with the X1 instances. I do get scaling as I add more machines, but it is significantly worse than with the R3s, and it isn't just a matter of there being more CPUs and the MPI communication time dominating: the actual time spent in the MPI all-to-all communication is significantly longer than on the R3s for the same number of machines, despite the network bandwidth being twice as high. (In a post from a few days ago some folks helped me with MPI settings to improve the network communication speed, and from toy MPI benchmarks I know I'm getting faster communication on the X1s than on the R3s.) So this feels likely to be a NUMA issue, though I'd be interested in any other thoughts.

I looked at https://www.open-mpi.org/doc/current/man1/mpirun.1.php but it didn't seem to have what I'm looking for. I want MPI to let my application use all CPUs on the system (I'm the only one running on it); I just want to control the memory placement.

Thanks for the help.
-Adam
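P.S. For reference, the pipelined all-to-all step mentioned above looks roughly like this (heavily simplified; process_chunk() and the per-chunk count/displacement arrays are placeholders, not the real code):

    /* Simplified sketch of the pipelined step: post the exchange for
     * chunk k+1, then process chunk k while k+1 is in flight. */
    #include <mpi.h>

    void process_chunk(char *buf, int k);   /* placeholder for the image update */

    void pipelined_alltoall(int nchunks,
                            int **sendcounts, int **sdispls, char **sendbufs,
                            int **recvcounts, int **rdispls, char **recvbufs,
                            MPI_Comm comm)
    {
        MPI_Request req = MPI_REQUEST_NULL;

        /* Start the exchange for the first chunk. */
        MPI_Ialltoallv(sendbufs[0], sendcounts[0], sdispls[0], MPI_BYTE,
                       recvbufs[0], recvcounts[0], rdispls[0], MPI_BYTE,
                       comm, &req);

        for (int k = 0; k < nchunks; k++) {
            /* Wait for chunk k to finish arriving. */
            MPI_Wait(&req, MPI_STATUS_IGNORE);

            /* Immediately post chunk k+1 so its communication overlaps
             * with the processing of chunk k below. */
            if (k + 1 < nchunks)
                MPI_Ialltoallv(sendbufs[k + 1], sendcounts[k + 1], sdispls[k + 1], MPI_BYTE,
                               recvbufs[k + 1], recvcounts[k + 1], rdispls[k + 1], MPI_BYTE,
                               comm, &req);

            process_chunk(recvbufs[k], k);
        }
    }

The chunking exists purely to overlap the communication with the image update work.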