I'll start with my question upfront: Is there a way to tell mpirun to do
the equivalent of 'numactl --interleave=all' on the processes that it
runs?  Or, if I want to control the memory placement of applications run
through MPI, will I need to use libnuma for this?  I tried running "mpirun
<Open MPI options> numactl --interleave=all <app name and options>".  I
don't know how to explicitly verify whether this actually ran the numactl
command on each host, but based on the performance I'm seeing, it doesn't
seem like it did (or something else is causing my poor performance).
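
In case it helps with diagnosing this, here is the sort of per-rank check
I was planning to add to see whether the interleave policy actually took
effect on each host.  It's just a sketch, assuming libnuma is available;
I'd build it with something like "mpicc check_interleave.c -lnuma" (the
file name is a placeholder) and launch it the same way as my application:

  #include <mpi.h>
  #include <numa.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank, len;
      char host[MPI_MAX_PROCESSOR_NAME];
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Get_processor_name(host, &len);

      if (numa_available() < 0) {
          printf("rank %d on %s: NUMA not available\n", rank, host);
      } else {
          /* With --interleave=all I'd expect the interleave mask to
           * cover every configured node; with the default policy the
           * mask is empty. */
          struct bitmask *mask = numa_get_interleave_mask();
          printf("rank %d on %s: interleave mask covers %u of %d nodes\n",
                 rank, host, numa_bitmask_weight(mask),
                 numa_num_configured_nodes());
          numa_bitmask_free(mask);
      }

      MPI_Finalize();
      return 0;
  }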

More details: For the particular image I'm benchmarking with, I have a
multi-threaded application that requires 60 GB of RAM when run on a
single machine.  It allocates one large ping/pong buffer upfront and
reuses it to avoid copies when updating the image at each step.  I'm
running in AWS and comparing performance on an r3.8xlarge (16 CPUs, 244
GB RAM, 10 Gbps) vs. an x1.32xlarge (64 CPUs, 2 TB RAM, 20 Gbps).
Running on a single X1, my application runs ~3x faster than on the R3,
and using numactl --interleave=all has a significant positive effect on
its performance, I assume because the various threads end up accessing
memory spread across the NUMA nodes rather than most of them having slow
remote access to it.  So far so good.
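
If the answer turns out to be libnuma, I'm assuming the change on my side
would look roughly like the sketch below for the big ping/pong buffer
(buffer_bytes and the two helper functions are placeholders for my real
allocation code, not anything from Open MPI):

  #include <numa.h>
  #include <stdlib.h>

  /* Allocate the big ping/pong buffer interleaved across all NUMA nodes
   * instead of letting first-touch place most of it on a single node. */
  void *alloc_ping_pong(size_t buffer_bytes)
  {
      if (numa_available() < 0)      /* no NUMA support: plain malloc */
          return malloc(buffer_bytes);
      return numa_alloc_interleaved(buffer_bytes);
  }

  /* Memory from numa_alloc_interleaved() must go back via numa_free(). */
  void free_ping_pong(void *buf, size_t buffer_bytes)
  {
      if (numa_available() < 0)
          free(buf);
      else
          numa_free(buf, buffer_bytes);
  }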

My application also supports distributing across machines via MPI.  When
doing this, the memory requirement scales linearly with the number of
machines; there are three pinch points that involve large (GBs of data)
all-to-all communication.  I've pipelined the slowest of these three
steps and use MPI_Ialltoallv() to hide as much of the latency as I can.
When run on R3 instances, overall runtime scales very well as machines
are added.  Still so far so good.
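
For reference, the pipelined step has roughly the shape below.  This is
heavily simplified: equal counts, MPI_INT data, a fixed chunk count, and
a no-op process_chunk() stand in for the real image update, but the
overlap structure is the same as in my code:

  #include <mpi.h>
  #include <stdlib.h>

  /* Stand-in for the per-chunk image update in my real code. */
  static void process_chunk(const int *data, int n, int chunk)
  {
      (void)data; (void)n; (void)chunk;
  }

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int nprocs;
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      const int nchunks  = 8;     /* pipeline depth (placeholder) */
      const int per_peer = 1024;  /* ints per peer per chunk (placeholder) */

      int *counts = malloc(nprocs * sizeof(int));
      int *displs = malloc(nprocs * sizeof(int));
      for (int p = 0; p < nprocs; p++) {
          counts[p] = per_peer;
          displs[p] = p * per_peer;
      }

      /* Two buffer sets so the exchange for chunk c+1 can be in flight
       * while chunk c is being processed. */
      int *sendbuf[2], *recvbuf[2];
      for (int b = 0; b < 2; b++) {
          sendbuf[b] = calloc((size_t)nprocs * per_peer, sizeof(int));
          recvbuf[b] = malloc((size_t)nprocs * per_peer * sizeof(int));
      }

      MPI_Request req;
      MPI_Ialltoallv(sendbuf[0], counts, displs, MPI_INT,
                     recvbuf[0], counts, displs, MPI_INT,
                     MPI_COMM_WORLD, &req);

      for (int c = 0; c < nchunks; c++) {
          MPI_Wait(&req, MPI_STATUS_IGNORE);   /* chunk c has arrived */
          if (c + 1 < nchunks) {               /* post the next exchange */
              int nb = (c + 1) % 2;
              MPI_Ialltoallv(sendbuf[nb], counts, displs, MPI_INT,
                             recvbuf[nb], counts, displs, MPI_INT,
                             MPI_COMM_WORLD, &req);
          }
          process_chunk(recvbuf[c % 2], nprocs * per_peer, c);
      }

      for (int b = 0; b < 2; b++) { free(sendbuf[b]); free(recvbuf[b]); }
      free(counts); free(displs);
      MPI_Finalize();
      return 0;
  }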

My problems start with the X1 instances.  I do get scaling as I add more
machines, but it is significantly worse than with the R3s.  This isn't
just a matter of there being more CPUs and the MPI communication time
dominating: the actual time spent in the MPI all-to-all communication is
significantly longer than on the R3s for the same number of machines,
despite the network bandwidth being twice as high.  (In a thread from a
few days ago some folks helped me with MPI settings to improve the
network communication speed, and from toy MPI benchmarks I know I'm
getting faster communication on the X1s than on the R3s.)  So this feels
likely to be a NUMA issue, though I'd be interested in any other
thoughts.

I looked at https://www.open-mpi.org/doc/current/man1/mpirun.1.php but this
didn't seem to have what I was looking for.  I want MPI to let my
application use all CPUs on the system (I'm the only one running on it)...
I just want to control the memory placement.

Thanks for the help.
-Adam