Adam,

Keep in mind that by default, recent Open MPI releases bind MPI tasks
- to cores if -np is 2 or less
- to the NUMA domain otherwise (which is a socket in most cases, unless
  you are running on a Xeon Phi)
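
If you want to double check what binding mpirun actually applies, you can
add --report-bindings and each rank's binding will be printed at startup
(./your_app below is just a placeholder for your own application):
    mpirun --report-bindings -np 4 ./your_app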

So unless you specifically asked mpirun to apply a binding consistent
with your needs, you might simply try asking for no binding at all:
    mpirun --bind-to none ...
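
Alternatively, if you do want mpirun to bind, but at a coarser granularity,
recent Open MPI also accepts socket/NUMA level mapping and binding, along
these lines (just a sketch of the option syntax, not something you
necessarily need here):
    mpirun --map-by socket --bind-to socket ...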

I am not sure whether you can directly ask Open MPI, from the command
line, to do the memory binding you expect.
Anyway, as far as I am concerned,
    mpirun --bind-to none numactl --interleave=all ...
should do what you expect.
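
For example, a full invocation could look like this (the -np value, the
hostfile name, and the application name are placeholders for your own setup):
    mpirun --bind-to none -np 8 --hostfile hosts numactl --interleave=all ./your_app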

If you want to be sure, you can simply run
    mpirun --bind-to none numactl --interleave=all grep Mems_allowed_list /proc/self/status
and that should give you a hint.
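
Another option, assuming the memory policy set by numactl is inherited by
the child process it execs (which it should be), is to have each rank print
its own policy:
    mpirun --bind-to none numactl --interleave=all numactl --show
and check that the output reports an interleave policy.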

Cheers,

Gilles


On Mon, Jul 17, 2017 at 4:19 AM, Adam Sylvester <op8...@gmail.com> wrote:
> I'll start with my question upfront: Is there a way to do the equivalent of
> telling mpirun to do 'numactl --interleave=all' on the processes that it
> runs?  Or if I want to control the memory placement of my applications run
> through MPI will I need to use libnuma for this?  I tried doing "mpirun
> <Open MPI options> numactl --interleave=all <app name and options>".  I
> don't know how to explicitly verify if this ran the numactl command on each
> host or not but based on the performance I'm seeing, it doesn't seem like it
> did (or something else is causing my poor performance).
>
> More details: For the particular image I'm benchmarking with, I have a
> multi-threaded application which requires 60 GB of RAM to run if it's run on
> one machine.  It allocates one large ping/pong buffer upfront and uses this
> to avoid copies when updating the image at each step.  I'm running in AWS
> and comparing performance on an r3.8xlarge (16 CPUs, 244 GB RAM, 10 Gbps)
> vs. an x1.32xlarge (64 CPUs, 2 TB RAM, 20 Gbps).  Running on a single X1, my
> application runs ~3x faster than on the R3; using numactl --interleave=all has
> a significant positive effect on its performance, I assume because the
> various threads that are running are accessing memory spread out across the
> nodes rather than most of them having slow access to it.  So far so good.
>
> My application also supports distributing across machines via MPI.  When
> doing this, the memory requirement scales linearly with the number of
> machines; there are three pinch points that involve large (GBs of data)
> all-to-all communication.  For the slowest of these three, I've pipelined
> this step and use MPI_Ialltoallv() to hide as much of the latency as I can.
> When run on R3 instances, overall runtime scales very well as machines are
> added.  Still so far so good.
>
> My problems start with the X1 instances.  I do get scaling as I add more
> machines, but it is significantly worse than with the R3s.  This isn't just
> a matter of there being more CPUs and the MPI communication time dominating.
> The actual time spent in the MPI all-to-all communication is significantly
> longer than on the R3s for the same number of machines, despite the network
> bandwidth being twice as high (in a post from a few days ago some folks
> helped me with MPI settings to improve the network communication speed).
> From toy benchmark MPI tests I know I'm getting faster communication on the
> X1s than on the R3s, so this feels likely to be an issue with NUMA, though
> I'd be interested in any other thoughts.
>
> I looked at https://www.open-mpi.org/doc/current/man1/mpirun.1.php but this
> didn't seem to have what I was looking for.  I want MPI to let my
> application use all CPUs on the system (I'm the only one running on it)... I
> just want to control the memory placement.
>
> Thanks for the help.
> -Adam
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
