Thank you, Gilles, for the quick response. The code comes from a clustering
application, but let me try to explain the pattern simply. It's a bit
longer than I expected.

The program follows a BSP pattern with a *compute()* phase followed by a
collective *allreduce()*, and it runs many iterations over these two steps.
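
For reference, here is a minimal sketch of that iteration structure. I'm
assuming Open MPI's Java bindings (the mpi package) here, and the class
name, buffer sizes, and iteration count are made up just for illustration:

    import mpi.*;

    public class BspLoop {
        public static void main(String[] args) throws MPIException {
            MPI.Init(args);

            double[] partial = new double[1024]; // local results of compute()
            double[] global  = new double[1024]; // reduced results from all ranks

            for (int iter = 0; iter < 100; iter++) {
                compute(partial); // local BSP compute phase

                // Collective phase: every rank contributes its partial results
                // and receives the element-wise sum. All communication happens
                // on the main thread, as in the cases described below.
                MPI.COMM_WORLD.allReduce(partial, global, partial.length,
                                         MPI.DOUBLE, MPI.SUM);
            }

            MPI.Finalize();
        }

        private static void compute(double[] partial) {
            // placeholder for the clustering computation
        }
    }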

Each process is a Java process with just the main thread. However, in Java
the process and its main thread each get their own PID and appear as two
LWPs in Linux.

Now, let's take two binding scenarios. For simplicity, I'll assume a node
with 2 sockets, each with 4 cores. The real node I ran on has 2 sockets
with 12 cores each.

1. *--map-by ppr:8:node:PE=1 --bind-to core* results in something like the
layout below.

[image: Inline image 3]
where each process is bound to 1 core. The blue dots show the main thread
in Java; it, too, is bound to the same core as its parent process by default.

2. *--map-by ppr:8:node --bind-to none* This is similar to 1, but now the
processes are not bound (or rather, are bound to all cores). However, from
within the program we *explicitly bind each process's main thread to 1
core* (see the sketch after the figure below). It gives something like the
layout below.

[image: Inline image 4]
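
For context, the explicit binding in case 2 can be done along these lines.
This is only a sketch assuming the OpenHFT Java-Thread-Affinity library
(our actual mechanism may differ), and the core id argument is hypothetical:

    import net.openhft.affinity.AffinityLock;

    public class BindMainThread {
        public static void main(String[] args) {
            // Core id chosen externally (e.g. derived from the MPI rank);
            // hypothetical command-line argument for illustration.
            int coreId = Integer.parseInt(args[0]);

            // Pin the calling (main) thread to the given core.
            AffinityLock lock = AffinityLock.acquireLock(coreId);
            try {
                runBspIterations(); // the compute()/allreduce() loop runs here,
                                    // so all communication stays on this core
            } finally {
                lock.release();     // unpin when done
            }
        }

        private static void runBspIterations() {
            // placeholder for the iterative compute/allreduce phases
        }
    }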
The results we got suggest that approach 2 gives better communication
performance than approach 1. The btl used is openib. Here's a graph showing
the variation in timings. It also shows other cases that use more than 1
thread for the computation. In all patterns, communication is done through
the main thread only.

What is peculiar is the two points within the dotted circle. Intuitively
they should overlap, since each Java process has only its main thread and
that thread is bound to 1 core. The only difference is how the parent
process is bound by MPI. The red line is *Case 1* above and the blue line
is *Case 2*.

The green line is when both the parent process and the threads are unbound.


[image: Inline image 6]

On Thu, Jun 23, 2016 at 12:36 AM, Gilles Gouaillardet <gil...@rist.or.jp>
wrote:

> Can you please provide more details on your config, how the tests are
> performed, and the results?
>
>
> to be fair, you should only compare cases in which mpi tasks are bound to
> the same sockets.
>
> for example, if socket0 has core[0-7] and socket1 has core[8-15]
>
> it is fair to compare {task0,task1} bound on
>
> {0,8}, {[0-1],[8-9]}, {[0-7],[8-15]}
>
> but it is unfair to compare
>
> {0,1} and {0,8} or {[0-7],[8-15]}
>
> since {0,1} does not involve traffic on the QPI, but {0,8} does.
> depending on the btl you are using, it might or might not involve another
> "helper" thread.
> if your task is bound on one core, and assuming there is no SMT, then the
> task and the helper do time sharing.
> but if the task is bound on more than one core, then the task and the
> helper run in parallel.
>
>
> Cheers,
>
> Gilles
>
> On 6/23/2016 1:21 PM, Saliya Ekanayake wrote:
>
> Hi,
>
> I am trying to understand this peculiar behavior where the communication
> time in Open MPI changes depending on the number of processing elements
> (cores) the process is bound to.
>
> Is this expected?
>
> Thank you,
> saliya
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
>
>
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
