Re: [OMPI users] Communicator Split Type NUMA Behavior

2019-11-27 Thread Brice Goglin via users
The attached patch (against 4.0.2) should fix it, I'll prepare a PR to
fix this upstream.

Brice


Le 27/11/2019 à 00:41, Brice Goglin via users a écrit :
> It looks like NUMA is broken, while others such as SOCKET and L3CACHE
> work fine. A quick look in opal_hwloc_base_get_relative_locality() and
> friends tells me that those functions were not properly updated for the
> hwloc 2.0 NUMA changes. I'll try to understand what's going on tomorrow.
>
> Rebuilding OMPI with an external hwloc 1.11.x might avoid the issue in
> the meantime.
>
> Beware that splitting on NUMA might become meaningless on some platforms
> in the future (there are already some x86 platforms where some NUMA
> nodes are attached to the Packages while others are attached to each
> half of the same Packages).
>
> Brice
>
>
> Le 26/11/2019 à 23:12, Hatem Elshazly via users a écrit :
>> Hello,
>>
>>
>> I'm trying to split the world communicator by NUMA using
>> MPI_Comm_split_type. I expected to get as many sub-communicators as
>> there are NUMA nodes, but what I get is as many sub-communicators as
>> there are MPI processes, each containing one process.
>>
>>
>> Attached is a reproducer code. I tried it using version 4.0.2 built
>> with GCC 9.2.0 on Skylake and Haswell machines; both behave
>> similarly.
>>
>>
>> Can anyone point out why it behaves like that? Is this expected,
>> or am I confusing something?
>>
>>
>> Thanks in advance,
>>
>> Hatem
>>
>> Junior Researcher -- Barcelona Supercomputing Center (BSC)
>>
>>
>>
>> http://bsc.es/disclaimer
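[A minimal standalone reproducer along the lines Hatem describes might look like the sketch below; variable names and the print format are illustrative. Note that OMPI_COMM_TYPE_NUMA is an Open MPI-specific split type — the MPI standard itself only defines MPI_COMM_TYPE_SHARED.]

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm numacomm;
    int wrank, nrank, nsize;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    /* Open MPI extension: split the world communicator by NUMA node */
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_NUMA, 0,
                        MPI_INFO_NULL, &numacomm);

    MPI_Comm_rank(numacomm, &nrank);
    MPI_Comm_size(numacomm, &nsize);
    /* With the reported bug, nsize is always 1 instead of the number
     * of processes sharing the NUMA node */
    printf("world rank %d: NUMA-local rank %d of %d\n", wrank, nrank, nsize);

    MPI_Comm_free(&numacomm);
    MPI_Finalize();
    return 0;
}
```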
diff --git a/opal/mca/hwloc/base/hwloc_base_util.c b/opal/mca/hwloc/base/hwloc_base_util.c
index ba26ba0ac6..daf2fa2064 100644
--- a/opal/mca/hwloc/base/hwloc_base_util.c
+++ b/opal/mca/hwloc/base/hwloc_base_util.c
@@ -1215,16 +1215,84 @@ int opal_hwloc_base_cpu_list_parse(const char *slot_str,
     return OPAL_SUCCESS;
 }
 
+static void opal_hwloc_base_get_relative_locality_by_depth(hwloc_topology_t topo, unsigned d,
+                                                           hwloc_cpuset_t loc1, hwloc_cpuset_t loc2,
+                                                           opal_hwloc_locality_t *locality, bool *shared)
+{
+    unsigned width, w;
+    hwloc_obj_t obj;
+    int sect1, sect2;
+
+    /* get the width of the topology at this depth */
+    width = hwloc_get_nbobjs_by_depth(topo, d);
+
+    /* scan all objects at this depth to see if
+     * our locations overlap with them
+     */
+    for (w=0; w < width; w++) {
+        /* get the object at this depth/index */
+        obj = hwloc_get_obj_by_depth(topo, d, w);
+        /* see if our locations intersect with the cpuset for this obj */
+        sect1 = hwloc_bitmap_intersects(obj->cpuset, loc1);
+        sect2 = hwloc_bitmap_intersects(obj->cpuset, loc2);
+        /* if both intersect, then we share this level */
+        if (sect1 && sect2) {
+            *shared = true;
+            switch(obj->type) {
+            case HWLOC_OBJ_NODE:
+                *locality |= OPAL_PROC_ON_NUMA;
+                break;
+            case HWLOC_OBJ_SOCKET:
+                *locality |= OPAL_PROC_ON_SOCKET;
+                break;
+#if HWLOC_API_VERSION < 0x20000
+            case HWLOC_OBJ_CACHE:
+                if (3 == obj->attr->cache.depth) {
+                    *locality |= OPAL_PROC_ON_L3CACHE;
+                } else if (2 == obj->attr->cache.depth) {
+                    *locality |= OPAL_PROC_ON_L2CACHE;
+                } else {
+                    *locality |= OPAL_PROC_ON_L1CACHE;
+                }
+                break;
+#else
+            case HWLOC_OBJ_L3CACHE:
+                *locality |= OPAL_PROC_ON_L3CACHE;
+                break;
+            case HWLOC_OBJ_L2CACHE:
+                *locality |= OPAL_PROC_ON_L2CACHE;
+                break;
+            case HWLOC_OBJ_L1CACHE:
+                *locality |= OPAL_PROC_ON_L1CACHE;
+                break;
+#endif
+            case HWLOC_OBJ_CORE:
+                *locality |= OPAL_PROC_ON_CORE;
+                break;
+            case HWLOC_OBJ_PU:
+                *locality |= OPAL_PROC_ON_HWTHREAD;
+                break;
+            default:
+                /* just ignore it */
+                break;
+            }
+            break;
+        }
+        /* otherwise, we don't share this
+         * object - but we still might share another object
+         * on this level, so we have to keep searching
+         */
+    }
+}
+
 opal_hwloc_locality_t opal_hwloc_base_get_relative_locality(hwloc_topology_t topo,
                                                             char *cpuset1, char *cpuset2)
 {
     opal_hwloc_locality_t locality;
-    hwloc_obj_t obj;
-    unsigned depth, d, width, w;
+    hwloc_cpuset_t loc1, loc2;
+    unsigned depth, d;
     bool shared;
     hwloc_obj_type_t type;
-    int sect1, sect2;
-    hwloc_cpuset_t loc1, loc2;
 
     /* start with what we know - they share a node on a cluster
      * NOTE: we may alter that latter part as hwloc
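[For context on why the pre-2.0 code breaks: in hwloc 2.x, NUMA nodes are no longer ordinary levels of the CPU hierarchy but hang off the tree as memory objects at a special virtual depth, so code that only scans the normal levels never sees them. A hedged standalone sketch of enumerating NUMA nodes with the hwloc 2.x API:]

```c
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    int depth, n, i;
    char s[128];

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* In hwloc 2.x this returns the virtual depth
     * HWLOC_TYPE_DEPTH_NUMANODE (a negative value), which the
     * by-depth accessors understand */
    depth = hwloc_get_type_depth(topo, HWLOC_OBJ_NUMANODE);
    n = hwloc_get_nbobjs_by_depth(topo, depth);
    for (i = 0; i < n; i++) {
        hwloc_obj_t node = hwloc_get_obj_by_depth(topo, depth, i);
        hwloc_bitmap_snprintf(s, sizeof(s), node->cpuset);
        printf("NUMA node %u: cpuset %s\n", node->os_index, s);
    }

    hwloc_topology_destroy(topo);
    return 0;
}
```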

[OMPI users] CUDA mpi question

2019-11-27 Thread Zhang, Junchao via users
Hi,
  Suppose I have this piece of code and I use cuda-aware MPI,
  cudaMalloc(&sbuf,sz);
   Kernel1<<<...,stream>>>(...,sbuf);
   MPI_Isend(sbuf,...);
   Kernel2<<<...,stream>>>();

  Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to make 
sure data in sbuf is ready to send?  If not, why?

  Thank you.

--Junchao Zhang
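[For the record, the portable safe pattern is a sketch like the following; the launch configuration and the count/dest/tag/req arguments are assumed declarations not present in the original snippet.]

```c
cudaMalloc(&sbuf, sz);

Kernel1<<<grid, block, 0, stream>>>(sbuf);

/* MPI knows nothing about CUDA stream ordering: without this sync,
 * MPI_Isend may read sbuf before Kernel1 has finished writing it */
cudaStreamSynchronize(stream);

MPI_Isend(sbuf, count, MPI_FLOAT, dest, tag, MPI_COMM_WORLD, &req);

Kernel2<<<grid, block, 0, stream>>>();

/* sbuf also must not be reused until the nonblocking send completes */
MPI_Wait(&req, MPI_STATUS_IGNORE);
```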


Re: [OMPI users] CUDA mpi question

2019-11-27 Thread George Bosilca via users
Short and portable answer: you need to sync before the Isend or you will
send garbage data.

Assuming you are willing to go for a less portable solution you can get the
OMPI streams and add your kernels inside, so that the sequential order will
guarantee correctness of your isend. We have 2 hidden CUDA streams in OMPI,
one for device-to-host and one for host-to-device, that can be queried with
the non-MPI standard compliant functions (mca_common_cuda_get_dtoh_stream
and mca_common_cuda_get_htod_stream).
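[A hedged sketch of this non-portable approach. These are internal Open MPI symbols with no public header, so the prototype below is declared by hand and its exact return type is an assumption.]

```c
/* Internal Open MPI symbol; prototype assumed, not part of any public API */
extern void *mca_common_cuda_get_dtoh_stream(void);

cudaStream_t ompi_stream = (cudaStream_t) mca_common_cuda_get_dtoh_stream();

/* Launching the producer kernel on OMPI's own device-to-host stream
 * orders the library's subsequent copies after the kernel */
Kernel1<<<grid, block, 0, ompi_stream>>>(sbuf);
MPI_Isend(sbuf, count, MPI_FLOAT, dest, tag, MPI_COMM_WORLD, &req);
```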

George.


On Wed, Nov 27, 2019 at 4:02 PM Zhang, Junchao via users <
users@lists.open-mpi.org> wrote:

> Hi,
>   Suppose I have this piece of code and I use cuda-aware MPI,
>   cudaMalloc(&sbuf,sz);
>
>Kernel1<<<...,stream>>>(...,sbuf);
>MPI_Isend(sbuf,...);
>Kernel2<<<...,stream>>>();
>
>
>   Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to
> make sure data in sbuf is ready to send?  If not, why?
>
>   Thank you.
>
> --Junchao Zhang
>


Re: [OMPI users] CUDA mpi question

2019-11-27 Thread Zhang, Junchao via users


On Wed, Nov 27, 2019 at 3:16 PM George Bosilca <bosi...@icl.utk.edu> wrote:
Short and portable answer: you need to sync before the Isend or you will send 
garbage data.
Ideally, I want to formulate my code into a series of asynchronous "kernel 
launch, kernel launch, ..." without synchronization, so that I can hide kernel 
launch overhead. It now seems I have to sync before MPI calls (even nonblocking 
calls).


Assuming you are willing to go for a less portable solution you can get the 
OMPI streams and add your kernels inside, so that the sequential order will 
guarantee correctness of your isend. We have 2 hidden CUDA streams in OMPI, one 
for device-to-host and one for host-to-device, that can be queried with the 
non-MPI standard compliant functions (mca_common_cuda_get_dtoh_stream and 
mca_common_cuda_get_htod_stream).

Which streams (dtoh or htod) should I use to insert kernels producing send data 
and kernels using received data? I imagine MPI uses GPUDirect RDMA to move data 
directly from GPU to NIC. Why do we need to bother with the dtoh or htod streams?

George.


On Wed, Nov 27, 2019 at 4:02 PM Zhang, Junchao via users <users@lists.open-mpi.org> wrote:
Hi,
  Suppose I have this piece of code and I use cuda-aware MPI,
  cudaMalloc(&sbuf,sz);
   Kernel1<<<...,stream>>>(...,sbuf);
   MPI_Isend(sbuf,...);
   Kernel2<<<...,stream>>>();

  Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to make 
sure data in sbuf is ready to send?  If not, why?

  Thank you.

--Junchao Zhang


Re: [OMPI users] CUDA mpi question

2019-11-27 Thread George Bosilca via users
On Wed, Nov 27, 2019 at 5:02 PM Zhang, Junchao  wrote:

> On Wed, Nov 27, 2019 at 3:16 PM George Bosilca 
> wrote:
>
>> Short and portable answer: you need to sync before the Isend or you will
>> send garbage data.
>>
> Ideally, I want to formulate my code into a series of asynchronous "kernel
> launch, kernel launch, ..." without synchronization, so that I can hide
> kernel launch overhead. It now seems I have to sync before MPI calls (even
> nonblocking calls)
>

Then you need a means to ensure sequential execution, and this is what the
streams provide. Unfortunately, I looked into the code and I'm afraid there
is currently no realistic way to do what you need. My previous comment was
based on an older code, that seems to be 1) unmaintained currently, and 2)
only applicable to the OB1 PML + OpenIB BTL combo. As recent versions of
OMPI have moved away from the OpenIB BTL, relying more heavily on UCX for
Infiniband support, the old code is now deprecated. Sorry for giving you
hope on this.

Maybe you can delegate the MPI call into a CUDA event callback ?

  George.
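[A hedged sketch of that idea; all names here are illustrative. Whether MPI_Isend is legal from inside a CUDA host callback is exactly the open question, so the callback below only signals a worker thread, which then performs the send outside CUDA's callback context.]

```c
#include <cuda_runtime.h>
#include <mpi.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int kernel_done = 0;

/* Host function enqueued on the stream after Kernel1.  It makes no CUDA
 * (and no MPI) calls itself, as the cudaLaunchHostFunc documentation
 * requires; it only wakes the worker thread. */
static void CUDART_CB notify(void *arg)
{
    pthread_mutex_lock(&lock);
    kernel_done = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

/* Producer side, after launching Kernel1 on `stream`:
 *     cudaLaunchHostFunc(stream, notify, NULL);
 *
 * Worker thread (needs MPI_THREAD_MULTIPLE if other threads also make
 * MPI calls):
 *     pthread_mutex_lock(&lock);
 *     while (!kernel_done) pthread_cond_wait(&cond, &lock);
 *     pthread_mutex_unlock(&lock);
 *     MPI_Isend(sbuf, ...);
 */
```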



>
>
>>
>> Assuming you are willing to go for a less portable solution you can get
>> the OMPI streams and add your kernels inside, so that the sequential order
>> will guarantee correctness of your isend. We have 2 hidden CUDA streams in
>> OMPI, one for device-to-host and one for host-to-device, that can be
>> queried with the non-MPI standard compliant functions
>> (mca_common_cuda_get_dtoh_stream and mca_common_cuda_get_htod_stream).
>>
>> Which streams (dtoh or htod) should I use to insert kernels producing
> send data and kernels using received data? I imagine MPI uses GPUDirect
> RDMA to move data directly from GPU to NIC. Why do we need to bother dtoh
> or htod streams?
>

>
>> George.
>>
>>
>> On Wed, Nov 27, 2019 at 4:02 PM Zhang, Junchao via users <
>> users@lists.open-mpi.org> wrote:
>>
>>> Hi,
>>>   Suppose I have this piece of code and I use cuda-aware MPI,
>>>   cudaMalloc(&sbuf,sz);
>>>
>>>Kernel1<<<...,stream>>>(...,sbuf);
>>>MPI_Isend(sbuf,...);
>>>Kernel2<<<...,stream>>>();
>>>
>>>
>>>   Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to
>>> make sure data in sbuf is ready to send?  If not, why?
>>>
>>>   Thank you.
>>>
>>> --Junchao Zhang
>>>
>>


Re: [OMPI users] CUDA mpi question

2019-11-27 Thread Zhang, Junchao via users
I was pointed to "2.7. Synchronization and Memory Ordering" of 
https://docs.nvidia.com/pdf/GPUDirect_RDMA.pdf. It is on topic, but 
unfortunately it is too short and I could not understand it.
I also checked cudaStreamAddCallback/cudaLaunchHostFunc, whose documentation 
says the host function "must not make any CUDA API calls". I am not sure 
whether MPI_Isend qualifies as such a function.
--Junchao Zhang


On Wed, Nov 27, 2019 at 4:18 PM George Bosilca <bosi...@icl.utk.edu> wrote:
On Wed, Nov 27, 2019 at 5:02 PM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
On Wed, Nov 27, 2019 at 3:16 PM George Bosilca <bosi...@icl.utk.edu> wrote:
Short and portable answer: you need to sync before the Isend or you will send 
garbage data.
Ideally, I want to formulate my code into a series of asynchronous "kernel 
launch, kernel launch, ..." without synchronization, so that I can hide kernel 
launch overhead. It now seems I have to sync before MPI calls (even nonblocking 
calls)

Then you need a means to ensure sequential execution, and this is what the 
streams provide. Unfortunately, I looked into the code and I'm afraid there is 
currently no realistic way to do what you need. My previous comment was based 
on an older code, that seems to be 1) unmaintained currently, and 2) only 
applicable to the OB1 PML + OpenIB BTL combo. As recent versions of OMPI have 
moved away from the OpenIB BTL, relying more heavily on UCX for Infiniband 
support, the old code is now deprecated. Sorry for giving you hope on this.

Maybe you can delegate the MPI call into a CUDA event callback ?

  George.




Assuming you are willing to go for a less portable solution you can get the 
OMPI streams and add your kernels inside, so that the sequential order will 
guarantee correctness of your isend. We have 2 hidden CUDA streams in OMPI, one 
for device-to-host and one for host-to-device, that can be queried with the 
non-MPI standard compliant functions (mca_common_cuda_get_dtoh_stream and 
mca_common_cuda_get_htod_stream).

Which streams (dtoh or htod) should I use to insert kernels producing send data 
and kernels using received data? I imagine MPI uses GPUDirect RDMA to move data 
directly from GPU to NIC. Why do we need to bother dtoh or htod streams?

George.


On Wed, Nov 27, 2019 at 4:02 PM Zhang, Junchao via users <users@lists.open-mpi.org> wrote:
Hi,
  Suppose I have this piece of code and I use cuda-aware MPI,
  cudaMalloc(&sbuf,sz);
   Kernel1<<<...,stream>>>(...,sbuf);
   MPI_Isend(sbuf,...);
   Kernel2<<<...,stream>>>();

  Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to make 
sure data in sbuf is ready to send?  If not, why?

  Thank you.

--Junchao Zhang


Re: [OMPI users] CUDA mpi question

2019-11-27 Thread Zhang, Junchao via users
Interesting idea, but MPI_THREAD_MULTIPLE has other side effects. If MPI 
nonblocking calls could take an extra stream argument and work like a kernel 
launch, that would be wonderful.
--Junchao Zhang
--Junchao Zhang


On Wed, Nov 27, 2019 at 6:12 PM Joshua Ladd <josh...@mellanox.com> wrote:
Why not spawn num_threads threads, where num_threads is the number of kernels 
to launch, and compile with the “--default-stream per-thread” option?

Then you could use MPI in thread-multiple mode to achieve your objective.

Something like:



void *launch_kernel(void *dummy)
{
    float *data;
    cudaMalloc(&data, N * sizeof(float));

    /* launch configuration was elided in the original message */
    kernel<<<grid, block>>>(data, N);

    /* per-thread default stream: stream 0 here is private to this thread */
    cudaStreamSynchronize(0);

    MPI_Isend(data, ...);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    const int num_threads = 8;

    pthread_t threads[num_threads];

    for (int i = 0; i < num_threads; i++) {
        if (pthread_create(&threads[i], NULL, launch_kernel, 0)) {
            fprintf(stderr, "Error creating thread\n");
            return 1;
        }
    }

    for (int i = 0; i < num_threads; i++) {
        if (pthread_join(threads[i], NULL)) {
            fprintf(stderr, "Error joining thread\n");
            return 2;
        }
    }
    cudaDeviceReset();

    MPI_Finalize();
    return 0;
}




From: users <users-boun...@lists.open-mpi.org> On Behalf Of Zhang, Junchao via users
Sent: Wednesday, November 27, 2019 5:43 PM
To: George Bosilca <bosi...@icl.utk.edu>
Cc: Zhang, Junchao <jczh...@mcs.anl.gov>; Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] CUDA mpi question

I was pointed to "2.7. Synchronization and Memory Ordering" of  
https://docs.nvidia.com/pdf/GPUDirect_RDMA.pdf.
 It is on topic. But unfortunately it is too short and I could not understand 
it.
I also checked cudaStreamAddCallback/cudaLaunchHostFunc, which say the host 
function "must not make any CUDA API calls". I am not sure if MPI_Isend 
qualifies as such functions.
--Junchao Zhang


On Wed, Nov 27, 2019 at 4:18 PM George Bosilca <bosi...@icl.utk.edu> wrote:
On Wed, Nov 27, 2019 at 5:02 PM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
On Wed, Nov 27, 2019 at 3:16 PM George Bosilca <bosi...@icl.utk.edu> wrote:
Short and portable answer: you need to sync before the Isend or you will send 
garbage data.
Ideally, I want to formulate my code into a series of asynchronous "kernel 
launch, kernel launch, ..." without synchronization, so that I can hide kernel 
launch overhead. It now seems I have to sync before MPI calls (even nonblocking 
calls)

Then you need a means to ensure sequential execution, and this is what the 
streams provide. Unfortunately, I looked into the code and I'm afraid there is 
currently no realistic way to do what you need. My previous comment was based 
on an older code, that seems to be 1) unmaintained currently, and 2) only 
applicable to the OB1 PML + OpenIB BTL combo. As recent versions of OMPI have 
moved away from the OpenIB BTL, relying more heavily on UCX for Infiniband 
support, the old code is now deprecated. Sorry for giving you hope on this.

Maybe you can delegate the MPI call into a CUDA event callback ?

  George.




Assuming you are willing to go for a less portable solution you can get the 
OMPI streams and add your kernels inside, so that the sequential order will 
guarantee correctness of your isend. We have 2 hidden CUDA streams in OMPI, one 
for device-to-host and one for host-to-device, that can be queried with the 
non-MPI standard compliant functions (mca_common_cuda_get_dtoh_stream and 
mca_common_cuda_get_htod_stream).

Which streams (dtoh or htod) should I use to insert kernels producing send data 
and kernels using received data? I imagine MPI uses GPUDirect RDMA to move data 
directly from GPU to NIC. Why do we need to bother dtoh or htod streams?

George.


On Wed, Nov 27, 2019 at 4:02 PM Zhang, Junchao via users <users@lists.open-mpi.org> wrote:
Hi,
  Suppose I have this piece of code and I use cuda-aware MPI,
  cudaMalloc(&sbuf,sz);
   Kernel1<<<...,stream>>>(...,sbuf);
   MPI_Isend(sbuf,...);
   Kernel2<<<...,stream>>>();

  Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to make 
sure data in sbuf is ready to send?  If not, why?

  Thank you.

--Junchao Zhang


[OMPI users] speed of model is slow with openmpi

2019-11-27 Thread Mahesh Shinde via users
Hi,

I am running a physics based boundary layer model with parallel code which
uses openmpi libraries. I installed openmpi. I am running it on general
purpose Azure machine with 8 cores and 32GB RAM. I compiled the code with
*gfortran -O3 -fopenmp -o abc.exe abc.f* and then ran *mpirun -np 8 ./abc.exe*,
but I found slow speed with 4 and 8 cores. I also tried the trial version of
Intel Parallel Studio, but there was no improvement in speed.

Why is this happening? Does the code not properly utilize MPI? Does it need
an HPC machine on Azure? Should it be compiled with Intel ifort?

your suggestions/comments are welcome.

Thanks and regards.
Mahesh


Re: [OMPI users] speed of model is slow with openmpi

2019-11-27 Thread Gilles Gouaillardet via users
Your gfortran command line strongly suggests your program is serial and 
does not use MPI at all.


Consequently, mpirun will simply spawn 8 identical instances of the very 
same program, and no speedup should be expected


(but you can expect some slowdown and/or file corruption).


If you observe similar behaviour with Open MPI and Intel MPI, then this 
is very unlikely to be an Open MPI issue,


and this mailing list is not the right place to discuss a general 
MPI/parallelization performance issue.



Cheers,


Gilles

On 11/28/2019 1:54 PM, Mahesh Shinde via users wrote:

Hi,

I am running a physics based boundary layer model with parallel code 
which uses openmpi libraries. I installed openmpi. I am running it on 
general purpose Azure machine with 8 cores, 32GB RAM. I compiled the 
code with /*gfortran -O3 -fopenmp -o abc.exe abc.f*/ and then /*mpirun 
-np 8 ./abc.exe*/ But i found slow speed with 4 and 8 cores. I also 
tried it with trial version of intel parallel studio suite, but  no 
improvement in the speed.


why this is happening? is the code not properly utilize mpi? does it 
need HPC machine on Azure? Does it compiled with intel ifort?


your suggestions/comments are welcome.

Thanks and regards.
Mahesh