-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On Behalf Of Zbigniew Koza
Sent: Tuesday, July 31, 2012 12:38 PM
To: us...@open-mpi.org
Subject: [OMPI users] bug in CUDA support for dual-processor systems?
Hi,
I wrote a simple program to see whether Open MPI can really handle CUDA device pointers,
as promised in the FAQ, and how efficiently it does so.
The program (see below) breaks when MPI communication is performed
between two devices that are on the same node but under different IOHs in a
dual-processor Intel machine.
Note that cudaMemcpy works for such device pairs, although not as efficiently as
for devices on the same IOH with GPUDirect enabled.
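As a side note, which device pairs can use GPUDirect peer access (i.e. which sit under
the same IOH on this machine) can be queried with cudaDeviceCanAccessPeer(); here is a
minimal sketch, independent of the MPI test below:

#include <stdio.h>
#include <cuda_runtime.h>

/* Print which device pairs report peer-to-peer (GPUDirect) access;
   pairs that sit under different IOHs should report "no". */
int main(void)
{
    int i, j, n, ok;
    cudaGetDeviceCount(&n);
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            if (i == j) continue;
            ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);
            printf("peer access %d -> %d: %s\n", i, j, ok ? "yes" : "no");
        }
    return 0;
}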
Here's the output from my program:
===============================
mpirun -n 6 ./a.out
Init
Init
Init
Init
Init
Init
rank: 1, size: 6
rank: 2, size: 6
rank: 3, size: 6
rank: 4, size: 6
rank: 5, size: 6
rank: 0, size: 6
device 3 is set
Process 3 is on typhoon1
Using regular memory
device 0 is set
Process 0 is on typhoon1
Using regular memory
device 4 is set
Process 4 is on typhoon1
Using regular memory
device 1 is set
Process 1 is on typhoon1
Using regular memory
device 5 is set
Process 5 is on typhoon1
Using regular memory
device 2 is set
Process 2 is on typhoon1
Using regular memory
^C
zkoza@typhoon1:~/multigpu$ vim cudamussings.c
zkoza@typhoon1:~/multigpu$ mpicc cudamussings.c -lcuda -lcudart -L/usr/local/cuda/lib64 -I/usr/local/cuda/include
zkoza@typhoon1:~/multigpu$ mpirun -n 6 ./a.out
Process 1 of 6 is on typhoon1
Process 2 of 6 is on typhoon1
Process 0 of 6 is on typhoon1
Process 4 of 6 is on typhoon1
Process 5 of 6 is on typhoon1
Process 3 of 6 is on typhoon1
device 2 is set
device 1 is set
device 0 is set
Using regular memory
device 5 is set
device 3 is set
device 4 is set
Host->device bandwidth for processor 1: 1587.993499 MB/sec
Host->device bandwidth for processor 2: 1570.275316 MB/sec
Host->device bandwidth for processor 3: 1569.890751 MB/sec
Host->device bandwidth for processor 5: 1483.637702 MB/sec
Host->device bandwidth for processor 0: 1480.888029 MB/sec
Host->device bandwidth for processor 4: 1476.241371 MB/sec
MPI_Send/MPI_Receive, Host [0] -> Host [1] bandwidth: 3338.57 MB/sec
MPI_Send/MPI_Receive, Device[0] -> Host [1] bandwidth: 420.85 MB/sec
MPI_Send/MPI_Receive, Host [0] -> Device[1] bandwidth: 362.13 MB/sec
MPI_Send/MPI_Receive, Device[0] -> Device[1] bandwidth: 6552.35 MB/sec
MPI_Send/MPI_Receive, Host [0] -> Host [2] bandwidth: 3238.88 MB/sec
MPI_Send/MPI_Receive, Device[0] -> Host [2] bandwidth: 418.18 MB/sec
MPI_Send/MPI_Receive, Host [0] -> Device[2] bandwidth: 362.06 MB/sec
MPI_Send/MPI_Receive, Device[0] -> Device[2] bandwidth: 5022.82 MB/sec
MPI_Send/MPI_Receive, Host [0] -> Host [3] bandwidth: 3295.32 MB/sec
MPI_Send/MPI_Receive, Device[0] -> Host [3] bandwidth: 418.90 MB/sec
MPI_Send/MPI_Receive, Host [0] -> Device[3] bandwidth: 359.16 MB/sec
MPI_Send/MPI_Receive, Device[0] -> Device[3] bandwidth: 5019.89 MB/sec
MPI_Send/MPI_Receive, Host [0] -> Host [4] bandwidth: 4619.55 MB/sec
MPI_Send/MPI_Receive, Device[0] -> Host [4] bandwidth: 419.24 MB/sec
MPI_Send/MPI_Receive, Host [0] -> Device[4] bandwidth: 364.52 MB/sec
--------------------------------------------------------------------------
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
will cause the program to abort.
cuIpcOpenMemHandle return value: 205
address: 0x200200000
Check the cuda.h file for what the return value means. Perhaps a reboot of
the node will clear the problem.
--------------------------------------------------------------------------
[typhoon1:06098] Failed to register remote memory, rc=-1
[typhoon1:06098] [[33788,1],4] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 465
========================================================
Comment:
My machine has two six-core Intel processors with HT on, yielding
24 virtual processors, and 6 Tesla C2070s.
The devices are grouped in two groups, one with 4 devices and the other with 2.
Devices in the same group can talk to each other via GPUDirect at approx. 6 GB/s;
devices in different groups can use cudaMemcpy and UVA at somewhat lower
transfer rates.
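For completeness, a direct copy between devices from different groups can still be
requested with cudaMemcpyPeer(), which falls back to staging through host memory when
peer access is unavailable; a rough sketch (devices 0 and 4 are just an example of a
cross-group pair here, and the buffer size matches NBYTES from my program):

#include <stdio.h>
#include <cuda_runtime.h>

#define N (100*1000*1000)

/* Copy a buffer from device 0 (one group) to device 4 (the other group)
   with cudaMemcpyPeer(); this works even when cudaDeviceCanAccessPeer()
   says no, because the copy is then staged through host memory. */
int main(void)
{
    void *src, *dst;

    cudaSetDevice(0);
    cudaMalloc(&src, N);
    cudaSetDevice(4);
    cudaMalloc(&dst, N);

    cudaMemcpyPeer(dst, 4, src, 0, N);   /* dst on device 4, src on device 0 */
    cudaDeviceSynchronize();
    printf("cudaMemcpyPeer: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}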
My Open MPI is openmpi-1.9a1r26904, compiled from sources with:
./configure -prefix=/home/zkoza/openmpi.1.9.cuda
--with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/lib
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Thu_Apr__5_00:24:31_PDT_2012
Cuda compilation tools, release 4.2, V0.2.1221
gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
Ubuntu 12.04 64-bit
Nvidia Driver Version: 295.41
The program was compiled with:
mpicc prog.c -lcuda -lcudart -L/usr/local/cuda/lib64 -I/usr/local/cuda/include
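Regarding the cuIpcOpenMemHandle return value: cuGetErrorString() does not exist in this
CUDA release, but the number can be matched against the CUresult enum in cuda.h; if I
read cuda.h correctly, 205 is CUDA_ERROR_MAP_FAILED. A tiny helper along these lines
(only a few of the codes are listed):

#include <stdio.h>
#include <cuda.h>

/* Map a raw CUresult (like the 205 above) to its name from cuda.h;
   only a few codes relevant to cuIpcOpenMemHandle are listed. */
static const char *cu_err_name(CUresult r)
{
    switch (r) {
    case CUDA_ERROR_INVALID_VALUE:   return "CUDA_ERROR_INVALID_VALUE";
    case CUDA_ERROR_INVALID_CONTEXT: return "CUDA_ERROR_INVALID_CONTEXT";
    case CUDA_ERROR_MAP_FAILED:      return "CUDA_ERROR_MAP_FAILED";
    case CUDA_ERROR_ALREADY_MAPPED:  return "CUDA_ERROR_ALREADY_MAPPED";
    default:                         return "other (see cuda.h)";
    }
}

int main(void)
{
    printf("205 -> %s\n", cu_err_name((CUresult)205));
    return 0;
}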
================================================
SOURCE CODE:
================================================
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <sys/time.h>
#include <mpi.h>
#define NREPEAT 20
#define NBYTES 100000000
#define CALL(x)\
{\
    cudaError_t err = x;\
    if (cudaSuccess != err)\
    {\
        printf("CUDA ERROR %s at %d\n", cudaGetErrorString(err), __LINE__);\
        cudaGetLastError(); /* clear the error state */ \
    }\
}

int main(int argc, char *argv[])
{
    int rank, size, n, len;
    void *a_h, *a_d;
    struct timeval time[2];
    double bandwidth;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    printf("Process %d of %d is on %s\n", rank, size, name);
    fflush(stdout);

    CALL( cudaSetDevice(rank) );
    printf("device %d is set\n", rank);
    fflush(stdout);

#ifdef PINNED
    if (rank == 0)
        printf("Using pinned memory \n");
    CALL( cudaMallocHost((void **) &a_h, NBYTES) );
#else
    if (rank == 0)
        printf("Using regular memory \n");
    a_h = malloc(NBYTES);
#endif
    CALL( cudaMalloc((void **) &a_d, NBYTES) );

    /* Host->device bandwidth, measured independently by each process. */
    MPI_Barrier(MPI_COMM_WORLD);
    gettimeofday(&time[0], NULL);
    for (n = 0; n < NREPEAT; n++)
    {
        CALL( cudaMemcpy(a_d, a_h, NBYTES, cudaMemcpyHostToDevice) );
    }
    gettimeofday(&time[1], NULL);
    bandwidth  = time[1].tv_sec - time[0].tv_sec;
    bandwidth += 1.e-6*(time[1].tv_usec - time[0].tv_usec);
    bandwidth  = (double)NBYTES*NREPEAT/1.e6/bandwidth;
    printf("Host->device bandwidth for processor %d: %f MB/sec\n", rank, bandwidth);

    /* Test MPI send/recv bandwidth: rank 0 sends to every other rank,
       using host and device buffers in all four combinations. */
    MPI_Barrier(MPI_COMM_WORLD);
    int i, proc;
    for (proc = 1; proc < size; proc++)
    {
        for (i = 0; i < 4; i++)
        {
            const int from_host = (i & 1) == 0;  /* even i: send from the host buffer */
            const int to_host   = (i & 2) == 0;  /* i < 2: receive into the host buffer */
            const char *tab[2] = {"Device", "Host "};
            void *ptr[2] = {a_d, a_h};
            MPI_Barrier(MPI_COMM_WORLD);
            gettimeofday(&time[0], NULL);
            for (n = 0; n < NREPEAT; n++)
            {
                if (rank == 0)
                    MPI_Send(ptr[from_host], NBYTES/sizeof(int), MPI_INT,
                             proc, 0, MPI_COMM_WORLD);
                else if (rank == proc)
                    MPI_Recv(ptr[to_host], NBYTES/sizeof(int), MPI_INT,
                             0, 0, MPI_COMM_WORLD, &status);
            }
            gettimeofday(&time[1], NULL);
            bandwidth  = time[1].tv_sec - time[0].tv_sec;
            bandwidth += 1.e-6*(time[1].tv_usec - time[0].tv_usec);
            bandwidth  = NBYTES*NREPEAT/1.e6/bandwidth;
            if (rank == 0)
            {
                printf("MPI_Send/MPI_Receive, %s[%d] -> %s[%d] bandwidth: %4.2f MB/sec\n",
                       tab[from_host], 0, tab[to_host], proc, bandwidth);
                fflush(stdout);
            }
        }
    }

#ifdef PINNED
    CALL( cudaFreeHost(a_h) );
#else
    free(a_h);
#endif
    CALL( cudaFree(a_d) );
    MPI_Finalize();
    return 0;
}
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users