I am replying to my own post, since no one else replied.

With the help of MVAPICH2 developer S. Potluri the problem was isolated and 
fixed. As suspected, it was due to the library not intercepting the 
cudaHostAlloc() and cudaFreeHost() calls to register pinned memory, which 
would be required for the registration cache to work.
I replaced all of these calls in my code with standard 
posix_memalign()/cudaHostRegister() calls, and the application now runs fine 
with both MVAPICH2 and OpenMPI, with the registration cache enabled.
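
For reference, the replacement pattern looks roughly like this (a minimal 
sketch; the helper names, the fixed 4096-byte alignment and the omitted error 
checking are mine, not the actual application code):

    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Allocate page-aligned host memory with the standard allocator and
       page-lock it afterwards. The MPI library's malloc/free hooks see an
       ordinary heap allocation, so its registration cache stays consistent,
       while cudaMemcpy to/from the device still gets pinned-memory speed. */
    void *alloc_pinned(size_t bytes)
    {
        void *ptr = NULL;
        if (posix_memalign(&ptr, 4096, bytes) != 0)
            return NULL;
        cudaHostRegister(ptr, bytes, cudaHostRegisterDefault);
        return ptr;
    }

    /* Counterpart to the above, replacing cudaFreeHost(). */
    void free_pinned(void *ptr)
    {
        cudaHostUnregister(ptr);
        free(ptr);
    }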

It would be desirable to have both libraries intercept the calls to 
cudaHostAlloc()/cudaFreeHost() (I assume OpenMPI 1.7 will have some level of 
CUDA support), because otherwise applications using GPUDirect are not 
guaranteed to work correctly with them, i.e. they may exhibit undefined 
behavior.

Jens

On Nov 3, 2012, at 10:41 PM, Jens Glaser wrote:

> Hi,
> 
> I am working on a CUDA/MPI application. It uses page-locked host buffers 
> allocated with cudaHostAlloc(...,cudaHostAllocDefault), to which data from 
> the device is copied before calling MPI.
> The application, a particle simulation, reproducibly crashed or produced 
> undefined behavior at large particle numbers, and I could not explain why 
> this happened.
> After some considerable debugging time (trying two different MPI libraries, 
> MVAPICH2 1.9a and OpenMPI 1.6.1) I discovered OpenMPI's mpi_leave_pinned 
> parameter.
> Setting mpi_leave_pinned to 0 solved my problem; the crash did not occur 
> again! So far, excellent!
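> 
> For reference, the parameter can be set on the mpirun command line, roughly 
> like this (the process count and executable name are placeholders, not my 
> actual run command):
> 
>     mpirun --mca mpi_leave_pinned 0 -np 8 ./my_simulation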
> 
> I do have a request, however. After looking at the output of
> 
> $ ompi_info --param mpi all
> 
> I get
>                 MCA mpi: parameter "mpi_leave_pinned" (current value: <-1>,
>                          data source: default value)
>                          Whether to use the "leave pinned" protocol or not.
>                          Enabling this setting can help bandwidth performance
>                          when repeatedly sending and receiving large messages
>                          with the same buffers over RDMA-based networks
>                          (0 = do not use "leave pinned" protocol, 1 = use
>                          "leave pinned" protocol, -1 = allow network to
>                          choose at runtime).
> 
> This seems to indicate that, by default, the network adapter chooses whether 
> to enable or disable the leave-pinned protocol. In my case, this default 
> setting turns out to be disastrous.
> Also, the FAQ is somewhat ambiguous about this parameter: it states that 
> mpi_leave_pinned is off by default in one place, but that it is -1 (as 
> above) in another place.
> 
> http://www.open-mpi.org/faq/?category=openfabrics#large-message-leave-pinned
> http://www.open-mpi.org/faq/?category=openfabrics#setting-mpi-leave-pinned-1.3.2
> 
> Can anyone please explain the intricacies of this parameter, and what the 
> ramifications/benefits of this particular default value are?
> 
> Thanks
> Jens
> 
