Thanks for your quick response,

1) mpiexec --allow-run-as-root --mca btl_openib_want_cuda_gdr 1 --mca btl_openib_cuda_rdma_limit 60000 --mca mpi_common_cuda_event_max 1000 -n 5 test/RunTests
2) Yes, CUDA-aware support using Mellanox IB.
3) Yes, we have the ability to use several versions of Open MPI, MVAPICH2, etc.
Also, our defaults for openmpi-mca-params.conf are:

  mtl=^mxm
  btl=^usnic,tcp
  btl_openib_flags=1

"service nv_peer_mem status" reports that the nv_peer_mem module is loaded.

Kindest Regards,
—
Steven Eliuk,

From: Rolf vandeVaart <rvandeva...@nvidia.com>
Reply-To: Open MPI Users <us...@open-mpi.org>
Date: Sunday, October 19, 2014 at 7:33 PM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] CuEventCreate Failed...

The error 304 corresponds to CUDA_ERROR_OPERATING_SYSTEM, which means an OS call failed. However, I am not sure how that relates to the call that is getting the error. Also, the last error you report is from MVAPICH2-GDR, not from Open MPI. I guess then I have a few questions.

1. Can you supply your configure line for Open MPI?
2. Are you making use of CUDA-aware support?
3. Are you set up so that users can use both Open MPI and MVAPICH2?

Thanks,
Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Steven Eliuk
Sent: Friday, October 17, 2014 6:48 PM
To: us...@open-mpi.org
Subject: [OMPI users] CuEventCreate Failed...

Hi All,

We have run into issues that do not really seem to materialize into incorrect results; nonetheless, we hope to figure out why we are getting them. We have several test environments, from a single machine with, say, 1-16 processes per node, to several machines with 1-16 processes each. All systems are certified by NVIDIA and use NVIDIA Tesla K40 GPUs.

We frequently see the following:

--------------------------------------------------------------------------
The call to cuEventCreate failed. This is a unrecoverable error and will
cause the program to abort.
  Hostname: aHost
  cuEventCreate return value: 304
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The call to cuIpcGetEventHandle failed. This is a unrecoverable error and
will cause the program to abort.
  cuIpcGetEventHandle return value: 304
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value: 304
  address: 0x700fd0400
Check the cuda.h file for what the return value means.
Perhaps a reboot of the node will clear the problem.
--------------------------------------------------------------------------

Now, our test suite still verifies results, but when the above happens it also causes the following:

--------------------------------------------------------------------------
The call to cuEventDestory failed. This is a unrecoverable error and will
cause the program to abort.
  cuEventDestory return value: 400
Check the cuda.h file for what the return value means.
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status,
thus causing the job to be terminated.
The first process to do so was:
  Process name: [[37290,1],2]
  Exit code: 1
--------------------------------------------------------------------------

We have traced the code back to the following file:

- ompi/mca/common/cuda/common_cuda.c :: mca_common_cuda_construct_event_and_handle()

We also know the following:

- it happens on every machine, on the very first entry to the function mentioned above,
- it does not happen if the buffer size is under 128 bytes... likely a different mechanism is used for the IPC.

Last, here is an intermittent one, and it produces a lot of failed tests in our suite when in fact they are solid apart from this error. It causes notifications and annoyances, and it would be nice to clean it up:

[mpi_rank_3][cudaipc_allocate_ipc_region] [src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_ipc.c:487] cuda failed with mapping of buffer object failed

We have not been able to duplicate these errors in other MPI libs.

Thank you for your time & looking forward to your response,

Kindest Regards,
—
Steven Eliuk, Ph.D. Comp Sci,
Advanced Software Platforms Lab,
SRA - SV, Samsung Electronics,
1732 North First Street, San Jose, CA 95112,
Work: +1 408-652-1976,
Work: +1 408-544-5781 Wednesdays,
Cell: +1 408-819-4407.
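For reference, the errors quoted above all come from the CUDA IPC handle calls (cuEventCreate with the interprocess flag, cuIpcGetEventHandle, cuIpcGetMemHandle). Below is a minimal standalone sketch of that call sequence, not the actual Open MPI code: the check() helper, the nvcc build line, and the 1 MB allocation size are illustrative assumptions. Running something like this directly on an affected node shows whether the 304 (CUDA_ERROR_OPERATING_SYSTEM) reproduces outside of MPI.

/* ipc_smoke_test.c: standalone sketch (not Open MPI code) exercising the
 * same CUDA driver calls reported as failing above.
 * Build (assuming a standard CUDA toolkit install):
 *   nvcc -o ipc_smoke_test ipc_smoke_test.c -lcuda
 */
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

/* Illustrative helper: print the symbolic name of a failing call, then exit. */
static void check(CUresult res, const char *what)
{
    if (res != CUDA_SUCCESS) {
        const char *name = NULL;
        cuGetErrorName(res, &name);  /* e.g. 304 -> CUDA_ERROR_OPERATING_SYSTEM */
        fprintf(stderr, "%s failed: %d (%s)\n", what, (int)res,
                name ? name : "unknown");
        exit(1);
    }
}

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUevent ev;
    CUipcEventHandle ev_handle;
    CUdeviceptr buf;
    CUipcMemHandle mem_handle;

    check(cuInit(0), "cuInit");
    check(cuDeviceGet(&dev, 0), "cuDeviceGet");
    check(cuCtxCreate(&ctx, 0, dev), "cuCtxCreate");

    /* CUDA requires CU_EVENT_DISABLE_TIMING together with CU_EVENT_INTERPROCESS
     * when an event will be shared via an IPC handle. */
    check(cuEventCreate(&ev, CU_EVENT_INTERPROCESS | CU_EVENT_DISABLE_TIMING),
          "cuEventCreate");
    check(cuIpcGetEventHandle(&ev_handle, ev), "cuIpcGetEventHandle");

    /* 1 MB allocation: well above the ~128-byte size below which the problem
     * was not observed, so the memory-handle path is exercised as well. */
    check(cuMemAlloc(&buf, 1 << 20), "cuMemAlloc");
    check(cuIpcGetMemHandle(&mem_handle, buf), "cuIpcGetMemHandle");

    printf("All CUDA IPC handle calls succeeded on this node.\n");

    cuMemFree(buf);
    cuEventDestroy(ev);
    cuCtxDestroy(ctx);
    return 0;
}

If the same 304 shows up with a standalone program like this, that would point below the MPI layer (driver, nv_peer_mem, or CUDA IPC support on the node) rather than at Open MPI itself.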