[OMPI users] OpenMPI-1.7.3 - cuda support
Hello,

I'm having problems running a simple CUDA-aware MPI application, the one found at
https://github.com/parallel-forall/code-samples/tree/master/posts/cuda-aware-mpi-example
I have modified the symbol ENV_LOCAL_RANK into OMPI_COMM_WORLD_LOCAL_RANK.
My cluster has 2 K20m GPUs per node, with a QLogic IB stack.

The normal CUDA/MPI application works fine, but the CUDA-aware MPI app crashes when using 2 MPI processes over the 2 GPUs of the same node; the error message is:

    Assertion failure at ptl.c:200: nbytes == msglen

I can send the complete backtrace from cuda-gdb if needed.

The same app, when running on 2 GPUs on 2 different nodes, gives another error:

    jacobi_cuda_aware_mpi:28280 terminated with signal 11 at PC=2aae9d7c9f78 SP=7fffc06c21f8.
    Backtrace:
    /gpfslocal/pub/local/lib64/libinfinipath.so.4(+0x8f78)[0x2aae9d7c9f78]

Can someone give me hints on where to look to track this problem?

Thank you.

Pierre Kestener.
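[Editor's note] The symbol change above concerns how each rank picks its GPU from the per-node (local) rank that the launcher exports in the environment. A minimal sketch of that pattern, assuming one rank per GPU and device selection done before MPI_Init (the helper name below is illustrative, not the sample's code):

    /* Sketch of local-rank-based GPU selection (assumption: one MPI rank
     * per GPU, local rank read from an environment variable set by mpirun). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* With Open MPI, the per-node rank is exported as OMPI_COMM_WORLD_LOCAL_RANK. */
    #define ENV_LOCAL_RANK "OMPI_COMM_WORLD_LOCAL_RANK"

    static void SelectDeviceByLocalRank(void)
    {
        const char *localRankStr = getenv(ENV_LOCAL_RANK);
        int localRank = localRankStr ? atoi(localRankStr) : 0;

        int deviceCount = 0;
        if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
            fprintf(stderr, "No CUDA device visible to local rank %d\n", localRank);
            exit(EXIT_FAILURE);
        }

        /* Two K20m GPUs per node: local ranks 0 and 1 map to devices 0 and 1. */
        cudaSetDevice(localRank % deviceCount);
    }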
Re: [OMPI users] OpenMPI-1.7.3 - cuda support
Dear Rolf,

thanks for looking into this. Here is the complete backtrace for execution using 2 GPUs on the same node:

(cuda-gdb) bt
#0  0x7711d885 in raise () from /lib64/libc.so.6
#1  0x7711f065 in abort () from /lib64/libc.so.6
#2  0x70387b8d in psmi_errhandler_psm (ep=, err=PSM_INTERNAL_ERR, error_string=, token=) at psm_error.c:76
#3  0x70387df1 in psmi_handle_error (ep=0xfffe, error=PSM_INTERNAL_ERR, buf=) at psm_error.c:154
#4  0x70382f6a in psmi_am_mq_handler_rtsmatch (toki=0x7fffc6a0, args=0x7fffed0461d0, narg=, buf=, len=) at ptl.c:200
#5  0x7037a832 in process_packet (ptl=0x737818, pkt=0x7fffed0461c0, isreq=) at am_reqrep_shmem.c:2164
#6  0x7037d90f in amsh_poll_internal_inner (ptl=0x737818, replyonly=0) at am_reqrep_shmem.c:1756
#7  amsh_poll (ptl=0x737818, replyonly=0) at am_reqrep_shmem.c:1810
#8  0x703a0329 in __psmi_poll_internal (ep=0x737538, poll_amsh=) at psm.c:465
#9  0x7039f0af in psmi_mq_wait_inner (ireq=0x7fffc848) at psm_mq.c:299
#10 psmi_mq_wait_internal (ireq=0x7fffc848) at psm_mq.c:334
#11 0x7037db21 in amsh_mq_send_inner (ptl=0x737818, mq=, epaddr=0x6eb418, flags=, tag=844424930131968, ubuf=0x130835, len=32768) at am_reqrep_shmem.c:2339
#12 amsh_mq_send (ptl=0x737818, mq=, epaddr=0x6eb418, flags=, tag=844424930131968, ubuf=0x130835, len=32768) at am_reqrep_shmem.c:2387
#13 0x7039ed71 in __psm_mq_send (mq=, dest=, flags=, stag=, buf=, len=) at psm_mq.c:413
#14 0x705c4ea8 in ompi_mtl_psm_send () from /gpfslocal/pub/openmpi/1.7.3/lib/openmpi/mca_mtl_psm.so
#15 0x71eeddea in mca_pml_cm_send () from /gpfslocal/pub/openmpi/1.7.3/lib/openmpi/mca_pml_cm.so
#16 0x779253da in PMPI_Sendrecv () from /gpfslocal/pub/openmpi/1.7.3/lib/libmpi.so.1
#17 0x004045ef in ExchangeHalos (cartComm=0x715460, devSend=0x130835, hostSend=0x7b8710, hostRecv=0x7c0720, devRecv=0x1308358000, neighbor=1, elemCount=4096) at CUDA_Aware_MPI.c:70
#18 0x004033d8 in TransferAllHalos (cartComm=0x715460, domSize=0x7fffcd80, topIndex=0x7fffcd60, neighbors=0x7fffcd90, copyStream=0xaa4450, devBlocks=0x7fffcd30, devSideEdges=0x7fffcd20, devHaloLines=0x7fffcd10, hostSendLines=0x7fffcd00, hostRecvLines=0x7fffccf0) at Host.c:400
#19 0x0040363c in RunJacobi (cartComm=0x715460, rank=0, size=2, domSize=0x7fffcd80, topIndex=0x7fffcd60, neighbors=0x7fffcd90, useFastSwap=0, devBlocks=0x7fffcd30, devSideEdges=0x7fffcd20, devHaloLines=0x7fffcd10, hostSendLines=0x7fffcd00, hostRecvLines=0x7fffccf0, devResidue=0x131048, copyStream=0xaa4450, iterations=0x7fffcd44, avgTransferTime=0x7fffcd48) at Host.c:466
#20 0x00401ccb in main (argc=4, argv=0x7fffcea8) at Jacobi.c:60

Pierre.
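[Editor's note] Frame #17 shows the assertion being hit underneath an MPI_Sendrecv that is called with device buffers (ExchangeHalos at CUDA_Aware_MPI.c:70). For readers unfamiliar with CUDA-aware MPI, a minimal sketch of that kind of halo exchange, with illustrative names and types rather than the sample's exact code:

    /* Sketch of a CUDA-aware halo exchange: device pointers are handed
     * directly to MPI_Sendrecv, so the MPI library (not the application)
     * is responsible for moving the data between GPU memory and the network.
     * Illustrative only; not the sample's exact code. */
    #include <mpi.h>

    void ExchangeHaloSketch(MPI_Comm cartComm, double *devSend, double *devRecv,
                            int neighbor, int elemCount)
    {
        /* With a CUDA-aware MPI, devSend/devRecv may be cudaMalloc'ed buffers. */
        MPI_Sendrecv(devSend, elemCount, MPI_DOUBLE, neighbor, 0,
                     devRecv, elemCount, MPI_DOUBLE, neighbor, 0,
                     cartComm, MPI_STATUS_IGNORE);
    }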
Re: [OMPI users] OpenMPI-1.7.3 - cuda support
Thanks for your help, it is working now; I hadn't noticed that limitation.

Best regards,

Pierre Kestener.

From: users [users-boun...@open-mpi.org] on behalf of Rolf vandeVaart [rvandeva...@nvidia.com]
Sent: Wednesday, October 30, 2013 17:26
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI-1.7.3 - cuda support

The CUDA-aware support is only available when running with the verbs interface to InfiniBand. It does not work with the PSM interface, which is being used in your installation. To verify this, you need to disable the usage of PSM. This can be done in a variety of ways, but try running like this:

    mpirun --mca pml ob1 ...

This will force the use of the verbs support layer (openib) with the CUDA-aware support.
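[Editor's note] A hedged example of what such a launch line might look like for the Jacobi sample above; the process count and host file name are illustrative, and the sample's own command-line arguments are omitted:

    # Force the ob1 PML (verbs/openib path) instead of PSM so the
    # CUDA-aware support is used.  -np count and host file are illustrative.
    mpirun --mca pml ob1 -np 2 --hostfile myhosts ./jacobi_cuda_aware_mpi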