[OMPI users] OpenMPI-1.7.3 - cuda support

2013-10-30 Thread KESTENER Pierre
Hello,

I'm having problems running a simple CUDA-aware MPI application, the one found
at
https://github.com/parallel-forall/code-samples/tree/master/posts/cuda-aware-mpi-example

I have modified the symbol ENV_LOCAL_RANK to OMPI_COMM_WORLD_LOCAL_RANK.
My cluster has 2 K20m GPUs per node, with a QLogic IB stack.
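
(For context, a minimal sketch of what that environment-variable-based GPU
selection typically looks like in C; the function name is illustrative, and it
assumes mpirun exports OMPI_COMM_WORLD_LOCAL_RANK:)

#include <stdlib.h>
#include <cuda_runtime.h>

/* Bind each MPI process to one GPU based on its local rank on the node. */
static void select_gpu_from_local_rank(void)
{
    const char *s = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = s ? atoi(s) : 0;
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    if (num_devices > 0)
        cudaSetDevice(local_rank % num_devices);  /* 2 K20m per node -> devices 0 and 1 */
}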

The normal CUDA/MPI application works fine, but the CUDA-aware MPI app crashes
when using 2 MPI processes on the 2 GPUs of the same node.  The error message is:
Assertion failure at ptl.c:200: nbytes == msglen
I can send the complete backtrace from cuda-gdb if needed.

The same app, when run on 2 GPUs on 2 different nodes, gives a different error:
jacobi_cuda_aware_mpi:28280 terminated with signal 11 at PC=2aae9d7c9f78 
SP=7fffc06c21f8.  Backtrace:
/gpfslocal/pub/local/lib64/libinfinipath.so.4(+0x8f78)[0x2aae9d7c9f78]


Can someone give me hints on where to look to track down this problem?
Thank you.

Pierre Kestener.




Re: [OMPI users] OpenMPI-1.7.3 - cuda support

2013-10-30 Thread Rolf vandeVaart
Let me try this out and see what happens for me.  But yes, please go ahead and 
send me the complete backtrace.
Rolf



Re: [OMPI users] OpenMPI-1.7.3 - cuda support

2013-10-30 Thread KESTENER Pierre
Dear Rolf,

Thanks for looking into this.
Here is the complete backtrace for execution using 2 GPUs on the same node:

(cuda-gdb) bt
#0  0x7711d885 in raise () from /lib64/libc.so.6
#1  0x7711f065 in abort () from /lib64/libc.so.6
#2  0x70387b8d in psmi_errhandler_psm (ep=,
err=PSM_INTERNAL_ERR, error_string=,
token=) at psm_error.c:76
#3  0x70387df1 in psmi_handle_error (ep=0xfffe,
error=PSM_INTERNAL_ERR, buf=) at psm_error.c:154
#4  0x70382f6a in psmi_am_mq_handler_rtsmatch (toki=0x7fffc6a0,
args=0x7fffed0461d0, narg=,
buf=, len=) at ptl.c:200
#5  0x7037a832 in process_packet (ptl=0x737818, pkt=0x7fffed0461c0,
isreq=) at am_reqrep_shmem.c:2164
#6  0x7037d90f in amsh_poll_internal_inner (ptl=0x737818, replyonly=0)
at am_reqrep_shmem.c:1756
#7  amsh_poll (ptl=0x737818, replyonly=0) at am_reqrep_shmem.c:1810
#8  0x703a0329 in __psmi_poll_internal (ep=0x737538,
poll_amsh=) at psm.c:465
#9  0x7039f0af in psmi_mq_wait_inner (ireq=0x7fffc848)
at psm_mq.c:299
#10 psmi_mq_wait_internal (ireq=0x7fffc848) at psm_mq.c:334
#11 0x7037db21 in amsh_mq_send_inner (ptl=0x737818,
mq=, epaddr=0x6eb418, flags=,
tag=844424930131968, ubuf=0x130835, len=32768)
at am_reqrep_shmem.c:2339
#12 amsh_mq_send (ptl=0x737818, mq=, epaddr=0x6eb418,
flags=, tag=844424930131968, ubuf=0x130835,
len=32768) at am_reqrep_shmem.c:2387
#13 0x7039ed71 in __psm_mq_send (mq=,
dest=, flags=,
stag=, buf=,
len=) at psm_mq.c:413
#14 0x705c4ea8 in ompi_mtl_psm_send ()
   from /gpfslocal/pub/openmpi/1.7.3/lib/openmpi/mca_mtl_psm.so
#15 0x71eeddea in mca_pml_cm_send ()
   from /gpfslocal/pub/openmpi/1.7.3/lib/openmpi/mca_pml_cm.so
#16 0x779253da in PMPI_Sendrecv ()
   from /gpfslocal/pub/openmpi/1.7.3/lib/libmpi.so.1
#17 0x004045ef in ExchangeHalos (cartComm=0x715460,
devSend=0x130835, hostSend=0x7b8710, hostRecv=0x7c0720,
devRecv=0x1308358000, neighbor=1, elemCount=4096) at CUDA_Aware_MPI.c:70
#18 0x004033d8 in TransferAllHalos (cartComm=0x715460,
domSize=0x7fffcd80, topIndex=0x7fffcd60, neighbors=0x7fffcd90,
copyStream=0xaa4450, devBlocks=0x7fffcd30,
devSideEdges=0x7fffcd20, devHaloLines=0x7fffcd10,
hostSendLines=0x7fffcd00, hostRecvLines=0x7fffccf0) at Host.c:400
#19 0x0040363c in RunJacobi (cartComm=0x715460, rank=0, size=2,
domSize=0x7fffcd80, topIndex=0x7fffcd60, neighbors=0x7fffcd90,
useFastSwap=0, devBlocks=0x7fffcd30, devSideEdges=0x7fffcd20,
devHaloLines=0x7fffcd10, hostSendLines=0x7fffcd00,
hostRecvLines=0x7fffccf0, devResidue=0x131048,
copyStream=0xaa4450, iterations=0x7fffcd44,
avgTransferTime=0x7fffcd48) at Host.c:466
#20 0x00401ccb in main (argc=4, argv=0x7fffcea8) at Jacobi.c:60

Pierre.







Re: [OMPI users] OpenMPI-1.7.3 - cuda support

2013-10-30 Thread Rolf vandeVaart
The CUDA-aware support is only available when running with the verbs interface
to InfiniBand.  It does not work with the PSM interface, which is what is being
used in your installation.
To verify this, you need to disable the usage of PSM.  This can be done in a 
variety of ways, but try running like this:

mpirun -mca pml ob1 ...

This will force the use of the verbs support layer (openib) with the CUDA-aware 
support.
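
(For example -- the process count is illustrative, and the binary takes its
usual arguments:

mpirun -np 2 -mca pml ob1 ./jacobi_cuda_aware_mpi ...)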


[OMPI users] ofed installation

2013-10-30 Thread Robo Beans
Hello everyone,

I am trying to install OFED 1.5.3.2 on CentOS 6.4 using the provided
install.pl, but I am getting the following error:

/lib/modules/2.6.32-358.el6.x86_64/build/scripts is required to build
kernel-ib RPM.

// info. about current kernel

$ uname -a

Linux scc-10-2-xx-xx-xyz.com 2.6.32-358.el6.x86_64 #1 SMP Fri Feb 22
00:31:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux


If possible, could someone from the group please direct me on what needs to be
done to resolve this issue? Thanks!


Robo


Re: [OMPI users] ofed installation

2013-10-30 Thread Ralph Castain
Looks to me like that's an error from the OFED installer, not something from 
OMPI. Have you tried their mailing list?





Re: [OMPI users] ofed installation

2013-10-30 Thread Robo Beans
I did try the OFED forum:
https://www.openfabrics.org/forum/7-installation/882-ofed-1532.html#882

but I was wondering whether group members have faced a similar issue while
installing OFED, and what steps they followed to resolve it. Thanks!


Robo




[OMPI users] Prototypes for Fortran MPI_ commands using 64-bit indexing

2013-10-30 Thread Jim Parker
Hello,
  I have recently built a cluster that uses the 64-bit indexing feature of
OpenMPI following the directions at
http://wiki.chem.vu.nl/dirac/index.php/How_to_build_MPI_libraries_for_64-bit_integers

My question is: what are the new prototypes for the MPI calls? Specifically:
MPI_RECV
MPI_Allgatherv

I'm curious because some of my local variables get killed (set to null)
upon my first call to MPI_RECV.  Typically, this is due (in Fortran) to
someone not setting the 'status' variable to an appropriate array size.

However, my declaration for status is
integer (kind=mpi_int_kind) :: status(MPI_STATUS_SIZE)

A typical call to MPI_Recv is
call MPI_RECV(num_array, length, MPI_INTEGER, 0,0,MPI_COMM_WORLD, status,
mpierr)

where the following definitions are used,
mpi_int_kind=8 (for gcc/gfortran compiler)

integer,parameter :: length = 
integer :: num_array(length)
integer :: mpierr

My review of mpif.h and mpi.h seems to indicate that the functions are
defined as C int types and therefore, I assume, the coercion during the
compile makes the library support 64-bit indexing, i.e. int -> long int.

The documentation on MPI_Recv only mentions the prototype for ints (32-bit);
I can't find anything for 64-bit:
http://www.open-mpi.org/doc/v1.6/

Any help would be appreciated.
The output from ompi_info --all is attached.

Cheers,
--Jim Parker

BTW, the code works fine when linked against a 32-bit MPI library.
 Package: Open MPI r...@tsrl-master.army.mil Distribution
Open MPI: 1.6.5
   Open MPI SVN revision: r28673
   Open MPI release date: Jun 26, 2013
Open RTE: 1.6.5
   Open RTE SVN revision: r28673
   Open RTE release date: Jun 26, 2013
OPAL: 1.6.5
   OPAL SVN revision: r28673
   OPAL release date: Jun 26, 2013
 MPI API: 2.1
Ident string: 1.6.5
   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.6.5)
  MCA memory: linux (MCA v2.0, API v2.0, Component v1.6.5)
   MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.5)
   MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.6.5)
   MCA carto: file (MCA v2.0, API v2.0, Component v1.6.5)
   MCA shmem: mmap (MCA v2.0, API v2.0, Component v1.6.5)
   MCA shmem: posix (MCA v2.0, API v2.0, Component v1.6.5)
   MCA shmem: sysv (MCA v2.0, API v2.0, Component v1.6.5)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.6.5)
   MCA maffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.5)
   MCA timer: linux (MCA v2.0, API v2.0, Component v1.6.5)
 MCA installdirs: env (MCA v2.0, API v2.0, Component v1.6.5)
 MCA installdirs: config (MCA v2.0, API v2.0, Component v1.6.5)
 MCA sysinfo: linux (MCA v2.0, API v2.0, Component v1.6.5)
   MCA hwloc: hwloc132 (MCA v2.0, API v2.0, Component v1.6.5)
 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.6.5)
  MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.6.5)
   MCA allocator: basic (MCA v2.0, API v2.0, Component v1.6.5)
   MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: self (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: tuned (MCA v2.0, API v2.0, Component v1.6.5)
  MCA io: romio (MCA v2.0, API v2.0, Component v1.6.5)
   MCA mpool: fake (MCA v2.0, API v2.0, Component v1.6.5)
   MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.6.5)
   MCA mpool: sm (MCA v2.0, API v2.0, Component v1.6.5)
 MCA pml: bfo (MCA v2.0, API v2.0, Component v1.6.5)
 MCA pml: csum (MCA v2.0, API v2.0, Component v1.6.5)
 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.6.5)
 MCA pml: v (MCA v2.0, API v2.0, Component v1.6.5)
 MCA bml: r2 (MCA v2.0, API v2.0, Component v1.6.5)
  MCA rcache: vma (MCA v2.0, API v2.0, Component v1.6.5)
 MCA btl: self (MCA v2.0, API v2.0, Component v1.6.5)
 MCA btl: sm (MCA v2.0, API v2.0, Component v1.6.5)
 MCA btl: tcp (MCA v2.0, API v2.0, Component v1.6.5)
MCA topo: unity (MCA v2.0, API v2.0, Component v1.6.5)
 MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.6.5)
 MCA osc: rdma (MCA v2.0, API v2.0, Component v1.6.5)
 MCA iof: hnp (MCA v2.0, API v2.0, Component v1.6.5)
 MCA iof: orted (MCA v2.0, API v2.0, Component v1.6.5)
 MCA 

Re: [OMPI users] Prototypes for Fortran MPI_ commands using 64-bit indexing

2013-10-30 Thread Ralph Castain
I believe this has been a long-standing issue with the MPI definitions - they 
specify "int", which on most systems will default to int32_t. Thus, there are 
no prototypes for 64-bit interfaces.




Re: [OMPI users] ofed installation

2013-10-30 Thread Ralph Castain
Afraid I don't, but maybe someone else here does...




Re: [OMPI users] ofed installation

2013-10-30 Thread Jeff Squyres (jsquyres)
I think you'll have better luck with the OFED support channels -- this list is 
mainly about supporting Open MPI.




-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Prototypes for Fortran MPI_ commands using 64-bit indexing

2013-10-30 Thread Jim Parker
Ralph,
  If I understand your comment, there is no standard way to define 64-bit
MPI calls.  So how does OpenMPI recommend I pass information?  Just
declaring some 64-bit integers is not working.
Is there a working example somewhere?

Cheers,
--Jim





Re: [OMPI users] ofed installation

2013-10-30 Thread Elken, Tom
Just to give a quick pointer...  RHEL 6.4 is pretty new, and OFED 1.5.3.2 is 
pretty old, so that is likely the root of your issue.

I believe the first OFED that supported RHEL 6.4, which is roughly equivalent
to CentOS 6.4, is OFED 3.5-1:
http://www.openfabrics.org/downloads/OFED/ofed-3.5-1/

What also might work for you (with newer packages and more bug fixes) is 3.5-2
RC2, from
http://www.openfabrics.org/downloads/OFED/ofed-3.5-2/

-Tom




Re: [OMPI users] OpenMPI-1.7.3 - cuda support

2013-10-30 Thread KESTENER Pierre
Thanks for your help, it is working now; I hadn't noticed that limitation.

Best regards,

Pierre Kestener.





Re: [OMPI users] ofed installation

2013-10-30 Thread Robo Beans
Thanks, guys, for your time.

I have the latest versions of kernel and kernel-devel
(kernel-2.6.32-358.23.2.el6.x86_64 and kernel-devel-2.6.32-358.23.2.el6.x86_64),
but I believe the OFED installer was looking for the base version of kernel and
kernel-devel (2.6.32-358.el6.x86_64):

root@scc-10-2-xx-xx:/opt/OFED-1.5.3.2# rpm -qa | grep "kernel"
kernel-devel-2.6.32-358.23.2.el6.x86_64
kernel-firmware-2.6.32-358.23.2.el6.noarch
dracut-kernel-004-303.el6.noarch
kernel-2.6.32-358.el6.x86_64
kernel-headers-2.6.32-358.23.2.el6.x86_64
kernel-2.6.32-358.14.1.el6.x86_64
kernel-2.6.32-358.23.2.el6.x86_64

In order to take care of this error
("/lib/modules/2.6.32-358.el6.x86_64/build/scripts is required to build
kernel-ib RPM."),

I had to roll back to "kernel-devel-2.6.32-358.el6.x86_64.rpm":

rpm -Uvh --oldpackage kernel-devel-2.6.32-358.el6.x86_64.rpm
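
(As a sanity check, the installed kernel-devel can be compared against the
running kernel with standard commands:

uname -r
rpm -q kernel-devel)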



Robo




Re: [OMPI users] Prototypes for Fortran MPI_ commands using 64-bit indexing

2013-10-30 Thread Jeff Squyres (jsquyres)
On Oct 30, 2013, at 4:35 PM, Jim Parker  wrote:

>   I have recently built a cluster that uses the 64-bit indexing feature of 
> OpenMPI following the directions at
> http://wiki.chem.vu.nl/dirac/index.php/How_to_build_MPI_libraries_for_64-bit_integers

That should be correct (i.e., passing -i8 in FFLAGS and FCFLAGS for OMPI 1.6.x).

> My question is what are the new prototypes for the MPI calls ?
> specifically
> MPI_RECV
> MPI_Allgathterv

They're the same as they've always been.  

The magic is that the -i8 flag tells the compiler "make all Fortran INTEGERs be 
8 bytes, not (the default) 4."  So Ralph's answer was correct in that all the 
MPI parameters are INTEGERs -- but you can tell the compiler that all INTEGERs 
are 8 bytes, not 4, and therefore get "large" integers.

Note that this means that you need to compile your application with -i8, too.  
That will make *your* INTEGERs also be 8 bytes, and then you'll match what Open 
MPI is doing.
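
(For example, with gfortran the corresponding flag is -fdefault-integer-8, so
compiling the application would look something like this -- the file names are
illustrative:

mpif90 -fdefault-integer-8 -o my_app my_app.f90)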

> I'm curious because some off my local variables get killed (set to null) upon 
> my first call to MPI_RECV.  Typically, this is due (in Fortran) to someone 
> not setting the 'status' variable to an appropriate array size.

If you didn't compile your application with -i8, this could well be because 
your application is treating INTEGERs as 4 bytes, but OMPI is treating INTEGERs 
as 8 bytes.  Nothing good can come from that.

If you *did* compile your application with -i8 and you're seeing this kind of 
wonkyness, we should dig deeper and see what's going on.

> My review of mpif.h and mpi.h seem to indicate that the functions are defined 
> as C int types and therefore , I assume, the coercion during the compile 
> makes the library support 64-bit indexing.  ie. int -> long int

FWIW: We actually define a type MPI_Fint; its actual type is determined by 
configure (int or long int, IIRC).  When your Fortran code calls C, we use the 
MPI_Fint type for parameters, and so it will be either a 4 or 8 byte integer 
type.
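
(For illustration only -- this is not Open MPI's actual source, and the symbol
name and argument list are hypothetical -- a Fortran-callable C shim in the
style described above might look like this:)

#include <mpi.h>

/* Sketch: every Fortran INTEGER argument arrives as an MPI_Fint, which
   configure maps to a 4- or 8-byte C integer type. */
void my_recv_shim_(void *buf, MPI_Fint *count, MPI_Fint *datatype,
                   MPI_Fint *source, MPI_Fint *tag, MPI_Fint *comm,
                   MPI_Fint *status, MPI_Fint *ierr)
{
    MPI_Status c_status;
    *ierr = (MPI_Fint) MPI_Recv(buf, (int) *count,
                                MPI_Type_f2c(*datatype), (int) *source,
                                (int) *tag, MPI_Comm_f2c(*comm), &c_status);
    MPI_Status_c2f(&c_status, status);  /* copy the C status back to Fortran */
}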

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Prototypes for Fortran MPI_ commands using 64-bit indexing

2013-10-30 Thread Jim Parker
Jeff and Ralph,
  Ok, I downshifted to a helloWorld example (attached); bottom line, after I
hit the MPI_Recv call, my local variable (rank) gets borked.

I have compiled with -m64 -fdefault-integer-8 and have even assigned kind=8
to the integers (which would be the preferred method in my case).

Your help is appreciated.

Cheers,
--Jim





mpi-test-64bit.tar.bz2
Description: BZip2 compressed data


Re: [OMPI users] Prototypes for Fortran MPI_ commands using 64-bit indexing

2013-10-30 Thread Jeff Squyres (jsquyres)
Can you send the information listed here:

http://www.open-mpi.org/community/help/




-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Prototypes for Fortran MPI_ commands using 64-bit indexing

2013-10-30 Thread Jim Parker
Jeff,
  Here's what I know:
1.  Checked FAQs.  Done
2.  Version 1.6.5
3. config.log file has been removed by the sysadmin...
4. ompi_info -a from the head node is attached as headnode.out
5. N/A
6. compute node info is attached as compute-x-yy.out
7. As discussed, local variables are being overwritten after calls to
MPI_RECV from Fortran code
8. ifconfig output from head node and computes listed as *-ifconfig.out

Cheers,
--Jim




ompi.info.tar.bz2
Description: BZip2 compressed data


Re: [OMPI users] Prototypes for Fortran MPI_ commands using 64-bit indexing

2013-10-30 Thread Martin Siegert
Hi Jim,

I have quite a bit of experience with compiling openmpi for dirac.
Here is what I use to configure openmpi:

./configure --prefix=$instdir \
--disable-silent-rules \
--enable-mpirun-prefix-by-default \
--with-threads=posix \
--enable-cxx-exceptions \
--with-tm=$torquedir \
--with-wrapper-ldflags="-Wl,-rpath,${instdir}/lib" \
--with-openib \
--with-hwloc=$hwlocdir \
CC=gcc \
CXX=g++ \
FC="$FC" \
F77="$FC" \
CFLAGS="-O3" \
CXXFLAGS="-O3" \
FFLAGS="-O3 $I8FLAG" \
FCFLAGS="-O3 $I8FLAG"

You need to set FC to either ifort or gfortran (those are the two compilers
that I have used) and set I8FLAG to -fdefault-integer-8 for gfortran or
-i8 for ifort.
Set torquedir to the directory where torque is installed ($torquedir/lib
must contain libtorque.so), if you are running jobs under torque; otherwise
remove the --with-tm=... line.
Set hwlocdir to the directory where you have hwloc installed. You may not
need the --with-hwloc=... option because openmpi comes with a hwloc version
(I don't have experience with that because we install hwloc independently).
Set instdir to the directory where you want to install openmpi.
You may or may not need the --with-openib option depending on whether
you have an Infiniband interconnect.
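
(For instance, with gfortran the variables above might be set like this before
running the configure line; the install path is just a placeholder:

export FC=gfortran
export I8FLAG=-fdefault-integer-8
export instdir=$HOME/openmpi-1.6.5-i8)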

After configure/make/make install, this compiled version can be used
with dirac without changing the dirac source code.
(There is one caveat: you should make sure that all "count" variables
in MPI calls in dirac are smaller than 2^31-1. I have run into a few cases
where that is not the case; this problem can be overcome by replacing the
MPI_Allreduce calls in dirac with a wrapper that calls MPI_Allreduce
repeatedly.) This is what I use to set up dirac:

export PATH=$instdir/bin
./setup --prefix=$diracinstdir \
--fc=mpif90 \
--cc=mpicc \
--int64 \
--explicit-libs="-lmkl_intel_ilp64 -lmkl_sequential -lmkl_core"

where $instdir is the directory where you installed openmpi from above.

I would never use the so-compiled openmpi version for anything other
than dirac though. I am not saying that it cannot work (at a minimum
you need to compile Fortran programs with the appropriate I8FLAG),
but it is an unnecessary complication: I have not encountered a piece
of software other than dirac that requires this.

Cheers,
Martin

-- 
Martin Siegert
Head, Research Computing
WestGrid/ComputeCanada Site Lead
Simon Fraser University
Burnaby, British Columbia
Canada


Re: [OMPI users] Prototypes for Fortran MPI_ commands using 64-bit indexing

2013-10-30 Thread Jeff Squyres (jsquyres)
I've compiled your application and seen similar behavior (I snipped one of the 
writes and abbreviated another):

-
 Iam = 3 received 8
 Iam = 0 received 3
 Iam = 1 received 8
 Iam = 2 received 8
-

The rank 0 output line is somewhat bogus; it's just the last value that rank 0 
sent.

The fact that 1, 2, and 3 are displaying 8 seems damning.

However, I *think* you're just seeing an over-aggressive Fortran optimizing 
compiler.

I say this because if I access tempInt(1) in any of the non-zero MCW rank 
processes before the MPI_RECV, then everything works fine.  For example, if I 
add the following line as the first executable line:


 tempInt(1) = 38


Then the program runs as expected:

-
 Iam = 3 received 3
 Iam = 0 received 3
 Iam = 1 received 1
 Iam = 2 received 2
-

Indeed, even if I write/print tempInt(1) before the MPI_RECV, then it works as 
expected.  Or even call MPI_Address on tempInt(1).

I'm not enough of a wizard to know for sure, but I *think* that there are some 
funny rules in Fortran about how the compiler can treat memory that it doesn't 
know for sure has been initialized.  And since mpif.h doesn't provide a 
prototype for MPI_RECV, the compiler doesn't know that the buffer is an OUT 
variable, and therefore it should disregard what was in there beforehand, etc.

I'm not 100% sure of this, though -- and I'm a little puzzled as to why the 
behavior would be different between 32 and 64 bit builds.  Perhaps a Fortran 
wizard can comment here...?



[OMPI users] Fwd: Prototypes for Fortran MPI_ commands using 64-bit indexing

2013-10-30 Thread Jim Parker
Ok, all, where to begin...

Perhaps I should start with the most pressing issue for me.  I need 64-bit
indexing.

@Martin,
   you indicated that even if I get this up and running, the MPI library
still uses signed 32-bit ints to count (your term), or index (my term), the
recv-buffer lengths.  More concretely, in a call to
MPI_Allgatherv(buffer, count, MPI_INTEGER, recvbuf, recvcounts, displs,
MPI_INTEGER, MPI_COMM_WORLD, mpierr): count, recvcounts, and displs must be
32-bit integers, not 64-bit.  Actually, all I need is displs to hold 64-bit
values...
If this is true, then compiling OpenMPI this way is not a solution.  I'll
have to restructure my code to collect 31-bit chunks...
Not that it matters, but I'm not using DIRAC, but a custom code to compute
circuit analyses.
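
(For illustration only: a minimal C sketch of the kind of chunked wrapper
Martin describes, splitting one large reduction into pieces whose counts each
fit in a signed 32-bit int.  The function name, element type, and chunk limit
are arbitrary.)

#include <limits.h>
#include <mpi.h>

/* Sketch: reduce a large double array by calling MPI_Allreduce repeatedly
   so that each call's count stays below 2^31. */
static int allreduce_chunked(double *sendbuf, double *recvbuf, long long count,
                             MPI_Op op, MPI_Comm comm)
{
    long long offset = 0;
    while (offset < count) {
        long long left = count - offset;
        int n = (int) (left < INT_MAX ? left : INT_MAX);
        int rc = MPI_Allreduce(sendbuf + offset, recvbuf + offset, n,
                               MPI_DOUBLE, op, comm);
        if (rc != MPI_SUCCESS)
            return rc;
        offset += n;
    }
    return MPI_SUCCESS;
}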

@Jeff,
  Interesting, your runtime behavior has a different error than mine.  You
have problems with the passed variable tempInt, which would make sense for
the reasons you gave.  However, my problem involves the fact that the local
variable "rank" gets overwritten by a memory corruption after MPI_RECV is
called.

Re: config.log. I will try to have the admin guy recompile tomorrow and see
if I can get the log for you.

BTW, I'm using the gcc 4.7.2 compiler suite on a Rocks 5.4 HPC cluster.  I
use the options -m64 and -fdefault-integer-8

Cheers,
--Jim


