Re: [OMPI users] MPIRUN Error on Mac pro i7 laptop and linux desktop
christophe petit wrote:

Thanks for your answers; the execution of this parallel program works fine at my workplace, but there we used MPICH2. I thought it would run with Open MPI too.

In your input deck, how big are x_domains and y_domains -- that is, iconf(3) and iconf(4)? Do they have to be changed if you change the number of processes you run on? Offhand, it looks like x_domains*y_domains = iconf(3)*iconf(4) should equal nproc. If you can run with nproc=1 and don't change the input deck, you won't be able to run with nproc/=1.

Given that the problem is in MPI_Cart_shift, could you produce a much smaller program that illustrates the error you're trying to understand?

Here is the f90 source where MPI_CART_SHIFT is called:

program heat
!**
!
! This program solves the heat equation on the unit square [0,1]x[0,1]
!     | du/dt - Delta(u) = 0
!     | u/gamma = cste
! by implementing an explicit scheme.
! The discretization is done using a 5 point finite difference scheme
! and the domain is decomposed into sub-domains.
! The PDE is discretized using a 5 point finite difference scheme
! over a (x_dim+2)*(x_dim+2) grid including the end points
! corresponding to the boundary points that are stored.
!
! The data on the whole domain are stored in
! the following way :
!
!            y
!
!   d    |                    |
!   i    |                    |
!   r    |                    |
!   e    |                    |
!   c    |                    |
!   t    |                    |
!   i    |  x20               |
!   o /\ |                    |
!   n |  |  x10               |
!     |  |                    |
!     |  |  x00 x01 x02 ...   |
!     |
!     ---> x direction    x(*,j)
!
! The boundary conditions are stored in the following submatrices
!
!     x(1:x_dim, 0)          ---> left temperature
!     x(1:x_dim, x_dim+1)    ---> right temperature
!     x(0, 1:x_dim)          ---> top temperature
!     x(x_dim+1, 1:x_dim)    ---> bottom temperature
!
!**
  implicit none
  include 'mpif.h'

  ! size of the discretization
  integer :: x_dim, nb_iter
  double precision, allocatable :: x(:,:), b(:,:), x0(:,:)
  double precision :: dt, h, epsilon
  double precision :: resLoc, result, t, tstart, tend
  !
  integer :: i, j
  integer :: step, maxStep
  integer :: size_x, size_y, me, x_domains, y_domains
  integer :: iconf(5), size_x_glo
  double precision conf(2)
  !
  ! MPI variables
  integer :: nproc, infompi, comm, comm2d, lda, ndims
  INTEGER, DIMENSION(2) :: dims
  LOGICAL, DIMENSION(2) :: periods
  LOGICAL, PARAMETER :: reorganisation = .false.
  integer :: row_type
  integer, parameter :: nbvi = 4
  integer, parameter :: S = 1, E = 2, N = 3, W = 4
  integer, dimension(4) :: neighBor
  !
  intrinsic abs
  !
  !
  call MPI_INIT(infompi)
  comm = MPI_COMM_WORLD
  call MPI_COMM_SIZE(comm, nproc, infompi)
  call MPI_COMM_RANK(comm, me, infompi)
  !
  !
  if (me.eq.0) then
     call readparam(iconf, conf)
  endif
  call MPI_BCAST(iconf, 5, MPI_INTEGER, 0, comm, infompi)
  call MPI_BCAST(conf, 2, MPI_DOUBLE_PRECISION, 0, comm, infompi)
  !
  size_x    = iconf(1)
  size_y    = iconf(1)
  x_domains = iconf(3)
  y_domains = iconf(4)
  maxStep   = iconf(5)
  dt        = conf(1)
  epsilon   = conf(2)
  !
  size_x_glo = x_domains*size_x + 2
  h  = 1.0d0/dble(size_x_glo)
  dt = 0.25*h*h
  !
  !
  lda = size_y + 2
  allocate(x(0:size_y+1, 0:size_x+1))
  allocate(x0(0:size_y+1, 0:size_x+1))
  allocate(b(0:size_y+1, 0:size_x+1))
  !
  ! Create 2D cartesian grid
  periods(:) = .false.
  ndims   = 2
  dims(1) = x_domains
  dims(2) = y_domains
  CALL MPI_CART_CREATE(MPI_COMM_WORLD, ndims, dims, periods, &
                       reorganisation, comm2d, infompi)
  !
  ! Identify neighbors
  !
  NeighBor(:) = MPI_PROC_NULL
  ! Left/West and Right/East neighbors
  CALL MPI_CART_SHIFT(comm2d, 0, 1, NeighBor(W), NeighBor(E), infompi)
  ! Bottom/South and Upper/North neighbors
  CALL MPI_CART_SHIFT(comm2d, 1, 1, NeighBor(S), NeighBor(N), infompi)
  !
  ! Create row data type to communicate with South and North neighbors
  !
  CALL MPI_TYPE_VECTOR(size_x, 1, size_y+2, MPI_DOUBLE_PRECISION, row_type, infompi)
  CALL MPI_TYPE_COMMIT(row_type, infompi)
  !
  ! initialization
  !
  call initvalues(x0, b, size_x+1, size_x)
  !
  ! Update the boundaries
  !
  call updateBound(x0, size_x, size_x, NeighBor, comm2d, row_type)

  step = 0
  t = 0.0
  !
  tstar
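As a starting point for the smaller reproducer requested above, here is a minimal sketch (an illustration written for this digest, not taken from the original post) that isolates the MPI_CART_CREATE and MPI_CART_SHIFT calls. The program name cart_shift_test and the hard-coded decomposition x_domains = 2, y_domains = 2 are placeholders; set them to the iconf(3) and iconf(4) values from the input deck and run with a matching number of processes.

program cart_shift_test
  implicit none
  include 'mpif.h'

  ! Placeholder decomposition: set these to iconf(3) and iconf(4) from the
  ! input deck. Their product must equal the number of processes.
  integer, parameter :: x_domains = 2, y_domains = 2

  integer :: comm2d, nproc, me, infompi, ndims
  integer, dimension(2) :: dims
  logical, dimension(2) :: periods
  integer, parameter :: S = 1, E = 2, N = 3, W = 4
  integer, dimension(4) :: neighBor

  call MPI_INIT(infompi)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, infompi)
  call MPI_COMM_RANK(MPI_COMM_WORLD, me, infompi)

  ! Sanity check: a 2D decomposition into x_domains*y_domains sub-domains
  ! needs exactly that many processes.
  if (x_domains*y_domains /= nproc) then
     if (me == 0) then
        print *, 'x_domains*y_domains = ', x_domains*y_domains, &
                 ' but nproc = ', nproc
     endif
     call MPI_ABORT(MPI_COMM_WORLD, 1, infompi)
  endif

  ndims      = 2
  dims(1)    = x_domains
  dims(2)    = y_domains
  periods(:) = .false.

  call MPI_CART_CREATE(MPI_COMM_WORLD, ndims, dims, periods, &
                       .false., comm2d, infompi)

  ! Same neighbor lookups as in the heat program.
  neighBor(:) = MPI_PROC_NULL
  call MPI_CART_SHIFT(comm2d, 0, 1, neighBor(W), neighBor(E), infompi)
  call MPI_CART_SHIFT(comm2d, 1, 1, neighBor(S), neighBor(N), infompi)

  print *, 'rank ', me, ' W/E/S/N = ', neighBor(W), neighBor(E), &
           neighBor(S), neighBor(N)

  call MPI_FINALIZE(infompi)
end program cart_shift_test

If dims(1)*dims(2) is larger than the number of processes, MPI_CART_CREATE itself should fail; if it is smaller, the surplus ranks receive MPI_COMM_NULL and the following MPI_CART_SHIFT is then the first call to report an error, which could explain a failure showing up in MPI_Cart_shift when the input deck and the mpirun process count disagree.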
[OMPI users] Implementing a new BTL module in MCA
Dear all,
I need to implement an MPI layer on top of a message-passing library which is currently used in a particular device on which I have to run MPI programs (very vague, I know :) ).

Instead of reinventing the wheel, my idea was to reuse most of the Open MPI implementation and just add a new layer to support my custom device. I guess that extending the Byte Transfer Layer of the Modular Component Architecture should do the job. Right?

Anyway, before I start spending time looking for documentation, I wanted to ask for some pointers to documentation on extending Open MPI. Which interfaces do I have to extend? Is there any "hello world" example of how to do it?

many thanks, Simone
Re: [OMPI users] OpenIB Error in ibv_create_srq
Hi: In response to my own question, by studying the file mca-btl-openib-device-params.ini, I discovered that this option in OMPI-1.4.2:

  -mca btl_openib_receive_queues P,65536,256,192,128

was sufficient to prevent OMPI from trying to create shared receive queues and allowed my application to run to completion using the IB hardware.

I guess my question now is: What do these numbers mean? Presumably the size (or counts?) of buffers to allocate? Are there limits or a way to tune these values?

Thanks,
Allen

On Mon, 2010-08-02 at 12:49 -0400, Allen Barnett wrote:
> Hi Terry:
> It is indeed the case that the openib BTL has not been initialized. I
> ran with your tcp-disabled MCA option and it aborted in MPI_Init.
>
> The OFED stack is what's included in RHEL4. It appears to be made up of
> the RPMs:
>   openib-1.4-1.el4
>   opensm-3.2.5-1.el4
>   libibverbs-1.1.2-1.el4
>
> How can I determine if srq is supported? Is there an MCA option to
> defeat it? (Our in-house cluster has more recent Mellanox IB hardware
> and is running this same IB stack and ompi 1.4.2 works OK, so I suspect
> srq is supported by the OpenFabrics stack. Perhaps.)
>
> Thanks,
> Allen
>
> On Mon, 2010-08-02 at 06:47 -0400, Terry Dontje wrote:
> > My guess is from the message below saying "(openib) BTL failed to
> > initialize" that the code is probably running over tcp. To
> > absolutely prove this you can specify to only use the openib, sm and
> > self btls to eliminate the tcp btl. To do that you add the following
> > to the mpirun line "-mca btl openib,sm,self". I believe with that
> > specification the code will abort and not run to completion.
> >
> > What version of the OFED stack are you using? I wonder if srq is
> > supported on your system or not?
> >
> > --td
> >
> > Allen Barnett wrote:
> > > Hi: A customer is attempting to run our OpenMPI 1.4.2-based
> > > application on a cluster of machines running RHEL4 with the standard
> > > OFED stack. The HCAs are identified as:
> > >
> > >   03:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1)
> > >   04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
> > >
> > > ibv_devinfo says that one port on the HCAs is active but the other is
> > > down:
> > >
> > >   hca_id: mthca0
> > >     fw_ver:         3.0.2
> > >     node_guid:      0006:6a00:9800:4c78
> > >     sys_image_guid: 0006:6a00:9800:4c78
> > >     vendor_id:      0x066a
> > >     vendor_part_id: 23108
> > >     hw_ver:         0xA1
> > >     phys_port_cnt:  2
> > >     port: 1
> > >       state:      active (4)
> > >       max_mtu:    2048 (4)
> > >       active_mtu: 2048 (4)
> > >       sm_lid:     1
> > >       port_lid:   26
> > >       port_lmc:   0x00
> > >     port: 2
> > >       state:      down (1)
> > >       max_mtu:    2048 (4)
> > >       active_mtu: 512 (2)
> > >       sm_lid:     0
> > >       port_lid:   0
> > >       port_lmc:   0x00
> > >
> > > When the OMPI application is run, it prints the error message:
> > >
> > >   The OpenFabrics (openib) BTL failed to initialize while trying to
> > >   create an internal queue. This typically indicates a failed
> > >   OpenFabrics installation, faulty hardware, or that Open MPI is
> > >   attempting to use a feature that is not supported on your hardware
> > >   (i.e., is a shared receive queue specified in the
> > >   btl_openib_receive_queues MCA parameter with a device that does not
> > >   support it?).
> > >
> > >   The failure occured here:
> > >
> > >     Local host:  machine001.lan
> > >     OMPI source: /software/openmpi-1.4.2/ompi/mca/btl/openib/btl_openib.c:250
> > >     Function:    ibv_create_srq()
> > >     Error:       Invalid argument (errno=22)
> > >     Device:      mthca0
> > >
> > >   You may need to consult with your system administrator to get this
> > >   problem fixed.
> > >
> > > The full log of a run with "btl_openib_verbose 1" is attached. My
> > > application appears to run to completion, but I can't tell if it's just
> > > running on TCP and not using the IB hardware.
> > >
> > > I would appreciate any suggestions on how to proceed to fix this error.
> > >
> > > Thanks,
> > > Allen

--
Allen Barnett
Transpire, Inc
E-Mail: al...@transpireinc.com
Skype: allenbarnett
Ph: 518-887-2930
Re: [OMPI users] OpenIB Error in ibv_create_srq
Sorry, I didn't see your prior question; glad you found the btl_openib_receive_queues parameter. There is not a FAQ entry for this, but I found the following in the openib BTL help file that spells out the parameters when using a per-peer receive queue (i.e. a receive queue setting with "P" as the first argument).

Per-peer receive queues require between 2 and 5 parameters:

  1. Buffer size in bytes (mandatory)
  2. Number of buffers (mandatory)
  3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
  4. Credit window size (optional; defaults to (low_watermark / 2))
  5. Number of buffers reserved for credit messages (optional;
     defaults to (num_buffers*2-1)/credit_window)

Example: P,128,256,128,16

  - 128 byte buffers
  - 256 buffers to receive incoming MPI messages
  - When the number of available buffers reaches 128, re-post 128 more
    buffers to reach a total of 256
  - If the number of available credits reaches 16, send an explicit
    credit message to the sender
  - Defaulting to ((256 * 2) - 1) / 16 = 31; this many buffers are
    reserved for explicit credit messages

--td

Allen Barnett wrote:
> Hi: In response to my own question, by studying the file
> mca-btl-openib-device-params.ini, I discovered that this option in
> OMPI-1.4.2:
>
>   -mca btl_openib_receive_queues P,65536,256,192,128
>
> was sufficient to prevent OMPI from trying to create shared receive queues
> and allowed my application to run to completion using the IB hardware.
>
> I guess my question now is: What do these numbers mean? Presumably the size
> (or counts?) of buffers to allocate? Are there limits or a way to tune
> these values?
>
> Thanks,
> Allen
>
> On Mon, 2010-08-02 at 12:49 -0400, Allen Barnett wrote:
> > Hi Terry:
> > It is indeed the case that the openib BTL has not been initialized. I
> > ran with your tcp-disabled MCA option and it aborted in MPI_Init.
> >
> > The OFED stack is what's included in RHEL4. It appears to be made up of
> > the RPMs:
> >   openib-1.4-1.el4
> >   opensm-3.2.5-1.el4
> >   libibverbs-1.1.2-1.el4
> >
> > How can I determine if srq is supported? Is there an MCA option to
> > defeat it? (Our in-house cluster has more recent Mellanox IB hardware
> > and is running this same IB stack and ompi 1.4.2 works OK, so I suspect
> > srq is supported by the OpenFabrics stack. Perhaps.)
> >
> > Thanks,
> > Allen
> >
> > On Mon, 2010-08-02 at 06:47 -0400, Terry Dontje wrote:
> > > My guess is from the message below saying "(openib) BTL failed to
> > > initialize" that the code is probably running over tcp. To
> > > absolutely prove this you can specify to only use the openib, sm and
> > > self btls to eliminate the tcp btl. To do that you add the following
> > > to the mpirun line "-mca btl openib,sm,self". I believe with that
> > > specification the code will abort and not run to completion.
> > >
> > > What version of the OFED stack are you using? I wonder if srq is
> > > supported on your system or not?
> > >
> > > --td
> > >
> > > Allen Barnett wrote:
> > > > Hi: A customer is attempting to run our OpenMPI 1.4.2-based
> > > > application on a cluster of machines running RHEL4 with the standard
> > > > OFED stack. The HCAs are identified as:
> > > >
> > > >   03:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1)
> > > >   04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
> > > >
> > > > ibv_devinfo says that one port on the HCAs is active but the other
> > > > is down:
> > > >
> > > >   hca_id: mthca0
> > > >     fw_ver:         3.0.2
> > > >     node_guid:      0006:6a00:9800:4c78
> > > >     sys_image_guid: 0006:6a00:9800:4c78
> > > >     vendor_id:      0x066a
> > > >     vendor_part_id: 23108
> > > >     hw_ver:         0xA1
> > > >     phys_port_cnt:  2
> > > >     port: 1
> > > >       state:      active (4)
> > > >       max_mtu:    2048 (4)
> > > >       active_mtu: 2048 (4)
> > > >       sm_lid:     1
> > > >       port_lid:   26
> > > >       port_lmc:   0x00
> > > >     port: 2
> > > >       state:      down (1)
> > > >       max_mtu:    2048 (4)
> > > >       active_mtu: 512 (2)
> > > >       sm_lid:     0
> > > >       port_lid:   0
> > > >       port_lmc:   0x00
> > > >
> > > > When the OMPI application is run, it prints the error message:
> > > >
> > > >   The OpenFabrics (openib) BTL failed to initialize while trying to
> > > >   create an internal queue. This typically indicates a failed
> > > >   OpenFabrics installation, faulty hardware, or that Open MPI is
> > > >   attempting to use a feature that is not supported on your hardware
> > > >   (i.e., is a shared receive queue specified in the
> > > >   btl_openib_receive_queues MCA parameter with a device that does not
> > > >   support it?).
> > > >
> > > >   The failure occured here:
> > > >
> > > >     Local host:  machine001.lan
> > > >     OMPI source: /software/openmpi-1.4.2/ompi/mca/btl/openib/btl_openib.c:250
> > > >     Function:    ibv_create_
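For reference, a complete invocation that combines the BTL selection suggested earlier in this thread with the per-peer receive queue specification above might look like the following sketch; the process count and the executable name ./my_app are placeholders, not taken from the original posts:

  mpirun -np 4 \
      -mca btl openib,sm,self \
      -mca btl_openib_receive_queues P,65536,256,192,128 \
      ./my_app

Reading "P,65536,256,192,128" against the parameter list above: 65536-byte buffers, 256 of them, a low buffer count watermark of 192, and a credit window of 128, with the credit-message reservation left at its default.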
Re: [OMPI users] Implementing a new BTL module in MCA
You can find the template for a BTL in ompi/mca/btl/template (you will find this on the Subversion trunk). Copy and rename the folder/files, and use this as a starting point.

For details on creating a new component (such as a new BTL) look here:
https://svn.open-mpi.org/trac/ompi/wiki/devel/CreateComponent

The following document might also be useful:
http://www.open-mpi.org/papers/trinity-btl-2009/xenmpi_report.pdf

Regards
--Nysal

On Tue, Aug 3, 2010 at 5:45 PM, Simone Pellegrini <spellegr...@dps.uibk.ac.at> wrote:
> Dear all,
> I need to implement an MPI layer on top of a message-passing library which
> is currently used in a particular device on which I have to run MPI
> programs (very vague, I know :) ).
>
> Instead of reinventing the wheel, my idea was to reuse most of the Open MPI
> implementation and just add a new layer to support my custom device. I
> guess that extending the Byte Transfer Layer of the Modular Component
> Architecture should do the job. Right?
>
> Anyway, before I start spending time looking for documentation, I wanted to
> ask for some pointers to documentation on extending Open MPI. Which
> interfaces do I have to extend? Is there any "hello world" example of how
> to do it?
>
> many thanks, Simone
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users