[OMPI users] OpenIB Error in ibv_create_srq
Hi: A customer is attempting to run our OpenMPI 1.4.2-based application on a cluster of machines running RHEL4 with the standard OFED stack. The HCAs are identified as: 03:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1) 04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) ibv_devinfo says that one port on the HCAs is active but the other is down: hca_id: mthca0 fw_ver: 3.0.2 node_guid: 0006:6a00:9800:4c78 sys_image_guid: 0006:6a00:9800:4c78 vendor_id: 0x066a vendor_part_id: 23108 hw_ver: 0xA1 phys_port_cnt: 2 port: 1 state: active (4) max_mtu:2048 (4) active_mtu: 2048 (4) sm_lid: 1 port_lid: 26 port_lmc: 0x00 port: 2 state: down (1) max_mtu:2048 (4) active_mtu: 512 (2) sm_lid: 0 port_lid: 0 port_lmc: 0x00 When the OMPI application is run, it prints the error message: The OpenFabrics (openib) BTL failed to initialize while trying to create an internal queue. This typically indicates a failed OpenFabrics installation, faulty hardware, or that Open MPI is attempting to use a feature that is not supported on your hardware (i.e., is a shared receive queue specified in the btl_openib_receive_queues MCA parameter with a device that does not support it?). The failure occured here: Local host: machine001.lan OMPI source: /software/openmpi-1.4.2/ompi/mca/btl/openib/btl_openib.c:250 Function:ibv_create_srq() Error: Invalid argument (errno=22) Device: mthca0 You may need to consult with your system administrator to get this problem fixed. The full log of a run with "btl_openib_verbose 1" is attached. My application appears to run to completion, but I can't tell if it's just running on TCP and not using the IB hardware. I would appreciate any suggestions on how to proceed to fix this error. Thanks, Allen -- Allen Barnett Transpire, Inc E-Mail: al...@transpireinc.com openib.listing.gz Description: GNU Zip compressed data
Re: [OMPI users] OpenIB Error in ibv_create_srq
Hi Terry: It is indeed the case that the openib BTL has not been initialized. I ran with your tcp-disabled MCA option and it aborted in MPI_Init. The OFED stack is what's included in RHEL4. It appears to be made up of the RPMs: openib-1.4-1.el4 opensm-3.2.5-1.el4 libibverbs-1.1.2-1.el4 How can I determine if srq is supported? Is there an MCA option to defeat it? (Our in-house cluster has more recent Mellanox IB hardware and is running this same IB stack and ompi 1.4.2 works OK, so I suspect srq is supported by the OpenFabrics stack. Perhaps.) Thanks, Allen On Mon, 2010-08-02 at 06:47 -0400, Terry Dontje wrote: > My guess is from the message below saying "(openib) BTL failed to > initialize" that the code is probably running over tcp. To > absolutely prove this you can specify to only use the openib, sm and > self btls to eliminate the tcp btl. To do that you add the following > to the mpirun line "-mca btl openib,sm,self". I believe with that > specification the code will abort and not run to completion. > > What version of the OFED stack are you using? I wonder if srq is > supported on your system or not? > > --td > > Allen Barnett wrote: > > Hi: A customer is attempting to run our OpenMPI 1.4.2-based application > > on a cluster of machines running RHEL4 with the standard OFED stack. The > > HCAs are identified as: > > > > 03:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1) > > 04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) > > > > ibv_devinfo says that one port on the HCAs is active but the other is > > down: > > > > hca_id: mthca0 > > fw_ver: 3.0.2 > > node_guid: 0006:6a00:9800:4c78 > > sys_image_guid: 0006:6a00:9800:4c78 > > vendor_id: 0x066a > > vendor_part_id: 23108 > > hw_ver: 0xA1 > > phys_port_cnt: 2 > > port: 1 > > state: active (4) > > max_mtu:2048 (4) > > active_mtu: 2048 (4) > > sm_lid: 1 > > port_lid: 26 > > port_lmc: 0x00 > > > > port: 2 > > state: down (1) > > max_mtu:2048 (4) > > active_mtu: 512 (2) > > sm_lid: 0 > > port_lid: 0 > > port_lmc: 0x00 > > > > > > When the OMPI application is run, it prints the error message: > > > > > > The OpenFabrics (openib) BTL failed to initialize while trying to > > create an internal queue. This typically indicates a failed > > OpenFabrics installation, faulty hardware, or that Open MPI is > > attempting to use a feature that is not supported on your hardware > > (i.e., is a shared receive queue specified in the > > btl_openib_receive_queues MCA parameter with a device that does not > > support it?). The failure occured here: > > > > Local host: machine001.lan > > OMPI > > source: /software/openmpi-1.4.2/ompi/mca/btl/openib/btl_openib.c:250 > > Function:ibv_create_srq() > > Error: Invalid argument (errno=22) > > Device: mthca0 > > > > You may need to consult with your system administrator to get this > > problem fixed. > > > > > > The full log of a run with "btl_openib_verbose 1" is attached. My > > application appears to run to completion, but I can't tell if it's just > > running on TCP and not using the IB hardware. > > > > I would appreciate any suggestions on how to proceed to fix this error. > > > > Thanks, > > Allen > -- Allen Barnett Transpire, Inc E-Mail: al...@transpireinc.com
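One way to answer the "is SRQ supported?" question directly, independent of Open MPI, is to make the same verbs call the openib BTL makes and see whether the device accepts it. Below is a minimal stand-alone sketch (queue sizes are illustrative, not the BTL's exact values); "ibv_devinfo -v" should also report max_srq / max_srq_wr limits if the stack exposes them.

--
/* srq_probe.c: stand-alone check of ibv_create_srq() on the first IB device.
   Build with something like: gcc srq_probe.c -o srq_probe -libverbs */
#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    struct ibv_context *ctx;
    struct ibv_pd *pd;
    struct ibv_srq_init_attr attr;
    struct ibv_srq *srq;

    if (!devs || !devs[0]) { fprintf(stderr, "No IB devices found\n"); return 1; }
    ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }
    pd = ibv_alloc_pd(ctx);
    if (!pd) { fprintf(stderr, "ibv_alloc_pd failed\n"); return 1; }

    memset(&attr, 0, sizeof(attr));
    attr.attr.max_wr  = 512;   /* illustrative sizes, not the BTL's exact values */
    attr.attr.max_sge = 1;

    srq = ibv_create_srq(pd, &attr);
    if (srq) {
        printf("%s: ibv_create_srq() succeeded; SRQ appears to be supported\n",
               ibv_get_device_name(devs[0]));
        ibv_destroy_srq(srq);
    } else {
        perror("ibv_create_srq() failed");
    }

    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
--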
Re: [OMPI users] OpenIB Error in ibv_create_srq
Hi: In response to my own question, by studying the file mca-btl-openib-device-params.ini, I discovered that this option in OMPI-1.4.2: -mca btl_openib_receive_queues P,65536,256,192,128 was sufficient to prevent OMPI from trying to create shared receive queues and allowed my application to run to completion using the IB hardware. I guess my question now is: What do these numbers mean? Presumably the size (or counts?) of buffers to allocate? Are there limits or a way to tune these values? Thanks, Allen On Mon, 2010-08-02 at 12:49 -0400, Allen Barnett wrote: > Hi Terry: > It is indeed the case that the openib BTL has not been initialized. I > ran with your tcp-disabled MCA option and it aborted in MPI_Init. > > The OFED stack is what's included in RHEL4. It appears to be made up of > the RPMs: > openib-1.4-1.el4 > opensm-3.2.5-1.el4 > libibverbs-1.1.2-1.el4 > > How can I determine if srq is supported? Is there an MCA option to > defeat it? (Our in-house cluster has more recent Mellanox IB hardware > and is running this same IB stack and ompi 1.4.2 works OK, so I suspect > srq is supported by the OpenFabrics stack. Perhaps.) > > Thanks, > Allen > > On Mon, 2010-08-02 at 06:47 -0400, Terry Dontje wrote: > > My guess is from the message below saying "(openib) BTL failed to > > initialize" that the code is probably running over tcp. To > > absolutely prove this you can specify to only use the openib, sm and > > self btls to eliminate the tcp btl. To do that you add the following > > to the mpirun line "-mca btl openib,sm,self". I believe with that > > specification the code will abort and not run to completion. > > > > What version of the OFED stack are you using? I wonder if srq is > > supported on your system or not? > > > > --td > > > > Allen Barnett wrote: > > > Hi: A customer is attempting to run our OpenMPI 1.4.2-based application > > > on a cluster of machines running RHEL4 with the standard OFED stack. The > > > HCAs are identified as: > > > > > > 03:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1) > > > 04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) > > > > > > ibv_devinfo says that one port on the HCAs is active but the other is > > > down: > > > > > > hca_id: mthca0 > > > fw_ver: 3.0.2 > > > node_guid: 0006:6a00:9800:4c78 > > > sys_image_guid: 0006:6a00:9800:4c78 > > > vendor_id: 0x066a > > > vendor_part_id: 23108 > > > hw_ver: 0xA1 > > > phys_port_cnt: 2 > > > port: 1 > > > state: active (4) > > > max_mtu:2048 (4) > > > active_mtu: 2048 (4) > > > sm_lid: 1 > > > port_lid: 26 > > > port_lmc: 0x00 > > > > > > port: 2 > > > state: down (1) > > > max_mtu:2048 (4) > > > active_mtu: 512 (2) > > > sm_lid: 0 > > > port_lid: 0 > > > port_lmc: 0x00 > > > > > > > > > When the OMPI application is run, it prints the error message: > > > > > > > > > The OpenFabrics (openib) BTL failed to initialize while trying to > > > create an internal queue. This typically indicates a failed > > > OpenFabrics installation, faulty hardware, or that Open MPI is > > > attempting to use a feature that is not supported on your hardware > > > (i.e., is a shared receive queue specified in the > > > btl_openib_receive_queues MCA parameter with a device that does not > > > support it?). 
The failure occured here: > > > > > > Local host: machine001.lan > > > OMPI > > > source: /software/openmpi-1.4.2/ompi/mca/btl/openib/btl_openib.c:250 > > > Function:ibv_create_srq() > > > Error: Invalid argument (errno=22) > > > Device: mthca0 > > > > > > You may need to consult with your system administrator to get this > > > problem fixed. > > > > > > > > > The full log of a run with "btl_openib_verbose 1" is attached. My > > > application appears to run to completion, but I can't tell if it's just > > > running on TCP and not using the IB hardware. > > > > > > I would appreciate any suggestions on how to proceed to fix this error. > > > > > > Thanks, > > > Allen > > > -- Allen Barnett Transpire, Inc E-Mail: al...@transpireinc.com Skype: allenbarnett Ph: 518-887-2930
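If that receive_queues value turns out to be the long-term workaround for this customer, it does not have to go on every mpirun line; the same setting can live in the MCA parameter file under the Open MPI installation prefix (a sketch; adjust the path to the actual install):

--
# <prefix>/etc/openmpi-mca-params.conf
btl_openib_receive_queues = P,65536,256,192,128
--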
Re: [OMPI users] OpenIB Error in ibv_create_srq
Thanks for the pointer! Do you know if these sizes are dependent on the hardware? Thanks, Allen On Tue, 2010-08-03 at 10:29 -0400, Terry Dontje wrote: > Sorry, I didn't see your prior question glad you found the > btl_openib_receive_queues parameter. There is not a faq entry for > this but I found the following in the openib btl help file that spells > out the parameters when using Per-peer receive queue (ie receive queue > setting with "P" as the first argument). > > Per-peer receive queues require between 2 and 5 parameters: > > 1. Buffer size in bytes (mandatory) > 2. Number of buffers (mandatory) > 3. Low buffer count watermark (optional; defaults to (num_buffers / > 2)) > 4. Credit window size (optional; defaults to (low_watermark / 2)) > 5. Number of buffers reserved for credit messages (optional; > defaults to (num_buffers*2-1)/credit_window) > > Example: P,128,256,128,16 > - 128 byte buffers > - 256 buffers to receive incoming MPI messages > - When the number of available buffers reaches 128, re-post 128 more > buffers to reach a total of 256 > - If the number of available credits reaches 16, send an explicit > credit message to the sender > - Defaulting to ((256 * 2) - 1) / 16 = 31; this many buffers are > reserved for explicit credit messages > > --td > Allen Barnett wrote: > > Hi: In response to my own question, by studying the file > > mca-btl-openib-device-params.ini, I discovered that this option in > > OMPI-1.4.2: > > > > -mca btl_openib_receive_queues P,65536,256,192,128 > > > > was sufficient to prevent OMPI from trying to create shared receive > > queues and allowed my application to run to completion using the IB > > hardware. > > > > I guess my question now is: What do these numbers mean? Presumably the > > size (or counts?) of buffers to allocate? Are there limits or a way to > > tune these values? > > > > Thanks, > > Allen > > > > On Mon, 2010-08-02 at 12:49 -0400, Allen Barnett wrote: > > > > > Hi Terry: > > > It is indeed the case that the openib BTL has not been initialized. I > > > ran with your tcp-disabled MCA option and it aborted in MPI_Init. > > > > > > The OFED stack is what's included in RHEL4. It appears to be made up of > > > the RPMs: > > > openib-1.4-1.el4 > > > opensm-3.2.5-1.el4 > > > libibverbs-1.1.2-1.el4 > > > > > > How can I determine if srq is supported? Is there an MCA option to > > > defeat it? (Our in-house cluster has more recent Mellanox IB hardware > > > and is running this same IB stack and ompi 1.4.2 works OK, so I suspect > > > srq is supported by the OpenFabrics stack. Perhaps.) > > > > > > Thanks, > > > Allen > > > > > > On Mon, 2010-08-02 at 06:47 -0400, Terry Dontje wrote: > > > > > > > My guess is from the message below saying "(openib) BTL failed to > > > > initialize" that the code is probably running over tcp. To > > > > absolutely prove this you can specify to only use the openib, sm and > > > > self btls to eliminate the tcp btl. To do that you add the following > > > > to the mpirun line "-mca btl openib,sm,self". I believe with that > > > > specification the code will abort and not run to completion. > > > > > > > > What version of the OFED stack are you using? I wonder if srq is > > > > supported on your system or not? > > > > > > > > --td > > > > > > > > Allen Barnett wrote: > > > > > > > > > Hi: A customer is attempting to run our OpenMPI 1.4.2-based > > > > > application > > > > > on a cluster of machines running RHEL4 with the standard OFED stack. 
> > > > > The > > > > > HCAs are identified as: > > > > > > > > > > 03:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1) > > > > > 04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1) > > > > > > > > > > ibv_devinfo says that one port on the HCAs is active but the other is > > > > > down: > > > > > > > > > > hca_id: mthca0 > > > > > fw_ver: 3.0.2 > > > > > node_guid: 0006:6a00:9800:4c78 > > > > > sys_image_guid: 0006:6a00:9800:4c78 > > > > > v
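Reading the value that worked earlier in this thread, P,65536,256,192,128, against that description:

1. 65536-byte buffers
2. 256 buffers posted for incoming MPI messages
3. re-post when the available-buffer count drops to 192
4. send an explicit credit message once 128 credits accumulate
5. reserved credit-message buffers defaulting to (256*2 - 1) / 128 = 511 / 128 = 3 (integer division)

That works out to roughly 256 x 64 KB = 16 MB of receive buffering per connected peer, which is the main thing to keep in mind when sizing per-peer queues up or down.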
[OMPI users] Thanks
I just wanted to say "Thank You!" to the OpenMPI developers for the OPAL_PREFIX option :-) This has proved very helpful in getting my customers up and running with the least amount of effort on their part. I really appreciate it. Thanks, Allen -- Allen Barnett Transpire, Inc. e-mail: al...@transpireinc.com Ph: 518-887-2930
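For anyone who finds this message while trying to relocate an installation, the usual pattern is simply to point OPAL_PREFIX at wherever the Open MPI tree actually landed and extend PATH and LD_LIBRARY_PATH to match (paths below are hypothetical):

--
export OPAL_PREFIX=/opt/ourapp/openmpi
export PATH=$OPAL_PREFIX/bin:$PATH
export LD_LIBRARY_PATH=$OPAL_PREFIX/lib:$LD_LIBRARY_PATH
mpirun -np 4 ./solver
--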
[OMPI users] Possible Memcpy bug in MPI_Comm_split
Hi: I was running my OpenMPI 1.2.3 application under Valgrind and I observed this error message:

==14322== Source and destination overlap in memcpy(0x41F5BD0, 0x41F5BD8, 16)
==14322==    at 0x49070AD: memcpy (mc_replace_strmem.c:116)
==14322==    by 0x4A45CF4: ompi_ddt_copy_content_same_ddt (in /home/scratch/DMP/RHEL4-GCC4/lib/libmpi.so.0.0.0)
==14322==    by 0x7A6C386: ompi_coll_tuned_allgather_intra_bruck (in /home/scratch/DMP/RHEL4-GCC4/lib/openmpi/mca_coll_tuned.so)
==14322==    by 0x4A29FFE: ompi_comm_split (in /home/scratch/DMP/RHEL4-GCC4/lib/libmpi.so.0.0.0)
==14322==    by 0x4A4E322: MPI_Comm_split (in /home/scratch/DMP/RHEL4-GCC4/lib/libmpi.so.0.0.0)
==14322==    by 0x400A26: main (in /home/scratch/DMP/severian_tests/ompi/a.out)

Attached is a reduced code example. I run it like:

  mpirun -np 3 valgrind ./a.out

I only see this error if there is an odd number of processes! I don't know if this is really a problem or not, though. My OMPI application seems to work OK. However, the Linux man page for memcpy says copying between overlapping ranges is undefined.

Other details: x86_64 (one box, two dual-core Opterons), RHEL 4.5, OpenMPI-1.2.3 compiled with the RHEL-supplied GCC 4 (gcc4 (GCC) 4.1.1 20070105 (Red Hat 4.1.1-53)), Valgrind 3.2.3.

Thanks,
Allen
--
Allen Barnett
Transpire, Inc.
e-mail: al...@transpireinc.com
Ph: 518-887-2930

/* Reduced test case: each iteration splits MPI_COMM_WORLD, with only
 * rank c keeping a color other than MPI_UNDEFINED. */
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main ( int argc, char* argv[] )
{
  int rank, size, c;
  MPI_Comm* comms;
  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );
  MPI_Comm_size( MPI_COMM_WORLD, &size );
  comms = malloc( size * sizeof(MPI_Comm) );
  for ( c = 0; c < size; c++ ) {
    int color = MPI_UNDEFINED;
    if ( c == rank )
      color = 0;
    MPI_Comm_split( MPI_COMM_WORLD, color, 0, &comms[c] );
  }
  MPI_Finalize();
  free( comms );
  return 0;
}

Attachment: info.bz2 (application/bzip)
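If the overlap turns out to be benign (it is inside Open MPI's own datatype copy, not in user code), a Valgrind suppression keeps it from drowning out real problems in the application. A sketch, with frame names taken from the report above; the wildcarding may need adjusting for other valgrind versions:

--
# ompi-overlap.supp; use as: mpirun -np 3 valgrind --suppressions=ompi-overlap.supp ./a.out
{
   ompi_ddt_overlapping_memcpy
   Memcheck:Overlap
   fun:memcpy
   fun:ompi_ddt_copy_content_same_ddt
}
--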
Re: [OMPI users] Problem with X forwarding
[14] ./DistributedData [0x819dc21]
> > [vrc1:27394] *** End of error message ***
> > mpirun noticed that job rank 0 with PID 27394 on node exited on
> > signal 11 (Segmentation fault).
> >
> > Maybe I am not doing the xforwading properly, but has anyone ever
> > encountered the same problem, it works fine on one pc, and I read
> > the mailing list but I just don't know if my prob is similiar to
> > their, I even tried changing the DISPLAY env
> >
> > This is what I want to do
> >
> > my mpirun should run on 2 machines ( A and B ) and I should be able
> > to view the output ( on my PC ),
> > are there any specfic commands to use.
--
Allen Barnett
Transpire, Inc.
e-mail: al...@transpireinc.com
Ph: 518-887-2930
[OMPI users] Memchecker and Wait
Hi: I'm trying to use the memchecker/valgrind capability of OpenMPI 1.3.3 to help debug my MPI application. I noticed a rather odd thing: after Waiting on a Recv Request, valgrind declares my receive buffer as invalid memory. Is this just a fluke of valgrind, or is OMPI doing something internally?

This is on a 64-bit RHEL 5 system using GCC 4.3.2 and Valgrind 3.4.1.

Here is an example:
--
#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if ( size != 2 ) {
    if ( rank == 0 )
      printf("Please run with 2 processes.\n");
    MPI_Finalize();
    return 1;
  }

  if (rank == 0) {
    char buffer_in[100];
    MPI_Request req_in;
    MPI_Status status;
    memset( buffer_in, 1, sizeof(buffer_in) );
    MPI_Recv_init( buffer_in, 100, MPI_CHAR, 1, 123, MPI_COMM_WORLD, &req_in );
    MPI_Start( &req_in );
    printf( "Before wait: %p: %d\n", buffer_in, buffer_in[3] );
    printf( "Before wait: %p: %d\n", buffer_in, buffer_in[4] );
    MPI_Wait( &req_in, &status );
    printf( "After wait: %p: %d\n", buffer_in, buffer_in[3] );
    printf( "After wait: %p: %d\n", buffer_in, buffer_in[4] );
    MPI_Request_free( &req_in );
  } else {
    char buffer_out[100];
    memset( buffer_out, 2, sizeof(buffer_out) );
    MPI_Send( buffer_out, 100, MPI_CHAR, 0, 123, MPI_COMM_WORLD );
  }

  MPI_Finalize();
  return 0;
}
--

Doing "mpirun -np 2 -mca btl ^sm valgrind ./a.out" yields:

Before wait: 0x7ff0003b0: 1
Before wait: 0x7ff0003b0: 1
==15487==
==15487== Invalid read of size 1
==15487==    at 0x400C6B: main (waittest.c:30)
==15487==  Address 0x7ff0003b3 is on thread 1's stack
After wait: 0x7ff0003b0: 2
==15487==
==15487== Invalid read of size 1
==15487==    at 0x400C8B: main (waittest.c:31)
==15487==  Address 0x7ff0003b4 is on thread 1's stack
After wait: 0x7ff0003b0: 2

Also, if I run this program with the shared memory BTL active, valgrind reports several "conditional jump or move depends on uninitialized value"s in the SM BTL and about 24k lost bytes at the end (mostly from allocations in MPI_Init).

Thanks,
Allen
--
Allen Barnett
Transpire, Inc
E-Mail: al...@transpireinc.com
Skype: allenbarnett
Re: [OMPI users] Memchecker and Wait
Hi Shiqing: That is very clever to invalidate the buffer memory until the comm completes! However, I guess I'm still confused by my results. Lines 30 and 31 identified by valgrind are the lines after the Wait, and, if I comment out the prints before the Wait, I still get the valgrind errors on the "After wait" prints. If I add prints after the Request_free calls, then I no longer receive the valgrind errors when accessing "buffer_in" from that point on. So, it appears that the buffer is marked invalid until the request is freed. Perhaps I don't understand the sequence of events in MPI. I thought the buffer was ok to use after the Wait, and requests could be safely recycled. Or maybe valgrind is pointing to the wrong lines, however the addresses which it reports as invalid are exactly those in the buffer which are being accessed in the post-Wait prints. Here is snippet of a more instrumented example program with line numbers. -- 25 MPI_Recv_init( buffer_in, 100, MPI_CHAR, 1, 123, MPI_COMM_WORLD, &req_in ); 26 printf( "Before start: %p: %d\n", &buffer_in[0], buffer_in[0] ); 27 printf( "Before start: %p: %d\n", &buffer_in[1], buffer_in[1] ); 28 MPI_Start( &req_in ); 29 printf( "Before wait: %p: %d\n", &buffer_in[2], buffer_in[2] ); 30 printf( "Before wait: %p: %d\n", &buffer_in[3], buffer_in[3] ); 31 MPI_Wait( &req_in, &status ); 32 printf( "After wait: %p: %d\n", &buffer_in[4], buffer_in[4] ); 33 printf( "After wait: %p: %d\n", &buffer_in[5], buffer_in[5] ); 34 MPI_Request_free( &req_in ); 35 printf( "After free: %p: %d\n", &buffer_in[6], buffer_in[6] ); 36 printf( "After free: %p: %d\n", &buffer_in[7], buffer_in[7] ); -- And the valgrind output Before start: 0x7ff0003c0: 1 Before start: 0x7ff0003c1: 1 Before wait: 0x7ff0003c2: 1 Before wait: 0x7ff0003c3: 1 ==17395== ==17395== Invalid read of size 1 ==17395==at 0x400CB7: main (waittest.c:32) ==17395== Address 0x7ff0003c4 is on thread 1's stack After wait: 0x7ff0003c4: 2 ==17395== ==17395== Invalid read of size 1 ==17395==at 0x400CDB: main (waittest.c:33) ==17395== Address 0x7ff0003c5 is on thread 1's stack After wait: 0x7ff0003c5: 2 After free: 0x7ff0003c6: 2 After free: 0x7ff0003c7: 2 Here valgrind is complaining about the prints on line 32 and 33 and the memory addresses are consistent with buffer_in[4] and buffer_in[5]. So, I'm still puzzled. Thanks, Allen On Wed, 2009-08-12 at 10:31 +0200, Shiqing Fan wrote: > Hi Allen, > > The invalid reads come from line 30 and 31 of your code, and I guess > they are the two 'printf's before MPI_Wait. > > In Open MPI, when memchecker is enabled, OMPI marks the receive buffer > as invalid internally, immediately after receive starts for MPI semantic > checks, in this case, it just warns the users that they are accessing > the receive buffer before the receive has finished, which is not allowed > according to the MPI standard. > > For a non-blocking receive, the communication only completes after > MPI_Wait is called. After that point, the user buffers are declared > valid again, that's why the 'printf's after MPI_Wait don't cause any > warnings from Valgrind. Hope this helps. :-) > > > Regards, > Shiqing > > > Allen Barnett wrote: > > Hi: > > I'm trying to use the memchecker/valgrind capability of OpenMPI 1.3.3 to > > help debug my MPI application. I noticed a rather odd thing: After > > Waiting on a Recv Request, valgrind declares my receive buffer as > > invalid memory. Is this just a fluke of valgrind, or is OMPI doing > > something internally? 
> > > > This is on a 64-bit RHEL 5 system using GCC 4.3.2 and Valgrind 3.4.1. > > > > Here is an example: > > -- > > #include > > #include > > #include "mpi.h" > > > > int main(int argc, char *argv[]) > > { > > int rank, size; > > > > MPI_Init(&argc, &argv); > > MPI_Comm_size(MPI_COMM_WORLD, &size); > > MPI_Comm_rank(MPI_COMM_WORLD, &rank); > > > > if ( size != 2 ) { > > if ( rank == 0 ) > > printf("Please run with 2 processes.\n"); > > MPI_Finalize(); > > return 1; > > } > > > > if (rank == 0) { > > char buffer_in[100]; > > MPI_Request req_in; > > MPI_Status status; > > memset( buffer_in, 1, sizeof(buffer_in) ); > > MPI_Recv_init( buffer_in, 100, MPI_CHAR, 1, 123, MPI_COMM_WORLD, > > &req_in ); > &
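One way to narrow down whether the behavior is specific to persistent requests would be to swap the MPI_Recv_init/MPI_Start pair for a plain nonblocking receive. A sketch of that variant, with the same ranks, tag, and buffer sizes as the test above; if the "invalid read" reports disappear here, the marking is tied to how memchecker handles persistent requests rather than to MPI_Wait itself:

--
#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if ( size != 2 ) { MPI_Finalize(); return 1; }

  if (rank == 0) {
    char buffer_in[100];
    MPI_Request req_in;
    MPI_Status status;
    memset( buffer_in, 1, sizeof(buffer_in) );
    /* Non-persistent receive: no MPI_Start, no MPI_Request_free needed. */
    MPI_Irecv( buffer_in, 100, MPI_CHAR, 1, 123, MPI_COMM_WORLD, &req_in );
    MPI_Wait( &req_in, &status );
    /* Per the MPI standard the buffer is safe to read here. */
    printf( "After wait: %p: %d\n", (void*)&buffer_in[4], buffer_in[4] );
  } else {
    char buffer_out[100];
    memset( buffer_out, 2, sizeof(buffer_out) );
    MPI_Send( buffer_out, 100, MPI_CHAR, 0, 123, MPI_COMM_WORLD );
  }
  MPI_Finalize();
  return 0;
}
--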
[OMPI users] OpenMPI 1.3 Infiniband Hang
Hi: I recently tried to build my MPI application against OpenMPI 1.3.3. It worked fine with OMPI 1.2.9, but with OMPI 1.3.3, it hangs part way through. It does a fair amount of comm, but eventually it stops in a Send/Recv point-to-point exchange. If I turn off the openib btl, it runs to completion. Also, I built 1.3.3 with memchecker (which is very nice; thanks to everyone who worked on that!) and it runs to completion, even with openib active. Our cluster consists of dual dual-core opteron boxes with Mellanox MT25204 (InfiniHost III Lx) HCAs and a Mellanox MT47396 Infiniscale-III switch. We're running RHEL 4.8 which appears to include OFED 1.4. I've built everything using GCC 4.3.2. Here is the output from ibv_devinfo. "ompi_info --all" is attached. $ ibv_devinfo hca_id: mthca0 fw_ver: 1.1.0 node_guid: 0002:c902:0024:3284 sys_image_guid: 0002:c902:0024:3287 vendor_id: 0x02c9 vendor_part_id: 25204 hw_ver: 0xA0 board_id: MT_03B0140002 phys_port_cnt: 1 port: 1 state: active (4) max_mtu:2048 (4) active_mtu: 2048 (4) sm_lid: 1 port_lid: 1 port_lmc: 0x00 I'd appreciate any tips for debugging this. Thanks, Allen -- Allen Barnett Transpire, Inc E-Mail: al...@transpireinc.com Skype: allenbarnett Ph: 518-887-2930 ompinfo.gz Description: GNU Zip compressed data
Re: [OMPI users] OpenMPI 1.3 Infiniband Hang
Hi: Setting mpi_leave_pinned to 0 allows my application to run to completion when running with openib active. I realize that it's probably not going to help my application's performance, but since "ON" is the default, I'd like to understand what's happening. There's definitely a dependence on problem size: smaller problems run to completion while larger problems hang at different points in the code. Are there buffer sizes (or other BTL settings) I can adjust to understand my problem better? Thanks, Allen On Thu, 2009-08-13 at 10:11 +0300, Lenny Verkhovsky wrote: > Hi, > 1. > The Mellanox has a newer fw for those > HCAshttp://www.mellanox.com/content/pages.php?pg=firmware_table_IH3Lx > > I am not sure if it will help, but newer fw usually have some bug > fixes. > > 2. > try to disable leave_pinned during the run. It's on by default in > 1.3.3 > > Lenny. > > On Thu, Aug 13, 2009 at 5:12 AM, Allen Barnett > wrote: > Hi: > I recently tried to build my MPI application against OpenMPI > 1.3.3. It > worked fine with OMPI 1.2.9, but with OMPI 1.3.3, it hangs > part way > through. It does a fair amount of comm, but eventually it > stops in a > Send/Recv point-to-point exchange. If I turn off the openib > btl, it runs > to completion. Also, I built 1.3.3 with memchecker (which is > very nice; > thanks to everyone who worked on that!) and it runs to > completion, even > with openib active. > > Our cluster consists of dual dual-core opteron boxes with > Mellanox > MT25204 (InfiniHost III Lx) HCAs and a Mellanox MT47396 > Infiniscale-III > switch. We're running RHEL 4.8 which appears to include OFED > 1.4. I've > built everything using GCC 4.3.2. Here is the output from > ibv_devinfo. > "ompi_info --all" is attached. > $ ibv_devinfo > hca_id: mthca0 >fw_ver: 1.1.0 >node_guid: 0002:c902:0024:3284 >sys_image_guid: 0002:c902:0024:3287 >vendor_id: 0x02c9 >vendor_part_id: 25204 >hw_ver: 0xA0 >board_id: MT_03B0140002 >phys_port_cnt: 1 >port: 1 >state: active (4) >max_mtu:2048 (4) >active_mtu: 2048 (4) >sm_lid: 1 >port_lid: 1 >port_lmc: 0x00 > > I'd appreciate any tips for debugging this. > Thanks, > Allen
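For anyone else poking at the same hang, these are the knobs that look most relevant; parameter names are from the 1.3 series and are worth re-checking against the actual build before relying on them:

  ompi_info --param btl openib | grep -E 'eager_limit|max_send_size|receive_queues|flags'
  ompi_info --all | grep leave_pinned

  mpirun -np 16 -mca mpi_leave_pinned 0 ./app              # the workaround that already works here
  mpirun -np 16 -mca mpi_leave_pinned_pipeline 1 ./app     # alternative pinned-memory protocol
  mpirun -np 16 -mca btl_openib_eager_limit 4096 ./app     # smaller eager/rendezvous threshold
  mpirun -np 16 -mca btl_openib_max_send_size 65536 ./app  # smaller maximum fragment size

(The -np value and ./app are placeholders.) Varying the fragment sizes while watching where the larger problems hang may at least show whether the stall tracks message size.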
[OMPI users] Bus Error in ompi_free_list_grow
Hi: A customer is running our parallel application on an SGI Altix machine. They compiled OMPI 1.2.8 themselves. The Altix uses IB interfaces and they recently upgraded to OFED 1.3 (in SGI Propack 6). They are receiving a bus error in ompi_free_list_grow: [r1i0n0:01321] *** Process received signal *** [r1i0n0:01321] Signal: Bus error (7) [r1i0n0:01321] Signal code: (2) [r1i0n0:01321] Failing at address: 0x2b04ba07c4a0 [r1i0n0:01321] [ 0] /lib64/libpthread.so.0 [0x2b04b00cfc00] [r1i0n0:01321] [ 1] /usr/local/attila/severian-0.3.2-beta/lib/x86_64-Linux/libmpi.so.0(ompi_free_list_grow+0x14a) [0x2b04af7dc058] [r1i0n0:01321] [ 2] /usr/local/attila/severian-0.3.2-beta/lib/x86_64-Linux/openmpi/mca_btl_sm.so(mca_btl_sm_alloc+0x321) [0x2b04b38c8e35] [r1i0n0:01321] [ 3] /usr/local/attila/severian-0.3.2-beta/lib/x86_64-Linux/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x26d) [0x2b04b3378f91] [r1i0n0:01321] [ 4] /usr/local/attila/severian-0.3.2-beta/lib/x86_64-Linux/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x546) [0x2b04b3370c7e] [r1i0n0:01321] [ 5] /usr/local/attila/severian-0.3.2-beta/lib/x86_64-Linux/libmpi.so.0(MPI_Send+0x28) [0x2b04af814098] Here is some more information about the machine: SGI Altix ICE 8200 cluster; each node has two quad core Xeons with 16GB SUSE Linux Enterprise Server 10 Service Pack 2 GNU C Library stable release version 2.4 (20080421) gcc (GCC) 4.1.2 20070115 (SUSE Linux) SGI Propack 6 (just upgraded from Propack 5 SP3: changed from OFED 1.2 to 1.3) The output from ompi_info is attached. I would appreciate any help debugging this. Thanks, Allen -- Allen Barnett E-Mail: al...@transpireinc.com Skype: allenbarnett Ph: 518-887-2930 ompi_info.txt.bz2 Description: application/bzip
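Since the failing frames are all in the shared-memory BTL, one cheap differential test, the same BTL-selection trick used elsewhere in these threads, is to take the sm BTL out of the picture and see whether the bus error follows it. If it does, it is also worth checking how much space is free on the filesystem holding Open MPI's session directory: the sm BTL mmaps a backing file there, and a full filesystem under an mmap is a classic source of SIGBUS. Commands are illustrative:

  mpirun -np 8 -mca btl ^sm ./application   # run once with the shared-memory BTL disabled
  df -h /tmp                                # default location of the per-job session directory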
Re: [OMPI users] Handling output of processes
On Thu, 2009-01-22 at 06:33 -0700, Ralph Castain wrote:
> If you need to do this with a prior release... well, I'm afraid it
> won't work. :-)

As a quick hack for 1.2.x, I sometimes use this script to wrap my executable:
---
#!/bin/sh
# sompi.sh: Send each rank's output to a separate file.
# Note use of undocumented OMPI 1.2.x environment variables!
exec "$@" > "listing.$OMPI_MCA_ns_nds_num_procs.$OMPI_MCA_ns_nds_vpid"
---
Then do:

$ mpirun -np 3 ~allen/bin/sompi.sh parallel_program

As the processes run, you can "tail" the individual listing files to see what's happening. Of course, the working directory has to be writable and you have to find the machine and directory where the output is being redirected, and so on...

Allen

> On Jan 22, 2009, at 1:58 AM, jody wrote:
> > Hi
> > I have a small cluster consisting of 9 computers (8x2 CPUs, 1x4 CPUs).
> > I would like to be able to observe the output of the processes
> > separately during an mpirun.
> >
> > What i currently do is to apply the mpirun to a shell script which
> > opens a xterm for each process,
> > which then starts the actual application.
> >
> > This works, but is a bit complicated, e.g. finding the window you're
> > interested in among 19 others.
> >
> > So i was wondering is there a possibility to capture the processes'
> > outputs separately, so
> > i can make an application in which i can switch between the different
> > processor outputs?
> > I could imagine that could be done by wrapper applications which
> > redirect the output over a TCP
> > socket to a server application.
> >
> > But perhaps there is an easier way, or something like this alread
> > does exist?
> >
> > Thank You
> > Jody
--
Allen Barnett
E-Mail: al...@transpireinc.com
Skype: allenbarnett
Ph: 518-887-2930
Re: [OMPI users] Handling output of processes
On Sun, 2009-01-25 at 05:20 -0700, Ralph Castain wrote: > 2. redirect output of specified processes to files using the provided > filename appended with ".rank". You can do this for all ranks, or a > specified subset of them. A filename extension including both the comm size and the rank is helpful if, like me, you run several jobs out of the same directory with different numbers of processors. Say "listing.32.01" for rank 1, -np 32. (And, as always, padding numbers with zeros makes "ls" behave more sanely.) Thanks, Allen -- Allen Barnett E-Mail: al...@transpireinc.com Skype: allenbarnett Ph: 518-887-2930
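Folding that naming suggestion back into the 1.2.x wrapper from earlier in this thread gives something like the sketch below; it still leans on the same undocumented environment variables, so it is strictly a stopgap until the built-in redirection lands:

---
#!/bin/sh
# sompi.sh: send each rank's output to listing.<nprocs>.<rank>,
# with the rank zero-padded so "ls" sorts sensibly.
# Uses undocumented OMPI 1.2.x environment variables.
np=$OMPI_MCA_ns_nds_num_procs
rank=`printf "%02d" "$OMPI_MCA_ns_nds_vpid"`
exec "$@" > "listing.${np}.${rank}"
---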
[OMPI users] Spawn and OpenFabrics
On Tue, 2009-05-19 at 08:29 -0400, Jeff Squyres wrote:
> fork() support in OpenFabrics has always been dicey -- it can lead to
> random behavior like this. Supposedly it works in a specific set of
> circumstances, but I don't have a recent enough kernel on my machines
> to test.
>
> It's best not to use calls to system() if they can be avoided.
> Indeed, Open MPI v1.3.x will warn you if you create a child process
> after MPI_INIT when using OpenFabrics networks.

My C++ OMPI program uses system() to invoke an external mesh partitioner program after MPI_INIT is called. Sometimes (with frustrating randomness), on systems using OFED the system() call fails with EFAULT (Bad address). The Linux kernel appears to feel that execve() is being passed a string which isn't in the process' address space. The exec string is constructed immediately before calling system() like this:

  std::stringstream ss;
  ss << "partitioner_program " << COMM_WORLD_SIZE;
  system( ss.str().c_str() );

Could this behavior be related to this admonition? Also, would MPI_COMM_SPAWN suffer from the same difficulties?

Thanks,
Allen
--
Allen Barnett
E-Mail: al...@transpireinc.com
Skype: allenbarnett
Ph: 518-887-2930
Re: [OMPI users] Spawn and OpenFabrics
On Tue, 2009-06-02 at 12:27 -0400, Jeff Squyres wrote: > On Jun 2, 2009, at 11:37 AM, Allen Barnett wrote: > > > std::stringstream ss; > > ss << "partitioner_program " << COMM_WORLD_SIZE; > > system( ss.str().c_str() ); > > > > You'd probably see the same problem even if you strdup'ed the c_str() > and system()'ed that. > > What kernel are you using? I've seen it myself on my generic opteron RHEL 4 cluster with kernel 2.6.9-78.0.22; I can't really figure out which version of OFED it uses (maybe 1.2?). A customer has reported it on an Altix system with SLES 10.2 and kernel 2.6.16.60 with a version of OFED 1.3. > Does OMPI say that it has IBV fork support? > ompi_info --param btl openib --parsable | grep have_fork_support My RHEL4 system reports: MCA btl: parameter "btl_openib_want_fork_support" (current value: "-1") MCA btl: information "btl_openib_have_fork_support" (value: "1") as does the build installed on the Altix system. > Be sure to also see > http://www.open-mpi.org/faq/?category=openfabrics#ofa-fork We're using OMPI 1.2.8. > > Also, would MPI_COMM_SPAWN suffer from the same difficulties? > > > > > It shouldn't; we proxy the launch of new commands off to mpirun / > OMPI's run-time system. Specifically: the new process(es) are not > POSIX children of the process(es) that called MPI_COMM_SPAWN. Is a program started with MPI_COMM_SPAWN required to call MPI_INIT? I guess what I'm asking is if I will have to make my partitioner an OpenMPI program as well? Thanks, Allen -- Allen Barnett E-Mail: al...@transpireinc.com Skype: allenbarnett
Re: [OMPI users] Spawn and OpenFabrics
OK. I appreciate the suggestion and will definitely try it out. Thanks, Allen On Fri, 2009-06-05 at 10:14 -0400, Jeff Squyres wrote: > On Jun 2, 2009, at 3:26 PM, Allen Barnett wrote: > > I > > guess what I'm asking is if I will have to make my partitioner an > > OpenMPI program as well? > > > > > If you use MPI_COMM_SPAWN with the 1.2 series, yes. > > Another less attractive but functional solution would be to do what I > did for the new command notifier due in the OMPI v1.5 series > ("notifier" = subsystem to notify external agents when OMPI detects > something wrong, like write to the syslog, send an email, write to a > sysadmin mysql db, etc., "command" = plugin that simply forks and runs > whatever command you want). During MPI_INIT, the fork notifier pre- > forks a dummy process. This dummy process then waits for commands via > a pipe. When the parent (MPI process itself) wants to fork a child, > it sends the argv to exec down the pipe and has the child process > actually do the fork and exec. > > Proxying all the fork requests through a secondary process like this > avoids all the problems with registered memory in the child process. > This is icky, but it is an unfortunately necessity for OS-bypass/ > registration-based networks like OpenFabrics. > > In your case, you'd want to pre-fork before calling MPI_INIT. But the > rest of the technique is pretty much the same. > > Have a look at the code in this tree if it helps: > > > https://svn.open-mpi.org/trac/ompi/browser/trunk/orte/mca/notifier/command
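For reference, a bare-bones sketch of the pre-fork idea described above: fork a helper before MPI_Init, keep a pipe to it, and let the helper run external commands so the MPI process itself never forks after the verbs stack is initialized. This version is fire-and-forget (a real one would add a second pipe so the parent can wait for each command to complete), and "partitioner_program" is just the hypothetical name used earlier in this thread:

--
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include "mpi.h"

static int helper_fd = -1;

/* Fork the command-running helper. Must be called before MPI_Init. */
static void start_helper(void)
{
  int fds[2];
  pid_t pid;
  if (pipe(fds) != 0) { perror("pipe"); exit(1); }
  pid = fork();
  if (pid == 0) {                     /* child: run commands sent down the pipe */
    FILE *in = fdopen(fds[0], "r");
    char cmd[4096];
    close(fds[1]);
    while (fgets(cmd, sizeof(cmd), in))
      system(cmd);                    /* the helper never touches MPI or verbs */
    _exit(0);
  }
  close(fds[0]);
  helper_fd = fds[1];                 /* parent keeps the write end */
}

/* Ask the helper to run a shell command (no completion handshake here). */
static void run_command(const char *cmd)
{
  write(helper_fd, cmd, strlen(cmd));
  write(helper_fd, "\n", 1);
}

int main(int argc, char *argv[])
{
  int size;
  char cmd[256];
  start_helper();                     /* before MPI_Init */
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  snprintf(cmd, sizeof(cmd), "partitioner_program %d", size);
  run_command(cmd);
  MPI_Finalize();
  return 0;
}
--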
[OMPI users] Hang with Mixed Machines
Hi: I have a "cluster" consisting of a dual Opteron system (called a.lan) and a dual AthlonMP system (b.lan). Both systems are running Red Hat Enterprise Linux 4. The opteron system runs in 64-bit mode; the AthlonMP in 32-bit. I can't seem to make OpenMPI work between these two machines. I've tried 1.1.2, 1.1.3b1, and 1.2b1 and they all exhibit the same behavior, namely that Bcasts won't complete. Here's my simple.cpp test program: #include #include "mpi.h" int main ( int argc, char* argv[] ) { MPI_Init( &argc, &argv ); char hostname[256]; int hostname_size = sizeof(hostname); MPI_Get_processor_name( hostname, &hostname_size ); std::cout << "Running on " << hostname << std::endl; std::cout << hostname << " in to Bcast" << std::endl; double a = 3.14159; MPI_Bcast( &a, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD ); std::cout << hostname << " out of Bcast" << std::endl; MPI_Finalize(); return 0; } I compile this and run it with "mpirun --host a.lan --host b.lan simple". Generally, if I'm on a.lan, I see: Running on a.lan a.lan in to Bcast Running on b.lan a.lan out of Bcast b.lan in to Bcast If I launch from b.lan, then the reverse happens (i.e., it exits the Bcast on b.lan, but never exits Bcast on a.lan and a.lan uses 100% cpu). On the other hand, I have another 32-bit system (just a plain Athlon running RHEL 4, called c.lan). My test program runs fine between b.lan and c.lan. I feel like I must be making an incredibly obvious mistake. Thanks, Allen -- Allen Barnett Transpire, Inc. E-Mail: al...@transpireinc.com Ph: 518-887-2930
[OMPI users] Relocating an Installation
Hi: There was a thread back in November started by Patrick Jessee about relocating an installation after it was built (the subject was: removing hard-coded paths from OpenMPI shared libraries). I guess I'm in the same boat now. I would like to distribute our OpenMPI-based parallel solver; but I can't really dictate where a user will install our software. Has any one succeeded in building a version of OpenMPI which can be relocated? Thanks, Allen -- Allen Barnett Transpire, Inc. E-Mail: al...@transpireinc.com Ph: 518-887-2930
Re: [OMPI users] Relocating an Installation
Upon reflecting on this more, I guess I see two issues. First, there's the issue of allowing the user to install our software on the system where ever they like. Some users may want to install it in their home directories, others may have a sys admin install it in a common location. This seems like a substantial reason to allow an OMPI installation to be relocated. So, I would say this was a very important capability. On the other hand, I don't have access to all the third party headers and libraries which are necessary to build some of the more interesting OMPI modules, such as the Infiniband and Myrinet drivers and many of the batch scheduling drivers (tm? LoadLeveler? PORTALS? Xgrid? [I'm not sure what these are]. And maybe a related question: One customer uses NQS (NQE?); can this be supported?) So, I would expect the user may want to compile at least some of OMPI himself (or herself) in order to activate these modules. Thus, perhaps I should supply a partially built installation which completes compilation as part of the installation process? This seems somewhat impractical since it would require compilers, headers and libraries, etc, on every machine on which our software is installed. I also don't know what distribution restrictions are placed on all the 3rd party software OMPI can link against. This may limit what can be redistributed with our product. So, I guess I'm open to suggestions on how best to distribute our software. Being able to relocate an installation and being able to build specific modules at installation time would appear to be very helpful capabilities. Many thanks, Allen On Fri, 2006-12-15 at 19:45 -0500, Jeff Squyres wrote: > Greetings Allen. > > This problem has not yet been resolved, but I'm quite sure we have an > open ticket about this on our bug tracker. I just replied to Patrick > on-list about a related issue (his use of --prefix); I'd have to > think about this a bit more, but a solution proposed by one of the > other OMPI developers in internal conversations may fix both issues. > It only hasn't been coded up because we didn't prioritize it high. > > So my question to you is -- how high of a priority is this for you? > Part of what makes it into each OMPI release is driven by what users > want/need, so input like this helps us prioritize the work. > > Thanks! > > > On Dec 13, 2006, at 10:37 AM, Allen Barnett wrote: > > > There was a thread back in November started by Patrick Jessee about > > relocating an installation after it was built (the subject was: > > removing > > hard-coded paths from OpenMPI shared libraries). I guess I'm in the > > same > > boat now. I would like to distribute our OpenMPI-based parallel > > solver; > > but I can't really dictate where a user will install our software. Has > > any one succeeded in building a version of OpenMPI which can be > > relocated? > -- Allen Barnett Transpire, Inc. E-Mail: al...@transpireinc.com Ph: 518-887-2930