Using today's SVN (1.3a1r17234) and building in the context of the installed OFED 1.2.5.4, it works!
Regards,
Mostyn

On Thu, 24 Jan 2008, Mostyn Lewis wrote:
Hello,

I have a very simple MPI program hanging in MPI_Reduce using the openmpi-1.2.4-1 supplied with OFED 1.2.5.4 (which we are also running). It works on the same hardware using the supplied mvapich (mvapich-0.9.9). The hardware is a Mellanox Technologies MT25418 [ConnectX IB DDR] (rev a0) HCA (SUN/Voltaire) and the switch is a Voltaire ISR9024D (running at DDR rate).

------------------------------------------------------------------------------
Switch software/firmware is:

ISR9024D-2c0c> version show
ISR 9024 version: 3.4.5 date: Oct 09 2007 11:46:00 AM build Id:467
ISR9024D-2c0c> module-firmware show
Anafa self address: lid 1 firmware 1.0.0
gid 0xfe800000000000000008f10400412c0c
------------------------------------------------------------------------------
HCA info:

/tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/bin$ ./ibv_devinfo
hca_id: mlx4_0
        fw_ver:                 2.2.000
        node_guid:              0003:ba00:0100:5cf0
        sys_image_guid:         0003:ba00:0100:5cf3
        vendor_id:              0x03ba
        vendor_part_id:         25418
        hw_ver:                 0xA0
        board_id:               SUN0060000001
        phys_port_cnt:          2
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       10
                        port_lmc:       0x00

                port:   2
                        state:          PORT_DOWN (1)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00

./ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0003:ba00:0100:5cf1
        base lid:        0xa
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)

The program is an old LAM test (cpi.c):
------------------------------------------------------------------------------
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <math.h>
#include <mpi.h>

/* Constant for how many values we'll estimate */
#define NUM_ITERS 1000

/* Prototype the function that we'll use below. */
static double f(double);

int main(int argc, char *argv[])
{
    int iter, rank, size, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    printf("Process %d of %d on %s\n", rank, size, processor_name);

    for (iter = 2; iter < NUM_ITERS; ++iter) {
        h = 1.0 / (double) iter;
        sum = 0.0;
        for (i = rank + 1; i <= iter; i += size) {
            x = h * ((double) i - 0.5);
            sum += f(x);
        }
        mypi = h * sum;
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

static double f(double a)
{
    return (4.0 / (1.0 + a * a));
}
------------------------------------------------------------------------------
The gcc openmpi hang from gdb looks like:

(gdb) where
#0  0x00002b60d54428e5 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00002b60d8705aec in mlx4_poll_cq (ibcq=0x5b0bf0, ne=1, wc=0x7fffd6051390) at src/cq.c:334
#2  0x00002b60d7c865bc in btl_openib_component_progress ()
    at /tmp/OFED-1.2.5.4/OFED/tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/include/infiniband/verbs.h:883
#3  0x00002b60d7b7925a in mca_bml_r2_progress () at bml_r2.c:106
#4  0x00002b60d4e6d11a in opal_progress () at runtime/opal_progress.c:288
#5  0x00002b60d7a6b8b8 in mca_pml_ob1_recv (addr=0x7fffd60517c8, count=1, datatype=0x501660, src=8,
    tag=-21, comm=<value optimized out>, status=0x0) at ../../../../opal/threads/condition.h:81
#6  0x00002b60d84e3cfa in ompi_coll_tuned_reduce_intra_basic_linear (sbuf=0x7fffd60517d0, rbuf=0x7fffd60517c8,
    count=1, dtype=0x501660, op=0x5012f0, root=<value optimized out>, comm=0x5014a0) at coll_tuned_reduce.c:385
#7  0x00002b60d4bcd32f in PMPI_Reduce (sendbuf=0x7fffd60517d0, recvbuf=0x7fffd60517c8, count=1,
    datatype=0x501660, op=0x5012f0, root=0, comm=0x5014a0) at preduce.c:96
#8  0x0000000000400cee in main ()

A pgi compiled hang from gdb looks like:

(gdb) where
#0  0x00002ac216e408e5 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00002ac2177ceaec in mlx4_poll_cq (ibcq=0x5b52c0, ne=1, wc=0x7fff97255600) at src/cq.c:334
#2  0x00002ac216bf51c2 in ibv_poll_cq ()
    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/openmpi/mca_btl_openib.so
#3  0x00002ac216bf8182 in btl_openib_component_progress ()
    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/openmpi/mca_btl_openib.so
#4  0x00002ac216ae9b24 in mca_bml_r2_progress ()
    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/openmpi/mca_bml_r2.so
#5  0x00002ac213d60be4 in opal_progress ()
    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/libopen-pal.so.0
#6  0x00002ac2169d4f45 in opal_condition_wait ()
    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/openmpi/mca_pml_ob1.so
#7  0x00002ac2169d5a83 in mca_pml_ob1_recv ()
    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/openmpi/mca_pml_ob1.so
#8  0x00002ac2175a1e67 in ompi_coll_tuned_reduce_intra_basic_linear ()
    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/openmpi/mca_coll_tuned.so
#9  0x00002ac217597ca5 in ompi_coll_tuned_reduce_intra_dec_fixed ()
    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/openmpi/mca_coll_tuned.so
#10 0x00002ac213a07e38 in PMPI_Reduce ()
    from /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/pgi/openmpi-1.2.4-1/lib64/libmpi.so.0
#11 0x0000000000402551 in main ()
------------------------------------------------------------------------------
The openmpi_gcc script was:

#!/bin/ksh
set -x
export PATH=/tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/gcc/openmpi-1.2.4-1/bin:$PATH
PREFIX="--prefix /tools/ofed/1.2.5.4/suse_sles_10_1/x86_64/xeon/mpi/gcc/openmpi-1.2.4-1"
MCA="-mca btl openib,self -mca btl_tcp_if_exclude lo,eth1 -mca oob_tcp_if_exclude lo,eth1"
mpicc cpi.c
mpirun $PREFIX $MCA -np 9 -machinefile ic48scali ./a.out
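For what it's worth, both backtraces go through mca_pml_ob1_recv underneath the tuned linear reduce before ending up spinning in mlx4_poll_cq, so one way to tell whether the plain point-to-point path stalls as well would be to swap the MPI_Reduce call in the loop for an explicit send/receive sum. A rough sketch of such a replacement (the helper name reduce_sum_to_root is made up for illustration and is not part of cpi.c):

#include <mpi.h>   /* already pulled in by cpi.c */

/* Sketch only: accumulate "mypi" from every rank onto rank 0 with plain
 * point-to-point calls instead of MPI_Reduce, roughly mimicking what the
 * linear reduce does internally. */
static void reduce_sum_to_root(double *mypi, double *pi, MPI_Comm comm)
{
    int rank, size, src;
    double contribution;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {
        *pi = *mypi;
        for (src = 1; src < size; ++src) {
            /* Receive one partial sum from each non-root rank. */
            MPI_Recv(&contribution, 1, MPI_DOUBLE, src, 99, comm,
                     MPI_STATUS_IGNORE);
            *pi += contribution;
        }
    } else {
        /* Non-root ranks just send their partial sum to rank 0. */
        MPI_Send(mypi, 1, MPI_DOUBLE, 0, 99, comm);
    }
}

Calling reduce_sum_to_root(&mypi, &pi, MPI_COMM_WORLD) in place of the MPI_Reduce line keeps the rest of cpi.c unchanged; if that variant hangs in the same spot, the collective layer itself is probably not the culprit.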
Any ideas who may be the culprit in this hang?

Regards,
Mostyn
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users