Hi, I am trying to use the latest 1.3 snapshots to test with BLCR;
however, I just noticed that sometime after 1.3a1r18423 the standard
MPICH sample code (cpi.c) stopped working on our rel4-based Myrinet
GM clusters, which raises some concern.
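For reference, the test program is just the standard MPICH cpi.c;
roughly, it does the following (a from-memory sketch of the logic,
not the exact file we compile):

#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[])
{
    int n, myid, numprocs, i, namelen;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    printf("Process %d of %d is on %s\n", myid, numprocs, processor_name);

    n = 10000;                           /* number of intervals */
    if (myid == 0)
        startwtime = MPI_Wtime();

    /* first communication: broadcast n from rank 0 -- this is where
       the backtrace below shows the segfault (PMPI_Bcast) */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* midpoint-rule integration of 4/(1+x^2) over [0,1] */
    h = 1.0 / (double) n;
    sum = 0.0;
    for (i = myid + 1; i <= n; i += numprocs) {
        x = h * ((double) i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myid == 0) {
        endwtime = MPI_Wtime();
        printf("pi is approximately %.16f, Error is %.16f\n",
               pi, fabs(pi - PI25DT));
        printf("wall clock time = %f\n", endwtime - startwtime);
    }
    MPI_Finalize();
    return 0;
}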

Please find attached: gm_board_info.out, ompi_info--all.out,
ompi_info--param-btl-gm.out and config-1.4a1r18743.log bundled
in mpi-output.tar.gz for your analysis.

The output below shows the sample code running with 1.3a1r18423 but
crashing with 1.3a1r18740; it likewise crashes with every snapshot
newer than 1.3a1r18423 that I have tested.  Open MPI was built with
both GNU 4.2.1 and 4.2.2, with the same results.  I have not tried
4.2.4 yet, but could do so next if this looks like a compiler-related
issue.  Note also that the code runs with both 1.2.6 and 1.2.7rc1 but
fails with all 1.4 snapshots from the development trunk.

I also tried the code on a Debian Myrinet/MX cluster, and in contrast
it runs without errors on all 1.3 and 1.4 releases, suggesting the
problem is some compatibility issue between Open MPI and the software
stack on our rel4 clusters.  Or perhaps some parameters simply need
to be passed to the mpirun command that I am not aware of?
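If it helps narrow things down, I can also rerun with the gm BTL
excluded so the job falls back to TCP (assuming I have the MCA
syntax right), e.g.:

# /opt/testing/openmpi/1.3a1r18740/bin/mpirun --mca btl ^gm -np 12 --host \
ic-bru25,ic-bru27,ic-bru29 a.out

If that runs cleanly, it would point at the GM BTL itself rather than
the launcher or the tuned collectives.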

1.3a1r18423
-----------
# /opt/testing/openmpi/1.3a1r18423/bin/mpicc cpi.c
# /opt/testing/openmpi/1.3a1r18423/bin/mpirun -np 12 --host \
ic-bru25,ic-bru27,ic-bru29 a.out
Process 4 of 12 is on bru27
Process 7 of 12 is on bru27
Process 10 of 12 is on bru27
Process 5 of 12 is on bru29
Process 8 of 12 is on bru29
Process 11 of 12 is on bru29
Process 0 of 12 is on bru25
Process 3 of 12 is on bru25
Process 9 of 12 is on bru25
Process 2 of 12 is on bru29
Process 1 of 12 is on bru27
Process 6 of 12 is on bru25
pi is approximately 3.1415926544231252, Error is 0.0000000008333321
wall clock time = 0.013029

1.3a1r18740
-----------
# /opt/testing/openmpi/1.3a1r18740/bin/mpicc cpi.c
# /opt/testing/openmpi/1.3a1r18740/bin/mpirun -np 12 --host \
ic-bru25,ic-bru27,ic-bru29 a.out
Process 5 of 12 is on bru29
Process 8 of 12 is on bru29
Process 11 of 12 is on bru29
Process 1 of 12 is on bru27
Process 4 of 12 is on bru27
Process 7 of 12 is on bru27
Process 10 of 12 is on bru27
Process 0 of 12 is on bru25
[bru25:11224] *** Process received signal ***
[bru25:11224] Signal: Segmentation fault (11)
[bru25:11224] Signal code: Address not mapped (1)
[bru25:11224] Failing at address: 0x9
Process 3 of 12 is on bru25
Process 6 of 12 is on bru25
Process 9 of 12 is on bru25
[bru25:11224] [ 0] /lib64/tls/libpthread.so.0 [0x38c090c420]
[bru25:11224] [ 1] /opt/testing/openmpi/1.3a1r18740/lib/openmpi/mca_btl_gm.so [0x2a9707ffb9]
[bru25:11224] [ 2] /opt/testing/openmpi/1.3a1r18740/lib/openmpi/mca_pml_ob1.so [0x2a96d71c1d]
[bru25:11224] [ 3] /opt/testing/openmpi/1.3a1r18740/lib/openmpi/mca_pml_ob1.so [0x2a96d66753]
[bru25:11224] [ 4] /opt/testing/openmpi/1.3a1r18740/lib/openmpi/mca_coll_tuned.so [0x2a97c7cb1c]
[bru25:11224] [ 5] /opt/testing/openmpi/1.3a1r18740/lib/openmpi/mca_coll_tuned.so [0x2a97c7ce27]
[bru25:11224] [ 6] /opt/testing/openmpi/1.3a1r18740/lib/openmpi/mca_coll_tuned.so [0x2a97c72eec]
[bru25:11224] [ 7] /opt/testing/openmpi/1.3a1r18740/lib/libmpi.so.0(PMPI_Bcast+0x13e) [0x2a9559f05e]
[bru25:11224] [ 8] a.out(main+0xd6) [0x400d0f]
[bru25:11224] [ 9] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x32ce51c4bb]
[bru25:11224] [10] a.out [0x400b7a]
[bru25:11224] *** End of error message ***
Process 2 of 12 is on bru29
[bru34:29907] --------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 11224 on node ic-bru25 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Any help much appreciated!
-Doug

Attachment: mpi-output.tar.gz
Description: GNU Zip compressed data
