Hi, I am trying to use the latest 1.3 snapshots to test with BLCR, but I just noticed that sometime after 1.3a1r18423 the standard MPICH sample code (cpi.c; a sketch is included below for reference) stopped working on our RHEL4-based Myrinet GM clusters, which raises some concern.
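For reference, the test program is essentially the classic cpi.c example that ships with MPICH. This is a minimal sketch reproduced from memory, so details may differ slightly from the exact file:

cpi.c (sketch)
--------------
/* Sketch of the standard MPICH cpi.c test, reproduced from memory. */
#include <mpi.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[])
{
    int n = 10000;                    /* number of intervals */
    int rank, size, i, namelen;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x, t0;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &namelen);
    printf("Process %d of %d is on %s\n", rank, size, name);

    t0 = MPI_Wtime();
    /* Rank 0 broadcasts the interval count -- this is the MPI_Bcast
       that shows up (as PMPI_Bcast) in the crash backtrace below. */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Each rank integrates its slice of 4/(1+x^2) over [0,1]. */
    h = 1.0 / (double) n;
    sum = 0.0;
    for (i = rank + 1; i <= n; i += size) {
        x = h * ((double) i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        printf("pi is approximately %.16f, Error is %.16f\n",
               pi, fabs(pi - PI25DT));
        printf("wall clock time = %f\n", MPI_Wtime() - t0);
    }
    MPI_Finalize();
    return 0;
}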
Please find attached gm_board_info.out, ompi_info--all.out, ompi_info--param-btl-gm.out, and config-1.4a1r18743.log, bundled in mpi-output.tar.gz, for your analysis. The transcripts below show the sample code running with 1.3a1r18423 but crashing with 1.3a1r18740; in fact it crashes with every snapshot newer than 1.3a1r18423 that I have tested. Both GCC 4.2.1 and 4.2.2 were used to build Open MPI and yield the same results. I have not tried 4.2.4 yet, but could do so next if this looks like a compiler-related issue. Note also that the code runs with both 1.2.6 and 1.2.7rc1 but fails in all 1.4 snapshots of the devel trunk. I also tried the code on a Debian Myrinet/MX cluster, where, in contrast, it runs without errors for all 1.3 and 1.4 releases, suggesting the problem is some compatibility issue between Open MPI and the software stack on our RHEL4 clusters. Or possibly some parameters simply need to be applied to the mpirun command that I am not aware of? (I describe one experiment along those lines after the transcripts below.)

1.3a1r18423
-----------
# /opt/testing/openmpi/1.3a1r18423/bin/mpicc cpi.c
# /opt/testing/openmpi/1.3a1r18423/bin/mpirun -np 12 --host \
    ic-bru25,ic-bru27,ic-bru29 a.out
Process 4 of 12 is on bru27
Process 7 of 12 is on bru27
Process 10 of 12 is on bru27
Process 5 of 12 is on bru29
Process 8 of 12 is on bru29
Process 11 of 12 is on bru29
Process 0 of 12 is on bru25
Process 3 of 12 is on bru25
Process 9 of 12 is on bru25
Process 2 of 12 is on bru29
Process 1 of 12 is on bru27
Process 6 of 12 is on bru25
pi is approximately 3.1415926544231252, Error is 0.0000000008333321
wall clock time = 0.013029

1.3a1r18740
-----------
# /opt/testing/openmpi/1.3a1r18740/bin/mpicc cpi.c
# /opt/testing/openmpi/1.3a1r18740/bin/mpirun -np 12 --host \
    ic-bru25,ic-bru27,ic-bru29 a.out
Process 5 of 12 is on bru29
Process 8 of 12 is on bru29
Process 11 of 12 is on bru29
Process 1 of 12 is on bru27
Process 4 of 12 is on bru27
Process 7 of 12 is on bru27
Process 10 of 12 is on bru27
Process 0 of 12 is on bru25
[bru25:11224] *** Process received signal ***
[bru25:11224] Signal: Segmentation fault (11)
[bru25:11224] Signal code: Address not mapped (1)
[bru25:11224] Failing at address: 0x9
Process 3 of 12 is on bru25
Process 6 of 12 is on bru25
Process 9 of 12 is on bru25
[bru25:11224] [ 0] /lib64/tls/libpthread.so.0 [0x38c090c420]
[bru25:11224] [ 1] /opt/testing/openmpi/1.3a1r18740/lib/openmpi/mca_btl_gm.so [0x2a9707ffb9]
[bru25:11224] [ 2] /opt/testing/openmpi/1.3a1r18740/lib/openmpi/mca_pml_ob1.so [0x2a96d71c1d]
[bru25:11224] [ 3] /opt/testing/openmpi/1.3a1r18740/lib/openmpi/mca_pml_ob1.so [0x2a96d66753]
[bru25:11224] [ 4] /opt/testing/openmpi/1.3a1r18740/lib/openmpi/mca_coll_tuned.so [0x2a97c7cb1c]
[bru25:11224] [ 5] /opt/testing/openmpi/1.3a1r18740/lib/openmpi/mca_coll_tuned.so [0x2a97c7ce27]
[bru25:11224] [ 6] /opt/testing/openmpi/1.3a1r18740/lib/openmpi/mca_coll_tuned.so [0x2a97c72eec]
[bru25:11224] [ 7] /opt/testing/openmpi/1.3a1r18740/lib/libmpi.so.0(PMPI_Bcast+0x13e) [0x2a9559f05e]
[bru25:11224] [ 8] a.out(main+0xd6) [0x400d0f]
[bru25:11224] [ 9] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x32ce51c4bb]
[bru25:11224] [10] a.out [0x400b7a]
[bru25:11224] *** End of error message ***
Process 2 of 12 is on bru29
[bru34:29907] --------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 11224 on node ic-bru25 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
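Since frame 1 of the backtrace is inside mca_btl_gm.so, one experiment I plan to try next (in case this really is just an mpirun parameter I am missing) is steering the job off the gm BTL with the standard MCA selection parameter, something like:

# /opt/testing/openmpi/1.3a1r18740/bin/mpirun --mca btl tcp,self,sm -np 12 \
    --host ic-bru25,ic-bru27,ic-bru29 a.out

If that runs cleanly it would at least suggest the crash is in the gm BTL itself rather than in the ob1 PML or the tuned collectives above it in the backtrace.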
Any help much appreciated!

-Doug