On Fri, 9 Sep 2011, Brice Goglin wrote:
On 09/09/2011 21:03, Kaizaad Bilimorya wrote:
We seem to have an issue similar to the one described in this thread:
"Bug in openmpi 1.5.4 in paffinity"
http://www.open-mpi.org/community/lists/users/2011/09/17151.php
We are using the following version of hwloc (from the EPEL repo; we run CentOS 5.6):
$ hwloc-info --version
hwloc-info 1.1rc6
Hello,
Note that Open MPI 1.5.4 uses its own embedded copy of hwloc 1.2.0.
Ok thanks, good to know.
Your own 1.1rc6 should actually work fine (does lstopo crash?), but OMPI
cannot use it :)
lstopo works. When we first got these chips I ran it (great tool, btw; it gave
me a better understanding of the chip architecture). It shows an
"interesting" picture for Magny-Cours (i.e. two dies per socket, along with two
NUMA nodes; yes, Magny-Cours is a strange beast). We knew this was the
case; it is just nice to see the diagram in all its glory:
http://www.sharcnet.ca/~kaizaad/orca/orca_lstopo.jpg
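
If it helps to cross-check what lstopo draws, below is a minimal sketch using the hwloc C API (written against the 1.x API that these releases ship; the file name and build line are my own assumptions, not something from this thread). It performs the same topology init/load that the crashing paffinity component runs, then counts sockets, NUMA nodes, and cores:

  /* topo_summary.c - assumed minimal example, hwloc 1.x API.
   * Build (assumed): gcc topo_summary.c -o topo_summary -lhwloc
   */
  #include <stdio.h>
  #include <hwloc.h>

  int main(void)
  {
      hwloc_topology_t topology;

      /* Same init/load sequence OMPI's paffinity component performs;
       * the backtrace below fails inside its copy of topology_load. */
      hwloc_topology_init(&topology);
      hwloc_topology_load(topology);

      printf("sockets:    %d\n", hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_SOCKET));
      printf("NUMA nodes: %d\n", hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_NODE));
      printf("cores:      %d\n", hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE));

      hwloc_topology_destroy(topology);
      return 0;
  }

Running something like that inside the same cpuset that crashes mpirun would show whether the system hwloc itself has trouble with the restricted topology, which is essentially Brice's "does lstopo crash?" question.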
A simple "mpi_hello" program works fine with cpusets and Open MPI 1.4.2,
but with Open MPI 1.5.4 and cpusets we get the following segfault
(it works fine on the node without cpusets enabled; a sketch of the test program follows the trace below):
[red2:28263] *** Process received signal ***
[red2:28263] Signal: Segmentation fault (11)
[red2:28263] Signal code: Address not mapped (1)
[red2:28263] Failing at address: 0x8
[red2:28263] [ 0] /lib64/libpthread.so.0 [0x2b3dce315b10]
[red2:28263] [ 1] /opt/sharcnet/openmpi/1.5.4/intel/lib/openmpi/mca_paffinity_hwloc.so(opal_paffinity_hwloc_bitmap_or+0x142) [0x2b3dcef75cb2]
[red2:28263] [ 2] /opt/sharcnet/openmpi/1.5.4/intel/lib/openmpi/mca_paffinity_hwloc.so [0x2b3dcef71404]
[red2:28263] [ 3] /opt/sharcnet/openmpi/1.5.4/intel/lib/openmpi/mca_paffinity_hwloc.so [0x2b3dcef6bb26]
[red2:28263] [ 4] /opt/sharcnet/openmpi/1.5.4/intel/lib/openmpi/mca_paffinity_hwloc.so(opal_paffinity_hwloc_topology_load+0xe2) [0x2b3dcef6e0b2]
[red2:28263] [ 5] /opt/sharcnet/openmpi/1.5.4/intel/lib/openmpi/mca_paffinity_hwloc.so [0x2b3dcef68b72]
[red2:28263] [ 6] /opt/sharcnet/openmpi/1.5.4/intel/lib/libopen-rte.so.3(mca_base_components_open+0x302) [0x2b3dcd2b08f2]
[red2:28263] [ 7] /opt/sharcnet/openmpi/1.5.4/intel/lib/libopen-rte.so.3(opal_paffinity_base_open+0x67) [0x2b3dcd2d3a87]
[red2:28263] [ 8] /opt/sharcnet/openmpi/1.5.4/intel/lib/libopen-rte.so.3(opal_init+0x71) [0x2b3dcd28bfb1]
[red2:28263] [ 9] /opt/sharcnet/openmpi/1.5.4/intel/lib/libopen-rte.so.3(orte_init+0x23) [0x2b3dcd2318f3]
[red2:28263] [10] /opt/sharcnet/openmpi/1.5.4/intel/bin/mpirun [0x4049b5]
[red2:28263] [11] /opt/sharcnet/openmpi/1.5.4/intel/bin/mpirun [0x404388]
[red2:28263] [12] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b3dce540994]
[red2:28263] [13] /opt/sharcnet/openmpi/1.5.4/intel/bin/mpirun [0x4042b9]
[red2:28263] *** End of error message ***
/var/spool/torque/mom_priv/jobs/968.SC: line 3: 28263 Segmentation fault      /opt/sharcnet/openmpi/1.5.4/intel/bin/mpirun -np 2 ./a.out
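
For completeness, the mpi_hello test is nothing more than a standard MPI hello world; the exact source isn't included in this thread, so the following is just an assumed sketch of an equivalent program:

  /* mpi_hello.c - assumed sketch of the trivial test program.
   * Build and run:
   *   mpicc mpi_hello.c          (produces ./a.out as in the command above)
   *   mpirun -np 2 ./a.out
   */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */

      printf("hello from rank %d of %d\n", rank, size);

      MPI_Finalize();
      return 0;
  }

Note that the backtrace above comes from mpirun itself (orte_init/opal_init while opening the paffinity component), so the crash happens before this program is even launched.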
Please let me know if you need more information about this issue.
This looks like the exact same issue. Did you try the patch(es) I sent
earlier?
See http://www.open-mpi.org/community/lists/users/2011/09/17159.php
If it's not enough, try adding the other patch from
http://www.open-mpi.org/community/lists/users/2011/09/17156.php
Brice
I'll do that now.
thanks a bunch
-k