Unfortunately, I do not have these compiler versions available to test this particular scenario.

I don't see anything obvious in the stack trace that would be causing a problem.

I'm assuming that /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/openmpi exists and is populated with all the components for the 1.2.1 installation (and no other plugins), right?

Can you run ompi_info, or does it also segv? (based on the stack trace, I'm guessing that it will -- this is the code that is trying to open Open MPI's plugins)
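
For reference, the two checks above could be sketched roughly as follows. This is only an illustration: the helper name list_components is hypothetical, the paths are derived from the configure prefix quoted below, and it assumes the components are installed as mca_*.so files directly inside the lib/openmpi directory.

```shell
# Hypothetical helper: count the plugin files in an Open MPI component directory.
# Usage: list_components <dir>
list_components() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    # Assumption: components are named mca_<framework>_<name>.so
    count=$(find "$dir" -maxdepth 1 -name 'mca_*.so' | wc -l | tr -d ' ')
    echo "$count components in $dir"
}

# Paths assumed from the 1.2.1 prefix in this thread; adjust as needed.
PLUGIN_DIR=/sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/openmpi
list_components "$PLUGIN_DIR" || true

# Run ompi_info from the same installation, if it is present.
OMPI_INFO=/sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/bin/ompi_info
if [ -x "$OMPI_INFO" ]; then
    "$OMPI_INFO" | head -n 20
else
    echo "ompi_info not found at $OMPI_INFO"
fi
```

If the directory also contains components left over from an older installation (1.2b3, for example), that would be worth mentioning.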


On May 8, 2007, at 12:27 AM, Luis Kornblueh wrote:

Hi everybody,

we've got some problems on our cluster with openmpi versions 1.2 and
upward.

The following setup does work:

openmpi-1.2b3: SLES 9 SP3 with gcc/g++ 4.1.1 and PGI f95 6.1-1

The following two setups give a SIGSEGV in mpiexec (see the stack trace below):

openmpi-1.2:   SLES 9 SP3 with gcc/g++ 4.1.1 and PGI f95 6.1-1
openmpi-1.2.1: SLES 9 SP3 with gcc/g++ 4.1.1 and PGI f95 6.1-1

All have been compiled with

export F77=pgf95
export FC=pgf95

./configure --prefix=/sw/sles9-x64/voltaire/openmpi-1.2b3-pgi \
            --enable-pretty-print-stacktrace \
            --with-libnuma=/usr \
            --with-mvapi=/usr \
            --with-mvapi-libdir=/usr/lib64

(with changing prefix, of course)

The stack trace:

Starting program: /scratch/work/system/sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/bin/mpiexec -host tornado1 --prefix=$MPIROOT -v -np 8 `pwd`/osu_bw
[Thread debugging using libthread_db enabled]
[New Thread 182906198784 (LWP 30805)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 182906198784 (LWP 30805)]
0x0000002a957f1b5b in _int_free () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
(gdb) where
#0  0x0000002a957f1b5b in _int_free () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#1  0x0000002a957f1e7d in free () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#2  0x0000002a95563b72 in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
#3  0x0000002a95fb51ec in __libc_dl_error_tsd () from /lib64/tls/libc.so.6
#4  0x0000002a95dba6ec in __pthread_initialize_minimal_internal () from /lib64/tls/libpthread.so.0
#5  0x0000002a95dba419 in call_initialize_minimal () from /lib64/tls/libpthread.so.0
#6  0x0000002a95ec9000 in ?? ()
#7  0x0000002a95db9fe9 in _init () from /lib64/tls/libpthread.so.0
#8  0x0000007fbfffe7c0 in ?? ()
#9  0x0000002a9556168d in call_init () from /lib64/ld-linux-x86-64.so.2
#10 0x0000002a9556179b in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#11 0x0000002a95fb39ac in dl_open_worker () from /lib64/tls/libc.so.6
#12 0x0000002a955612de in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#13 0x0000002a95fb3160 in _dl_open () from /lib64/tls/libc.so.6
#14 0x0000002a959413b5 in dlopen_doit () from /lib64/libdl.so.2
#15 0x0000002a955612de in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#16 0x0000002a959416fa in _dlerror_run () from /lib64/libdl.so.2
#17 0x0000002a95941362 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#18 0x0000002a957db2ee in vm_open () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#19 0x0000002a957d9645 in tryall_dlopen () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#20 0x0000002a957d981e in tryall_dlopen_module () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#21 0x0000002a957daab1 in try_dlopen () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#22 0x0000002a957dacd6 in lt_dlopenext () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#23 0x0000002a957e04f5 in open_component () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#24 0x0000002a957e0f60 in mca_base_component_find () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#25 0x0000002a957e189c in mca_base_components_open () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-pal.so.0
#26 0x0000002a956a6119 in orte_rds_base_open () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-rte.so.0
#27 0x0000002a95681d18 in orte_init_stage1 () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-rte.so.0
#28 0x0000002a95684eba in orte_system_init () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-rte.so.0
#29 0x0000002a9568179d in orte_init () from /sw/sles9-x64/voltaire/openmpi-1.2.1-pgi/lib/libopen-rte.so.0
#30 0x0000000000402a3a in orterun (argc=8, argv=0x7fbfffe778) at orterun.c:374
#31 0x00000000004028d3 in main (argc=8, argv=0x7fbfffe778) at main.c:13
(gdb) quit

In case access to our cluster could help, we would be happy to
provide an account.

Cheerio,
Luis
--
                             \\\\\\
                             (-0^0-)
--------------------------oOO--(_)--OOo-----------------------------

 Luis Kornblueh                           Tel. : +49-40-41173289
 Max-Planck-Institute for Meteorology     Fax. : +49-40-41173298
 Bundesstr. 53
 D-20146 Hamburg                   Email: luis.kornbl...@zmaw.de
 Federal Republic of Germany
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems
