Brian,

Will do immediately; don't ask why I didn't think of doing this myself. I seriously doubt this is an Open MPI bug to begin with, since this is a very common platform... But, as I said, this is a rather peculiar environment, and maybe Open MPI does something unionfs really doesn't like (if anyone else on the mailing list is using Open MPI with unionfs, it would be nice to know).

Eric

On Sunday 16 July 2006 14:31, Brian Barrett wrote:
> On Jul 15, 2006, at 2:58 PM, Eric Thibodeau wrote:
> > But, for some reason, on the Athlon node (in their image on the
> > server I should say) OpenMPI still doesn't seem to be built
> > correctly since it crashes as follows:
> >
> > kyron@node0 ~ $ mpirun -np 1 uptime
> > Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> > Failing at addr:(nil)
> > [0] func:/home/kyron/openmpi_i686/lib/libopal.so.0 [0xb7f6258f]
> > [1] func:[0xffffe440]
> > [2] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_init_stage1+0x1d7) [0xb7fa0227]
> > [3] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_system_init+0x23) [0xb7fa3683]
> > [4] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_init+0x5f) [0xb7f9ff7f]
> > [5] func:mpirun(orterun+0x255) [0x804a015]
> > [6] func:mpirun(main+0x22) [0x8049db6]
> > [7] func:/lib/tls/libc.so.6(__libc_start_main+0xdb) [0xb7de8f0b]
> > [8] func:mpirun [0x8049d11]
> > *** End of error message ***
> > Segmentation fault
> >
> > The crash happens both in the chrooted env and on the nodes. I
> > configured both systems to have Linux and POSIX threads, though I
> > see openmpi is calling the POSIX version (a message on the mailing
> > list had hinted at keeping the Linux threads around... I have to
> > anyway, since some apps like Matlab extensions still depend on
> > this...). The following is the output for the libc info.
>
> That's interesting... We regularly build Open MPI on 32 bit Linux
> machines (and in 32 bit mode on Opteron machines) without too much
> issue. It looks like we're jumping into a NULL pointer, which
> generally means that an ORTE framework failed to initialize itself
> properly. It would be useful if you could rebuild with debugging
> symbols (just add -g to CFLAGS when configuring) and run mpirun in
> gdb. If we can determine where the error is occurring, that would
> definitely help in debugging your problem.
>
> Brian
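
The rebuild-and-debug steps Brian describes could look roughly like the sketch below. It is only a sketch under assumptions: the install prefix is guessed from the library paths in the backtrace, the `run` and `debug_rebuild` helpers and the `DRY_RUN` switch are hypothetical names introduced here, and by default the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Assumption: install prefix inferred from the library paths in the backtrace.
PREFIX="${PREFIX:-/home/kyron/openmpi_i686}"

# Hypothetical helper: print each command by default;
# set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper; meant to be run from the top of the Open MPI source tree.
debug_rebuild() {
    # Reconfigure with debugging symbols, as suggested (-g added to CFLAGS).
    run ./configure --prefix="$PREFIX" CFLAGS=-g &&
    run make all install &&
    # Start the failing command under gdb. At the (gdb) prompt,
    # type "run", and after the segfault "bt" for a backtrace.
    run gdb --args "$PREFIX/bin/mpirun" -np 1 uptime
}

debug_rebuild
```

With symbols in place, the `bt` output should name the source file and line inside `orte_init_stage1` instead of a bare address.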