On 27/01/2011, at 4:51 PM, Michael Curtis wrote:

Some more debugging information:

> Failing case:
> michael@ipc ~ $ salloc -n8 mpirun --display-map ./mpi
> ========================   JOB MAP   ========================

Backtrace with debugging symbols:
#0  0x00007ffff7bb5c1e in ?? () from /usr/lib/libopen-rte.so.0
#1  0x00007ffff792e23f in ?? () from /usr/lib/libopen-pal.so.0
#2  0x00007ffff7920679 in opal_progress () from /usr/lib/libopen-pal.so.0
#3  0x00007ffff7bb6e5d in orte_plm_base_daemon_callback () from /usr/lib/libopen-rte.so.0
#4  0x00007ffff62b67e7 in plm_slurm_launch_job (jdata=<value optimised out>) at ../../../../../../orte/mca/plm/slurm/plm_slurm_module.c:360
#5  0x00000000004041c8 in orterun (argc=4, argv=0x7fffffffe7d8) at ../../../../../orte/tools/orterun/orterun.c:754
#6  0x0000000000403234 in main (argc=4, argv=0x7fffffffe7d8) at ../../../../../orte/tools/orterun/main.c:13

Trace output with -d100 and --enable-trace:
[:10821] progressed_wait: ../../../../../orte/mca/plm/base/plm_base_launch_support.c 459
[:10821] defining message event: ../../../../../orte/mca/plm/base/plm_base_launch_support.c 423

I'm guessing from this that it's crashing in the event loop, maybe in:
        static void process_orted_launch_report(int fd, short event, void *data)

strace:
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=13, events=POLLIN}], 6, 1000) = 1 ([{fd=13, revents=POLLIN}])
readv(13, [{"R\333\0\0\377\377\377\377R\333\0\0\377\377\377\377R\333\0\0\0\0\0\0\0\0\0\4\0\0\0\232"..., 36}], 1) = 36
readv(13, [{"R\333\0\0\377\377\377\377R\333\0\0\0\0\0\0\0\0\0\n\0\0\0\1\0\0\0u1390"..., 154}], 1) = 154
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=13, events=POLLIN}], 6, 0) = 0 (Timeout)
--- SIGSEGV (Segmentation fault) @ 0 (0) ---


OK, I matched the disassemblies and confirmed that the crash originates in 
process_orted_launch_report. Matching the source code against where gdb 
reckons the program counter was at that point gives this line:

    /* update state */
    pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
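
A segfault on that assignment would be consistent with either the sender's 
vpid lying outside the daemon array or the array slot for that vpid holding a 
NULL pointer. That is an assumption; nothing in the traces above confirms 
which, if either, is the case. As a minimal sketch of the two checks, using 
hypothetical stand-in types and names (vpid_t, struct proc, 
mark_daemon_running) rather than the real ORTE definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; NOT the real ORTE types. */
typedef unsigned int vpid_t;
enum proc_state { PROC_STATE_UNDEF = 0, PROC_STATE_RUNNING = 1 };
struct proc { enum proc_state state; };

/*
 * Mark the daemon with the given vpid as running.
 * Returns 0 on success, or -1 if the table is missing, the vpid is
 * out of range, or the slot for that vpid is NULL.
 */
int mark_daemon_running(struct proc **daemons, size_t count, vpid_t vpid)
{
    /* Guard 1: the index must fall inside the array. */
    if (daemons == NULL || (size_t)vpid >= count) {
        fprintf(stderr, "vpid %u out of range (have %zu daemons)\n",
                vpid, count);
        return -1;
    }

    /* Guard 2: the slot must point at an actual object. */
    if (daemons[vpid] == NULL) {
        fprintf(stderr, "no daemon object for vpid %u\n", vpid);
        return -1;
    }

    /* Equivalent of the "update state" assignment, now known to be safe. */
    daemons[vpid]->state = PROC_STATE_RUNNING;
    return 0;
}
```

Printing mev->sender.vpid and the array size in gdb just before the crash 
would show whether either guard would have fired.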

Hopefully all this information helps a little!


