On 27/01/2011, at 4:51 PM, Michael Curtis wrote:

Some more debugging information:
> Failing case:
> michael@ipc ~ $ salloc -n8 mpirun --display-map ./mpi
> ======================== JOB MAP ========================

Backtrace with debugging symbols:

#0  0x00007ffff7bb5c1e in ?? () from /usr/lib/libopen-rte.so.0
#1  0x00007ffff792e23f in ?? () from /usr/lib/libopen-pal.so.0
#2  0x00007ffff7920679 in opal_progress () from /usr/lib/libopen-pal.so.0
#3  0x00007ffff7bb6e5d in orte_plm_base_daemon_callback () from /usr/lib/libopen-rte.so.0
#4  0x00007ffff62b67e7 in plm_slurm_launch_job (jdata=<value optimised out>) at ../../../../../../orte/mca/plm/slurm/plm_slurm_module.c:360
#5  0x00000000004041c8 in orterun (argc=4, argv=0x7fffffffe7d8) at ../../../../../orte/tools/orterun/orterun.c:754
#6  0x0000000000403234 in main (argc=4, argv=0x7fffffffe7d8) at ../../../../../orte/tools/orterun/main.c:13

Trace output with -d100 and --enable-trace:

[:10821] progressed_wait: ../../../../../orte/mca/plm/base/plm_base_launch_support.c 459
[:10821] defining message event: ../../../../../orte/mca/plm/base/plm_base_launch_support.c 423

I'm guessing from this that it's crashing in the event loop, maybe at:

static void process_orted_launch_report(int fd, short event, void *data)

strace:

poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=13, events=POLLIN}], 6, 1000) = 1 ([{fd=13, revents=POLLIN}])
readv(13, [{"R\333\0\0\377\377\377\377R\333\0\0\377\377\377\377R\333\0\0\0\0\0\0\0\0\0\4\0\0\0\232"..., 36}], 1) = 36
readv(13, [{"R\333\0\0\377\377\377\377R\333\0\0\0\0\0\0\0\0\0\n\0\0\0\1\0\0\0u1390"..., 154}], 1) = 154
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=13, events=POLLIN}], 6, 0) = 0 (Timeout)
--- SIGSEGV (Segmentation fault) @ 0 (0) ---

OK, I matched the disassemblies and confirmed that the crash originates in process_orted_launch_report, and therefore matched up the source code line with where gdb reckons the program counter was at that point:

/* update state */
pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;

Hopefully all this information helps a little!
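For what it's worth, a statement of that shape can fault in two ways: the vpid used as the index is past the end of the pointer array, or the slot at that index is NULL. I have not confirmed which (if either) applies here. The following is only a minimal sketch of a guarded version of that pattern; proc_t, proc_state_t and mark_daemon_running are hypothetical stand-ins I made up for illustration, not the real ORTE types or functions.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real ORTE types; names are illustrative only. */
typedef enum { PROC_STATE_UNDEF = 0, PROC_STATE_RUNNING = 1 } proc_state_t;

typedef struct {
    proc_state_t state;
} proc_t;

/*
 * Mark the daemon with the given vpid as running.
 * Returns 0 on success, -1 if the array is NULL, vpid is out of range,
 * or the slot is NULL. Those are the cases in which the unguarded
 * statement procs[vpid]->state = ... would dereference a bad pointer.
 */
static int mark_daemon_running(proc_t **procs, size_t num_procs, size_t vpid)
{
    if (procs == NULL || vpid >= num_procs) {
        fprintf(stderr, "vpid %zu out of range (have %zu daemons)\n",
                vpid, num_procs);
        return -1;
    }
    if (procs[vpid] == NULL) {
        fprintf(stderr, "no daemon object for vpid %zu\n", vpid);
        return -1;
    }
    procs[vpid]->state = PROC_STATE_RUNNING;
    return 0;
}
```

Calling it with an array of two slots where only slot 0 is populated would succeed for vpid 0 and return -1 for vpid 1 (NULL slot) and for any vpid of 2 or more (out of range), instead of segfaulting.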