[OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"
Hi all,

I am using the latest version of OpenMPI (1.5.1) and BLCR (0.8.2). I found that when running an application which uses MPI_Isend, MPI_Irecv and MPI_Wait with C/R enabled, i.e. using "-am ft-enable-cr", the application runtime is much longer than the normal execution with mpirun (no checkpoint was taken). This overhead becomes larger as the normal execution runtime gets longer.

Does anybody have any idea about this overhead, and how to eliminate it? Thanks.

Regards,
Nguyen
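For reference, here is a minimal sketch of the kind of non-blocking pattern described above (a hypothetical test case, not the poster's actual application). It can be run once with a plain mpirun and once with "mpirun -am ft-enable-cr" to compare wall-clock times:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, peer;
    int buf_out, buf_in = -1;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buf_out = rank;
    peer = rank ^ 1;                 /* pair up neighbouring ranks */

    if (peer < size) {
        /* Non-blocking send/receive followed by MPI_Wait, as named above. */
        MPI_Isend(&buf_out, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&buf_in, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);
        MPI_Wait(&reqs[1], MPI_STATUS_IGNORE);
        printf("rank %d received %d from rank %d\n", rank, buf_in, peer);
    }

    MPI_Finalize();
    return 0;
}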
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
Hi Michael,

You may have tried to send some debug information to the list, but it appears to have been blocked. Compressed text output of the backtrace text is sufficient.

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Feb 7, 2011, at 8:38 AM, Samuel K. Gutierrez wrote:

Hi,

A detailed backtrace from a core dump may help us debug this. Would you be willing to provide that information for us?

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote:

On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:

Hi,

I just tried to reproduce the problem that you are experiencing and was unable to.

SLURM 2.1.15
Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas

I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same platform file (the only change was to re-enable btl-tcp).

Unfortunately, the result is the same:

salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
salloc: Granted job allocation 145

JOB MAP

Data for node: Name: eng-ipc4.{FQDN}  Num procs: 8
 Process OMPI jobid: [6932,1] Process rank: 0
 Process OMPI jobid: [6932,1] Process rank: 1
 Process OMPI jobid: [6932,1] Process rank: 2
 Process OMPI jobid: [6932,1] Process rank: 3
 Process OMPI jobid: [6932,1] Process rank: 4
 Process OMPI jobid: [6932,1] Process rank: 5
 Process OMPI jobid: [6932,1] Process rank: 6
 Process OMPI jobid: [6932,1] Process rank: 7

Data for node: Name: ipc3  Num procs: 8
 Process OMPI jobid: [6932,1] Process rank: 8
 Process OMPI jobid: [6932,1] Process rank: 9
 Process OMPI jobid: [6932,1] Process rank: 10
 Process OMPI jobid: [6932,1] Process rank: 11
 Process OMPI jobid: [6932,1] Process rank: 12
 Process OMPI jobid: [6932,1] Process rank: 13
 Process OMPI jobid: [6932,1] Process rank: 14
 Process OMPI jobid: [6932,1] Process rank: 15

=
[eng-ipc4:31754] *** Process received signal ***
[eng-ipc4:31754] Signal: Segmentation fault (11)
[eng-ipc4:31754] Signal code: Address not mapped (1)
[eng-ipc4:31754] Failing at address: 0x8012eb748
[eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
[eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869) [0x7f81cf262869]
[eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338) [0x7f81cef93338]
[eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e) [0x7f81cef9397e]
[eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) [0x7f81cef9356f]
[eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress+0x89) [0x7f81cef87916]
[eng-ipc4:31754] [ 6] ~/../openmpi/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x13f) [0x7f81cf262e20]
[eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7) [0x7f81cf267ed7]
[eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
[eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
[eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f81ce14bc4d]
[eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
[eng-ipc4:31754] *** End of error message ***
salloc: Relinquishing job allocation 145
salloc: Job allocation 145 has been revoked.
zsh: exit 1     salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi

I've anonymised the paths and domain, otherwise pasted verbatim. The only odd thing I notice is that the launching machine uses its full domain name, whereas the other machine is referred to by the short name. Despite the FQDN, the domain does not exist in the DNS (for historical reasons), but does exist in the /etc/hosts file.

Any further clues would be appreciated. In case it may be relevant, core system versions are: glibc 2.11, gcc 4.4.3, kernel 2.6.32. One other point of difference may be that our environment is tcp (ethernet) based whereas the LANL test environment is not?

Michael
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
Another possibility to check - are you sure you are getting the same OMPI version on the backend nodes? When I see it work on local node, but fail multi-node, the most common problem is that you are picking up a different OMPI version due to path differences on the backend nodes.

On Feb 8, 2011, at 8:17 AM, Samuel K. Gutierrez wrote:

> Hi Michael,
>
> You may have tried to send some debug information to the list, but it appears to have been blocked. Compressed text output of the backtrace text is sufficient.
>
> Thanks,
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"
There are a few reasons why this might be occurring. Did you build with the '--enable-ft-thread' option? If so, it looks like I didn't move over the thread_sleep_wait adjustment from the trunk - the thread was being a bit too aggressive.

Try adding the following to your command line options, and see if it changes the performance:

  "-mca opal_cr_thread_sleep_wait 1000"

There are other places to look as well depending on how frequently your application communicates, how often you checkpoint, process layout, ... But usually the aggressive nature of the thread is the main problem.

Let me know if that helps.

-- Josh

On Feb 8, 2011, at 2:50 AM, Nguyen Toan wrote:

> Hi all,
>
> I am using the latest version of OpenMPI (1.5.1) and BLCR (0.8.2).

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
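Combining the suggestion above with the original invocation, the command line would look something like the following (the application name and process count are placeholders, not taken from the original messages):

  mpirun -am ft-enable-cr -mca opal_cr_thread_sleep_wait 1000 -np 4 ./your_mpi_app

As described above, the parameter makes the C/R helper thread less aggressive by having it sleep longer between checks, so it competes less with the application's own communication.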
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
On 09/02/2011, at 2:38 AM, Ralph Castain wrote:

> Another possibility to check - are you sure you are getting the same OMPI version on the backend nodes? When I see it work on local node, but fail multi-node, the most common problem is that you are picking up a different OMPI version due to path differences on the backend nodes.

It's installed as a system package, and the software set on all machines is managed by a configuration tool, so the machines should be identical. However, it may be worth checking the dependency versions, and I'll double check that the OMPI versions really do match.
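One quick way to double-check this (a hedged suggestion; it assumes ompi_info and the Open MPI binaries are on the default PATH of every node, which depends on the site installation) is to have each allocated node report the version it actually picks up, for example from inside the salloc allocation:

  srun ompi_info | grep "Open MPI:"
  srun which mpirun orted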
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:

> Hi Michael,
>
> You may have tried to send some debug information to the list, but it appears to have been blocked. Compressed text output of the backtrace text is sufficient.

Odd, I thought I sent it to you directly. In any case, here is the backtrace and some information from gdb:

$ salloc -n16 gdb -args mpirun mpi
(gdb) run
Starting program: /mnt/f1/michael/openmpi/bin/mpirun /mnt/f1/michael/home/ServerAdmin/mpi
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
342         pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
(gdb) bt
#0  0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
#1  0x778a7338 in event_process_active (base=0x615240) at event.c:651
#2  0x778a797e in opal_event_base_loop (base=0x615240, flags=1) at event.c:823
#3  0x778a756f in opal_event_loop (flags=1) at event.c:730
#4  0x7789b916 in opal_progress () at runtime/opal_progress.c:189
#5  0x77b76e20 in orte_plm_base_daemon_callback (num_daemons=2) at base/plm_base_launch_support.c:459
#6  0x77b7bed7 in plm_slurm_launch_job (jdata=0x610560) at plm_slurm_module.c:360
#7  0x00403f46 in orterun (argc=2, argv=0x7fffe7d8) at orterun.c:754
#8  0x00402fb4 in main (argc=2, argv=0x7fffe7d8) at main.c:13
(gdb) print pdatorted
$1 = (orte_proc_t **) 0x67c610
(gdb) print mev
$2 = (orte_message_event_t *) 0x681550
(gdb) print mev->sender.vpid
$3 = 4294967295
(gdb) print mev->sender
$4 = {jobid = 1721696256, vpid = 4294967295}
(gdb) print *mev
$5 = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x77dd4f40, obj_reference_count = 1, cls_init_file_name = 0x77bb9a78 "base/plm_base_launch_support.c", cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 "rml_oob_component.c", line = 279}

That vpid looks suspiciously like -1.

Further debugging:

Breakpoint 3, orted_report_launch (status=32767, sender=0x7fffe170, buffer=0x77b1a85f, tag=32767, cbdata=0x612d20) at base/plm_base_launch_support.c:411
411     {
(gdb) print sender
$2 = (orte_process_name_t *) 0x7fffe170
(gdb) print *sender
$3 = {jobid = 6822016, vpid = 0}
(gdb) continue
Continuing.
--
A daemon (pid unknown) died unexpectedly with status 1 while attempting to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
--

Program received signal SIGSEGV, Segmentation fault.
0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681550) at base/plm_base_launch_support.c:342
342         pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
(gdb) print mev->sender
$4 = {jobid = 1778450432, vpid = 4294967295}

The daemon probably died as I spent too long thinking about my gdb input ;)
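As a side note, 4294967295 is exactly the all-ones 32-bit value, i.e. what -1 becomes when stored in an unsigned 32-bit field, which matches the suspicion above. A trivial stand-alone check (illustrative only, not part of the original report):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t vpid = (uint32_t)-1;        /* -1 stored in an unsigned 32-bit field */
    printf("%u\n", vpid);                /* prints 4294967295, the value gdb showed */
    printf("%d\n", vpid == UINT32_MAX);  /* prints 1 */
    return 0;
}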
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
See below

On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:

> Odd, I thought I sent it to you directly. In any case, here is the backtrace and some information from gdb:
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
> 342         pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
> (gdb) print mev->sender.vpid
> $3 = 4294967295
> (gdb) print mev->sender
> $4 = {jobid = 1721696256, vpid = 4294967295}
> (gdb) print *mev
> $5 = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x77dd4f40, obj_reference_count = 1, cls_init_file_name = 0x77bb9a78 "base/plm_base_launch_support.c", cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 "rml_oob_component.c", line = 279}

The jobid and vpid look like the defined INVALID values, indicating that something is quite wrong. This would quite likely lead to the segfault.

From this, it would indeed appear that you are getting some kind of library confusion - the most likely cause of such an error is a daemon from a different version trying to respond, and so the returned message isn't correct.

Not sure why else it would be happening...you could try setting -mca plm_base_verbose 5 to get more debug output displayed on your screen, assuming you built OMPI with --enable-debug.

> That vpid looks suspiciously like -1.
>
> Further debugging:
>
> A daemon (pid unknown) died unexpectedly with status 1 while attempting to launch so we are aborting.
>
> The daemon probably died as I spent too long thinking about my gdb input ;)

I'm not sure why that would happen - there are no timers in the system, so it won't care how long it takes to initialize. I'm guessing this is another indicator of a library issue.
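Assembled from the commands already shown in this thread (the anonymised paths are the poster's), the suggested debug run would look something like:

  salloc -n16 ~/../openmpi/bin/mpirun -mca plm_base_verbose 5 --display-map ~/ServerAdmin/mpi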
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
On 09/02/2011, at 9:16 AM, Ralph Castain wrote:

> The jobid and vpid look like the defined INVALID values, indicating that something is quite wrong. This would quite likely lead to the segfault.
>
> From this, it would indeed appear that you are getting some kind of library confusion - the most likely cause of such an error is a daemon from a different version trying to respond, and so the returned message isn't correct.
>
> Not sure why else it would be happening...you could try setting -mca plm_base_verbose 5 to get more debug output displayed on your screen, assuming you built OMPI with --enable-debug.

Found the problem. It is a site configuration issue, which I'll need to find a workaround for.

[bio-ipc.{FQDN}:27523] mca:base:select:( plm) Query of component [slurm] set priority to 75
[bio-ipc.{FQDN}:27523] mca:base:select:( plm) Selected component [slurm]
[bio-ipc.{FQDN}:27523] mca: base: close: component rsh closed
[bio-ipc.{FQDN}:27523] mca: base: close: unloading component rsh
[bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: initial bias 27523 nodename hash 1936089714
[bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: final jobfam 31383
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:receive start comm
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching job [31383,1]
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:setup_job for job [31383,1]
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching on nodes ipc3
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: final top-level argv:
    srun --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=ipc3 orted -mca ess slurm -mca orte_ess_jobid 2056716288 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2056716288.0;tcp://lanip:37493;tcp://globalip:37493;tcp://lanip2:37493" -mca plm_base_verbose 20

I then inserted some printf's into the ess_slurm_module (rough and ready, I know, but I was in a hurry).

Just after initialisation (at around line 345) it prints:

    orte_ess_slurm: jobid 2056716288 vpid 1

So it gets that... I narrowed it down to the get_slurm_nodename function, as the method didn't proceed past that point.

At line 401:

    tmp = strdup(orte_process_info.nodename);
    printf( "Our node name == %s\n", tmp );

At line 409:

    for (i=0; NULL != names[i]; i++) {
        printf( "Checking %s\n", names[ i ]);

Result:

    Our node name == eng-ipc3.{FQDN}
    Checking ipc3

So it's down to the mismatch of the slurm name and the hostname. slurm really encourages you not to use the fully qualified hostname, and I'd prefer not to have to reconfigure the whole cluster.
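A rough stand-alone illustration of why the match fails (this is not the actual Open MPI matching code in ess_slurm_module.c; "example.com" stands in for the anonymised domain): a literal string comparison of the reported node name against the names SLURM hands back cannot succeed when the two differ, here the fully qualified "eng-ipc3.{FQDN}" versus the short SLURM name "ipc3".

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Hypothetical stand-in values mirroring the printf output above. */
    const char *nodename = "eng-ipc3.example.com";   /* orte_process_info.nodename */
    const char *names[]  = { "ipc3", NULL };          /* names from the SLURM nodelist */

    for (int i = 0; names[i] != NULL; i++) {
        printf("Checking %s -> %s\n", names[i],
               strcmp(nodename, names[i]) == 0 ? "match" : "no match");
    }
    return 0;
}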
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
I would personally suggest not reconfiguring your system simply to support a particular version of OMPI. The only difference between the 1.4 and 1.5 series wrt slurm is that we changed a few things to support a more recent version of slurm. It is relatively easy to backport that code to the 1.4 series, and it should be (mostly) backward compatible.

OMPI is agnostic wrt resource managers. We try to support all platforms, with our effort reflective of the needs of our developers and their organizations, and our perception of the relative size of the user community for a particular platform. Slurm is a fairly small community, mostly centered in the three DOE weapons labs, so our support for that platform tends to focus on their usage.

So, with that understanding...

Sam: can you confirm that 1.5.1 works on your TLCC machines?

I have created a ticket to upgrade the 1.4.4 release (due out any time now) with the 1.5.1 slurm support. Any interested parties can follow it here:

https://svn.open-mpi.org/trac/ompi/ticket/2717

Ralph

On Feb 8, 2011, at 6:23 PM, Michael Curtis wrote:

> Found the problem. It is a site configuration issue, which I'll need to find a workaround for.