[OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"
Hi all,

I am using the latest version of OpenMPI (1.5.1) and BLCR (0.8.2). I found that when running an application which uses MPI_Isend, MPI_Irecv and MPI_Wait with C/R enabled, i.e. using "-am ft-enable-cr", the application runtime is much longer than the normal execution with mpirun (no checkpoint was taken). This overhead becomes larger as the normal execution runtime gets longer.

Does anybody have any idea about this overhead, and how to eliminate it? Thanks.

Regards,
Nguyen
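For reference, here is a minimal sketch of the kind of non-blocking pattern described above (a hypothetical test case, not the poster's actual application). It can be run once with a plain mpirun and once with "mpirun -am ft-enable-cr" to compare wall-clock times:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, peer;
    int buf_out, buf_in = -1;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buf_out = rank;
    peer = rank ^ 1;                 /* pair up neighbouring ranks */

    if (peer < size) {
        /* Non-blocking send/receive followed by MPI_Wait, as named above. */
        MPI_Isend(&buf_out, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&buf_in, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);
        MPI_Wait(&reqs[1], MPI_STATUS_IGNORE);
        printf("rank %d received %d from rank %d\n", rank, buf_in, peer);
    }

    MPI_Finalize();
    return 0;
}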
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
Hi Michael,

You may have tried to send some debug information to the list, but it appears to have been blocked. Compressed text output of the backtrace text is sufficient.

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Feb 7, 2011, at 8:38 AM, Samuel K. Gutierrez wrote:

Hi,

A detailed backtrace from a core dump may help us debug this. Would you be willing to provide that information for us?

Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Feb 6, 2011, at 6:36 PM, Michael Curtis wrote:

On 04/02/2011, at 9:35 AM, Samuel K. Gutierrez wrote:

Hi,

I just tried to reproduce the problem that you are experiencing and was unable to.

SLURM 2.1.15
Open MPI 1.4.3 configured with: --with-platform=./contrib/platform/lanl/tlcc/debug-nopanasas

I compiled OpenMPI 1.4.3 (vanilla from source tarball) with the same platform file (the only change was to re-enable btl-tcp).

Unfortunately, the result is the same:

salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi
salloc: Granted job allocation 145

JOB MAP

Data for node: Name: eng-ipc4.{FQDN}  Num procs: 8
 Process OMPI jobid: [6932,1] Process rank: 0
 Process OMPI jobid: [6932,1] Process rank: 1
 Process OMPI jobid: [6932,1] Process rank: 2
 Process OMPI jobid: [6932,1] Process rank: 3
 Process OMPI jobid: [6932,1] Process rank: 4
 Process OMPI jobid: [6932,1] Process rank: 5
 Process OMPI jobid: [6932,1] Process rank: 6
 Process OMPI jobid: [6932,1] Process rank: 7

Data for node: Name: ipc3  Num procs: 8
 Process OMPI jobid: [6932,1] Process rank: 8
 Process OMPI jobid: [6932,1] Process rank: 9
 Process OMPI jobid: [6932,1] Process rank: 10
 Process OMPI jobid: [6932,1] Process rank: 11
 Process OMPI jobid: [6932,1] Process rank: 12
 Process OMPI jobid: [6932,1] Process rank: 13
 Process OMPI jobid: [6932,1] Process rank: 14
 Process OMPI jobid: [6932,1] Process rank: 15

=
[eng-ipc4:31754] *** Process received signal ***
[eng-ipc4:31754] Signal: Segmentation fault (11)
[eng-ipc4:31754] Signal code: Address not mapped (1)
[eng-ipc4:31754] Failing at address: 0x8012eb748
[eng-ipc4:31754] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f81ce4bf8f0]
[eng-ipc4:31754] [ 1] ~/../openmpi/lib/libopen-rte.so.0(+0x7f869) [0x7f81cf262869]
[eng-ipc4:31754] [ 2] ~/../openmpi/lib/libopen-pal.so.0(+0x22338) [0x7f81cef93338]
[eng-ipc4:31754] [ 3] ~/../openmpi/lib/libopen-pal.so.0(+0x2297e) [0x7f81cef9397e]
[eng-ipc4:31754] [ 4] ~/../openmpi/lib/libopen-pal.so.0(opal_event_loop+0x1f) [0x7f81cef9356f]
[eng-ipc4:31754] [ 5] ~/../openmpi/lib/libopen-pal.so.0(opal_progress+0x89) [0x7f81cef87916]
[eng-ipc4:31754] [ 6] ~/../openmpi/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x13f) [0x7f81cf262e20]
[eng-ipc4:31754] [ 7] ~/../openmpi/lib/libopen-rte.so.0(+0x84ed7) [0x7f81cf267ed7]
[eng-ipc4:31754] [ 8] ~/../home/../openmpi/bin/mpirun() [0x403f46]
[eng-ipc4:31754] [ 9] ~/../home/../openmpi/bin/mpirun() [0x402fb4]
[eng-ipc4:31754] [10] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f81ce14bc4d]
[eng-ipc4:31754] [11] ~/../openmpi/bin/mpirun() [0x402ed9]
[eng-ipc4:31754] *** End of error message ***
salloc: Relinquishing job allocation 145
salloc: Job allocation 145 has been revoked.
zsh: exit 1     salloc -n16 ~/../openmpi/bin/mpirun --display-map ~/ServerAdmin/mpi

I've anonymised the paths and domain, otherwise pasted verbatim. The only odd thing I notice is that the launching machine uses its full domain name, whereas the other machine is referred to by the short name. Despite the FQDN, the domain does not exist in the DNS (for historical reasons), but does exist in the /etc/hosts file.

Any further clues would be appreciated. In case it may be relevant, core system versions are: glibc 2.11, gcc 4.4.3, kernel 2.6.32. One other point of difference may be that our environment is tcp (ethernet) based whereas the LANL test environment is not?

Michael
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
Another possibility to check - are you sure you are getting the same OMPI version on the backend nodes? When I see it work on local node, but fail multi-node, the most common problem is that you are picking up a different OMPI version due to path differences on the backend nodes.

On Feb 8, 2011, at 8:17 AM, Samuel K. Gutierrez wrote:

> Hi Michael,
>
> You may have tried to send some debug information to the list, but it appears to have been blocked. Compressed text output of the backtrace text is sufficient.
>
> Thanks,
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory
Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"
There are a few reasons why this might be occurring. Did you build with the '--enable-ft-thread' option? If so, it looks like I didn't move over the thread_sleep_wait adjustment from the trunk - the thread was being a bit too aggressive.

Try adding the following to your command line options, and see if it changes the performance:

  "-mca opal_cr_thread_sleep_wait 1000"

There are other places to look as well depending on how frequently your application communicates, how often you checkpoint, process layout, ... But usually the aggressive nature of the thread is the main problem.

Let me know if that helps.

-- Josh

On Feb 8, 2011, at 2:50 AM, Nguyen Toan wrote:

> Hi all,
>
> I am using the latest version of OpenMPI (1.5.1) and BLCR (0.8.2).

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
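Combining the suggestion above with the original invocation, the command line would look something like the following (the application name and process count are placeholders, not taken from the original messages):

  mpirun -am ft-enable-cr -mca opal_cr_thread_sleep_wait 1000 -np 4 ./your_mpi_app

As described above, the parameter makes the C/R helper thread less aggressive by having it sleep longer between checks, so it competes less with the application's own communication.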
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
On 09/02/2011, at 2:38 AM, Ralph Castain wrote:

> Another possibility to check - are you sure you are getting the same OMPI version on the backend nodes? When I see it work on local node, but fail multi-node, the most common problem is that you are picking up a different OMPI version due to path differences on the backend nodes.

It's installed as a system package, and the software set on all machines is managed by a configuration tool, so the machines should be identical. However, it may be worth checking the dependency versions, and I'll double check that the OMPI versions really do match.
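One quick way to double-check this (a hedged suggestion; it assumes ompi_info and the Open MPI binaries are on the default PATH of every node, which depends on the site installation) is to have each allocated node report the version it actually picks up, for example from inside the salloc allocation:

  srun ompi_info | grep "Open MPI:"
  srun which mpirun orted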
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
On 09/02/2011, at 2:17 AM, Samuel K. Gutierrez wrote:

> Hi Michael,
>
> You may have tried to send some debug information to the list, but it appears to have been blocked. Compressed text output of the backtrace text is sufficient.

Odd, I thought I sent it to you directly. In any case, here is the backtrace and some information from gdb:

$ salloc -n16 gdb -args mpirun mpi
(gdb) run
Starting program: /mnt/f1/michael/openmpi/bin/mpirun /mnt/f1/michael/home/ServerAdmin/mpi
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
342         pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
(gdb) bt
#0  0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
#1  0x778a7338 in event_process_active (base=0x615240) at event.c:651
#2  0x778a797e in opal_event_base_loop (base=0x615240, flags=1) at event.c:823
#3  0x778a756f in opal_event_loop (flags=1) at event.c:730
#4  0x7789b916 in opal_progress () at runtime/opal_progress.c:189
#5  0x77b76e20 in orte_plm_base_daemon_callback (num_daemons=2) at base/plm_base_launch_support.c:459
#6  0x77b7bed7 in plm_slurm_launch_job (jdata=0x610560) at plm_slurm_module.c:360
#7  0x00403f46 in orterun (argc=2, argv=0x7fffe7d8) at orterun.c:754
#8  0x00402fb4 in main (argc=2, argv=0x7fffe7d8) at main.c:13
(gdb) print pdatorted
$1 = (orte_proc_t **) 0x67c610
(gdb) print mev
$2 = (orte_message_event_t *) 0x681550
(gdb) print mev->sender.vpid
$3 = 4294967295
(gdb) print mev->sender
$4 = {jobid = 1721696256, vpid = 4294967295}
(gdb) print *mev
$5 = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x77dd4f40, obj_reference_count = 1, cls_init_file_name = 0x77bb9a78 "base/plm_base_launch_support.c", cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 "rml_oob_component.c", line = 279}

That vpid looks suspiciously like -1.

Further debugging:

Breakpoint 3, orted_report_launch (status=32767, sender=0x7fffe170, buffer=0x77b1a85f, tag=32767, cbdata=0x612d20) at base/plm_base_launch_support.c:411
411     {
(gdb) print sender
$2 = (orte_process_name_t *) 0x7fffe170
(gdb) print *sender
$3 = {jobid = 6822016, vpid = 0}
(gdb) continue
Continuing.
--
A daemon (pid unknown) died unexpectedly with status 1 while attempting to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
--

Program received signal SIGSEGV, Segmentation fault.
0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681550) at base/plm_base_launch_support.c:342
342         pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
(gdb) print mev->sender
$4 = {jobid = 1778450432, vpid = 4294967295}

The daemon probably died as I spent too long thinking about my gdb input ;)
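As a side note, 4294967295 is exactly the all-ones 32-bit value, i.e. what -1 becomes when stored in an unsigned 32-bit field, which matches the suspicion above. A trivial stand-alone check (illustrative only, not part of the original report):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t vpid = (uint32_t)-1;        /* -1 stored in an unsigned 32-bit field */
    printf("%u\n", vpid);                /* prints 4294967295, the value gdb showed */
    printf("%d\n", vpid == UINT32_MAX);  /* prints 1 */
    return 0;
}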
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
See below

On Feb 8, 2011, at 2:44 PM, Michael Curtis wrote:

> Odd, I thought I sent it to you directly. In any case, here is the backtrace and some information from gdb:
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x77b76869 in process_orted_launch_report (fd=-1, opal_event=1, data=0x681170) at base/plm_base_launch_support.c:342
> 342         pdatorted[mev->sender.vpid]->state = ORTE_PROC_STATE_RUNNING;
> (gdb) print mev->sender.vpid
> $3 = 4294967295
> (gdb) print mev->sender
> $4 = {jobid = 1721696256, vpid = 4294967295}
> (gdb) print *mev
> $5 = {super = {obj_magic_id = 16046253926196952813, obj_class = 0x77dd4f40, obj_reference_count = 1, cls_init_file_name = 0x77bb9a78 "base/plm_base_launch_support.c", cls_init_lineno = 423}, ev = 0x680850, sender = {jobid = 1721696256, vpid = 4294967295}, buffer = 0x6811b0, tag = 10, file = 0x680640 "rml_oob_component.c", line = 279}

The jobid and vpid look like the defined INVALID values, indicating that something is quite wrong. This would quite likely lead to the segfault.

From this, it would indeed appear that you are getting some kind of library confusion - the most likely cause of such an error is a daemon from a different version trying to respond, and so the returned message isn't correct.

Not sure why else it would be happening...you could try setting -mca plm_base_verbose 5 to get more debug output displayed on your screen, assuming you built OMPI with --enable-debug.

> That vpid looks suspiciously like -1.
>
> Further debugging:
>
> A daemon (pid unknown) died unexpectedly with status 1 while attempting to launch so we are aborting.
>
> The daemon probably died as I spent too long thinking about my gdb input ;)

I'm not sure why that would happen - there are no timers in the system, so it won't care how long it takes to initialize. I'm guessing this is another indicator of a library issue.
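Assembled from the commands already shown in this thread (the anonymised paths are the poster's), the suggested debug run would look something like:

  salloc -n16 ~/../openmpi/bin/mpirun -mca plm_base_verbose 5 --display-map ~/ServerAdmin/mpi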
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
On 09/02/2011, at 9:16 AM, Ralph Castain wrote:

> The jobid and vpid look like the defined INVALID values, indicating that something is quite wrong. This would quite likely lead to the segfault.
>
> From this, it would indeed appear that you are getting some kind of library confusion - the most likely cause of such an error is a daemon from a different version trying to respond, and so the returned message isn't correct.
>
> Not sure why else it would be happening...you could try setting -mca plm_base_verbose 5 to get more debug output displayed on your screen, assuming you built OMPI with --enable-debug.

Found the problem. It is a site configuration issue, which I'll need to find a workaround for.

[bio-ipc.{FQDN}:27523] mca:base:select:( plm) Query of component [slurm] set priority to 75
[bio-ipc.{FQDN}:27523] mca:base:select:( plm) Selected component [slurm]
[bio-ipc.{FQDN}:27523] mca: base: close: component rsh closed
[bio-ipc.{FQDN}:27523] mca: base: close: unloading component rsh
[bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: initial bias 27523 nodename hash 1936089714
[bio-ipc.{FQDN}:27523] plm:base:set_hnp_name: final jobfam 31383
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:receive start comm
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching job [31383,1]
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:base:setup_job for job [31383,1]
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: launching on nodes ipc3
[bio-ipc.{FQDN}:27523] [[31383,0],0] plm:slurm: final top-level argv:
    srun --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=ipc3 orted -mca ess slurm -mca orte_ess_jobid 2056716288 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2056716288.0;tcp://lanip:37493;tcp://globalip:37493;tcp://lanip2:37493" -mca plm_base_verbose 20

I then inserted some printf's into the ess_slurm_module (rough and ready, I know, but I was in a hurry).

Just after initialisation (at around line 345) it prints:

    orte_ess_slurm: jobid 2056716288 vpid 1

So it gets that... I narrowed it down to the get_slurm_nodename function, as the method didn't proceed past that point.

At line 401:

    tmp = strdup(orte_process_info.nodename);
    printf( "Our node name == %s\n", tmp );

At line 409:

    for (i=0; NULL != names[i]; i++) {
        printf( "Checking %s\n", names[ i ]);

Result:

    Our node name == eng-ipc3.{FQDN}
    Checking ipc3

So it's down to the mismatch of the slurm name and the hostname. slurm really encourages you not to use the fully qualified hostname, and I'd prefer not to have to reconfigure the whole cluster.
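A rough stand-alone illustration of why the match fails (this is not the actual Open MPI matching code in ess_slurm_module.c; "example.com" stands in for the anonymised domain): a literal string comparison of the reported node name against the names SLURM hands back cannot succeed when the two differ, here the fully qualified "eng-ipc3.{FQDN}" versus the short SLURM name "ipc3".

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Hypothetical stand-in values mirroring the printf output above. */
    const char *nodename = "eng-ipc3.example.com";   /* orte_process_info.nodename */
    const char *names[]  = { "ipc3", NULL };          /* names from the SLURM nodelist */

    for (int i = 0; names[i] != NULL; i++) {
        printf("Checking %s -> %s\n", names[i],
               strcmp(nodename, names[i]) == 0 ? "match" : "no match");
    }
    return 0;
}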
Re: [OMPI users] Segmentation fault with SLURM and non-local nodes
I would personally suggest not reconfiguring your system simply to support a particular version of OMPI. The only difference between the 1.4 and 1.5 series wrt slurm is that we changed a few things to support a more recent version of slurm. It is relatively easy to backport that code to the 1.4 series, and it should be (mostly) backward compatible.

OMPI is agnostic wrt resource managers. We try to support all platforms, with our effort reflective of the needs of our developers and their organizations, and our perception of the relative size of the user community for a particular platform. Slurm is a fairly small community, mostly centered in the three DOE weapons labs, so our support for that platform tends to focus on their usage.

So, with that understanding...

Sam: can you confirm that 1.5.1 works on your TLCC machines?

I have created a ticket to upgrade the 1.4.4 release (due out any time now) with the 1.5.1 slurm support. Any interested parties can follow it here:

https://svn.open-mpi.org/trac/ompi/ticket/2717

Ralph

On Feb 8, 2011, at 6:23 PM, Michael Curtis wrote:

> Found the problem. It is a site configuration issue, which I'll need to find a workaround for.