Re: [OMPI users] MPI debugger

2010-01-11 Thread Jed Brown
On Sun, 10 Jan 2010 19:29:18 +, Ashley Pittman  wrote:
> It'll show you parallel stack traces but won't let you single step for
> example.

Two lightweight options if you want stepping, breakpoints, watchpoints,
etc.

* Use serial debuggers on some interesting processes, for example with

mpiexec -n 1 xterm -e gdb --args ./trouble args : -n 2 ./trouble args : -n 1 xterm -e gdb --args ./trouble args

  to put an xterm on ranks 0 and 3 of a four-process job (there are lots
  of other ways to get here).

* MPICH2 has a poor man's parallel debugger: mpiexec.mpd -gdb allows you
  to send the same gdb commands to each process and collates the output.

Jed


[OMPI users] OpenMPI less fast than MPICH

2010-01-11 Thread Mathieu Gontier





Hi all

I want to migrate my CFD application from MPICH-1.2.4 (ch_p4 device) to
OpenMPI-1.4. Hence, I compared the two libraries compiled with my
application, and I noted that OpenMPI is less efficient than MPICH over
Ethernet (170 min with MPICH against 200 min with OpenMPI). So, I wonder
if someone has more information or an explanation.

Here are the configure options of OpenMPI:

export FC=gfortran
export F77=$FC
export CC=gcc
export PREFIX=/usr/local/bin/openmpi-1.4
./configure --prefix=$PREFIX --enable-cxx-exceptions --enable-mpi-f77
--enable-mpi-f90 --enable-mpi-cxx --enable-mpi-cxx-seek --enable-dist
--enable-mpi-profile --enable-binaries --enable-cxx-exceptions
--enable-mpi-threads --enable-memchecker --with-pic --with-threads
--with-valgrind --with-libnuma --with-openib

Although my OpenMPI build supports OpenIB, I did not specify any
mca/btl options because the machine does not have access to an
InfiniBand interconnect. So, I guess tcp, sm and self are used (or at
least something close).
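A quick way to confirm that, and to rule out interconnect probing as a factor in the timings, is to pin the BTLs explicitly on the command line; this is only a sketch, with the process count and application name as placeholders:

```shell
# Restrict Open MPI to the TCP, shared-memory and self transports,
# so no other BTL component is even considered at startup.
mpirun --mca btl tcp,sm,self -np 8 ./my_cfd_app
```

If the timings stay the same, the difference is in the TCP path itself rather than in transport selection.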

Thank you for your help.
Mathieu.





Re: [OMPI users] Problem with checkpointing multihosts, multiprocesses MPI application

2010-01-11 Thread Josh Hursey


On Dec 12, 2009, at 10:03 AM, Kritiraj Sajadah wrote:


Dear All,
I am trying to checkpoint an MPI application which has two
processes, each running on a separate host.


I run the application as follows:

raj@sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile sunhost -mca btl ^openib -mca snapc_base_global_snapshot_dir /tmp m.


Try setting the 'snapc_base_global_snapshot_dir' in your  
$HOME/.openmpi/mca-params.conf file instead of on the command line.  
This way it will be properly picked up by the ompi-restart commands.


See the link below for how to do this:
  http://www.osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-global
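For example, a minimal $HOME/.openmpi/mca-params.conf carrying the value from your command line would contain one key = value pair per line:

```
# $HOME/.openmpi/mca-params.conf
snapc_base_global_snapshot_dir = /tmp
```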



and I trigger the checkpoint as follows:

raj@sun32:~$ ompi-checkpoint -v 30010


The following happens, displaying two errors while checkpointing the
application:



##
I am processor no 0 of a total of 2 procs on host sun32
I am processor no 1 of a total of 2 procs on host sun06
I am processor no 0 of a total of 2 procs on host sun32
I am processor no 1 of a total of 2 procs on host sun06

[sun32:30010] Error: expected_component: PID information unavailable!
[sun32:30010] Error: expected_component: Component Name information unavailable!


The only way this error could be generated when checkpointing (versus  
restarting) is if the Snapshot Coordinator failed to propagate the CRS  
component used so that it could be stored in the metadata. If this  
continues to happen try enabling debugging in the snapshot coordinator:

 mpirun -mca snapc_full_verbose 20 ...



I am processor no 1 of a total of 2 procs on host sun06
I am processor no 0 of a total of 2 procs on host sun32
bye
bye





When I try to restart the application from the checkpointed file, I get
the following:


raj@sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
--
Error: The filename (opal_snapshot_1.ckpt) is invalid because either
you have not provided a filename or provided an invalid filename.
Please see --help for usage.

--
I am processor no 0 of a total of 2 procs on host sun32
bye


This usually indicates that either:
 1) The local checkpoint directory (opal_snapshot_1.ckpt) is missing,
meaning the global checkpoint is corrupted, or the node where rank 1
resided was not able to access the storage location (/tmp in your
example); or
 2) You moved the ompi_global_snapshot_30010.ckpt directory from /tmp
to somewhere else. Currently, manually moving the global checkpoint
directory is not supported.


-- Josh




I would very much appreciate it if you could give me some ideas on how to
checkpoint and restart an MPI application running on multiple hosts.


Thank you

Regards,

Raj



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] problem restarting multiprocess mpi application

2010-01-11 Thread Josh Hursey


On Dec 13, 2009, at 3:57 PM, Kritiraj Sajadah wrote:


Dear All,
I am running a simple MPI application which looks as follows:


##

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
int rank,size;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
printf("Hello\n");
sleep(15);
printf("Hello again\n" );
sleep(15);
printf("Final Hello\n");
sleep(15);
printf("bye \n");
MPI_Finalize();
return 0;
}
#

When I run my application as follows, it checkpoints correctly, but
when I try to restart it, it gives the following errors:


##

ompi-restart ompi_global_snapshot_380.ckpt
Hello again
[sun06:00381] *** Process received signal ***
[sun06:00381] Signal: Bus error (7)
[sun06:00381] Signal code:  (2)
[sun06:00381] Failing at address: 0xae7cb054
[sun06:00381] [ 0] [0xb7f8640c]
[sun06:00381] [ 1] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_progress+0x123) [0xb7b95456]
[sun06:00381] [ 2] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7bcb093]
[sun06:00381] [ 3] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7bcae97]
[sun06:00381] [ 4] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_crs_blcr_checkpoint+0x187) [0xb7bca69b]
[sun06:00381] [ 5] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_cr_inc_core+0xc3) [0xb7b970bd]
[sun06:00381] [ 6] /home/raj/openmpisof/lib/libopen-rte.so.0 [0xb7cab06f]
[sun06:00381] [ 7] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_cr_test_if_checkpoint_ready+0x129) [0xb7b96fca]
[sun06:00381] [ 8] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7b97698]

[sun06:00381] [ 9] /lib/libpthread.so.0 [0xb7ac4f3b]
[sun06:00381] [10] /lib/libc.so.6(clone+0x5e) [0xb7a4bbee]
[sun06:00381] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 399 on node sun06 exited on signal 7 (Bus error).

--
#


This could be caused by a variety of things, including a bad BLCR  
installation. :/


Are you sure that your application was between MPI_Init() and  
MPI_Finalize() when you checkpointed?



I am running it as follows:


mpirun -am ft-enable-cr -np 2 -mca btl ^openib -mca snapc_base_global_snapshot_dir /tmp mpisleepbas.




Try specifying the MCA parameters in your $HOME/.openmpi/mca-params.conf file.




Once a checkpoint is taken, I have to copy it to the home directory
and try to restart it.


The manual movement of the checkpoint file is not currently supported.  
I filed a bug about it if you want to track it:

  https://svn.open-mpi.org/trac/ompi/ticket/2161



Please note that if I use -np 1, it works fine when I restart it.
The problem arises mainly when the application has more than one
process running.


Are the processes on the same machines or different machines?

-- Josh




Any help will be much appreciated.


Raj










Re: [OMPI users] checkpoint opempi-1.3.3+sge62

2010-01-11 Thread Josh Hursey


On Dec 14, 2009, at 12:25 PM, Sergio Díaz wrote:


Hi Reuti,

Yes, I sent a job with SGE and I checkpointed the mpirun process by
hand, entering the MPI master node. Then I killed the job with qdel,
and after that I did the ompi-restart.
I will try to integrate it with SGE by creating a checkpoint
environment, but I think that it could be a bit difficult because:
 1 - when I do a checkpoint, I can't specify a directory with
a name like checkpoint_jobid
 2 - I can't specify the scratch directory, and I have to
use /tmp instead of SGE's scratch directory.
 3 - I tried to restart the snapshot, and it only works if I
use the same machine file. That is, if the job ran on c3-13 and
c3-14, I have to restart the job using a machine file with these two
nodes.


This is usually caused by prelink'ing interfering with BLCR. See the  
BLCR FAQ for how to disable this option:

  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
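For reference, the fix described there boils down to undoing prelink and disabling it on every node; this is only a hedged sketch (Red Hat-style paths assumed, consult your distribution's documentation):

```shell
# Undo prelinking already applied to installed binaries and libraries:
sudo /usr/sbin/prelink --undo --all
# Keep prelink from re-applying itself (Red Hat-style config file):
sudo sed -i 's/^PRELINKING=.*/PRELINKING=no/' /etc/sysconfig/prelink
```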

Let me know if that fixes this problem.

Josh



[sdiaz@svgd ~]$ ompi-restart -v -machinefile mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt
[svgd.cesga.es:28836] Checking for the existence of (/home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt)
[svgd.cesga.es:28836] Restarting from file (ompi_global_snapshot_12554.ckpt)

[svgd.cesga.es:28836] Exec in self
 tiempo  110
 Process1 :
 compute-3-14.local
of2
 tiempo  110
 Process0 :
 compute-3-13.local
   of2
 
--
mpirun noticed that process rank 1 with PID 8477 on node compute-3-15 exited on signal 11 (Segmentation fault).
 
--


To solve problem 1, there is a feature request opened by Josh
(https://svn.open-mpi.org/trac/ompi/ticket/2098).
To solve problem 2, there is a thread that discusses it ([OMPI
users] Changing location where checkpoints are saved) and also a bug
opened by Josh: https://svn.open-mpi.org/trac/ompi/ticket/2139 . I
think that it could work... we will see.
To solve problem 3, I didn't have time to look into it. But if Josh or
anyone has an idea... please tell us :-)


Reuti, Did you test it successfully? How do you solve these problems?

Regards,
Sergio


Reuti escribió:


Hi,

Am 14.12.2009 um 17:05 schrieb Sergio Díaz:

I got a successful checkpoint with a fresh installation and without
using the trunk. I can't understand why it is working now when before I
couldn't do a successful restart... Maybe there was something wrong in
the OpenMPI installation and the metadata was created in a wrong way.

I will test it more and also I will test the trunk.

Regards,
Sergio

[sdiaz@compute-3-13 ~]$ ompi-restart -machinefile mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt

 tiempo  110
 Process1 :
 compute-3-14.local
 of2
 tiempo  110
 Process0 :
 compute-3-13.local
 of2
 tiempo  120
 Process1 :
 compute-3-14.local
 of2
 tiempo  120
 Process0 :
 compute-3-13.local
...
...

[sdiaz@compute-3-14 ~]$ ps auxf |grep sdiaz
sdiaz26273  0.0  0.0 34676 1668 ?Ss   15:58   0:00 orted --daemonize -mca ess env -mca orte_ess_jobid 1739128832 -mca


In a Tight Integration with SGE the daemon should get the argument
--no-daemonize. Are you restarting on the command line a job which
previously ran under SGE's supervision?


-- Reuti 



orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 1739128832.0;tcp://192.168.4.148:45551 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3_bis/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz
sdiaz26274  0.1  0.0 15984  504 ?Sl   15:58   0:00  \_ cr_restart /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.26047
sdiaz26047  1.5  0.0 99460 3624 ?Sl   15:58   0:00  \_ ./pi3


[sdiaz@compute-3-13 ~]$ ps auxf |grep sdiaz
root 12878  0.0  0.0 90260 3000 pts/0S15:55   0:00  |   \_ su - sdiaz
sdiaz12880  0.0  0.0 53432 1512 pts/0S15:55   0:00  |   \_ -bash
sdiaz13070  0.3  0.0 39988 2500 pts/0S+   15:58   0:00  |   \_ mpirun -am ft-enable-cr --default-hostfile mpi_test/lanzar_pi3.sh.po3117822 --app /home/cesga/sdiaz/ompi_global_snap

Re: [OMPI users] checkpointing multi node and multi process applications

2010-01-11 Thread Josh Hursey


On Dec 19, 2009, at 7:42 AM, Jean Potsam wrote:


Hi Everyone,
I am trying to checkpoint an MPI application
running on multiple nodes. However, I get some error messages when I
trigger the checkpointing process.


Error: expected_component: PID information unavailable!
Error: expected_component: Component Name information unavailable!

I am using Open MPI 1.3 and BLCR 0.8.1.


Can you try the v1.4 release and see if the problem persists?



I execute my application as follows:

mpirun -am ft-enable-cr -np 3 --hostfile hosts gol.

My question:

Does Open MPI with BLCR support checkpointing of a multi-node MPI
application? If so, can you provide me with some information on how to
achieve this?


Open MPI is able to checkpoint a multi-node application (that's what  
it was designed to do). There are some examples at the link below:

  http://www.osl.iu.edu/research/ft/ompi-cr/examples.php

-- Josh



Cheers,

Jean.





Re: [OMPI users] OpenMPI w valgrind: need to recompile?

2010-01-11 Thread Rainer Keller
Hello Saurabh,
On Wednesday 06 January 2010 11:20:55 am Saurabh T wrote:
> I am building libraries against OpenMPI, and then applications using those
> libraries.
>
> It was unclear from the FAQ at
> http://www.open-mpi.org/faq/?category=debugging#memchecker_how whether the
> libraries need to be recompiled and the application relinked using
> valgrind-enabled mpicc etc, in order to get valgrind to work.
An Open MPI built with memchecker support does not add or change anything
in mpicc; there are no further libraries that need to be linked against.

> In other words, can I run a valgrind-disabled openmpi app with a valgrind-
> enabled orterun, or do I have to recompile/relink the whole thing? Is the
> answer different for shared vs static openmpi libraries?
Therefore, in the case of a shared library: yes, you can run the app compiled
with a non-memchecker Open MPI and later alter PATH/LD_LIBRARY_PATH to use the
memchecker/valgrind-enabled Open MPI (although this is not the common and
suggested practice ;-)).
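As a hedged sketch of that switch (the install prefix is hypothetical, and the application is assumed to be dynamically linked against libmpi):

```shell
# Point both the launcher and the dynamic linker at a
# memchecker-enabled Open MPI build, then run under valgrind.
export PATH=/opt/ompi-memchecker/bin:$PATH
export LD_LIBRARY_PATH=/opt/ompi-memchecker/lib:$LD_LIBRARY_PATH
mpirun -np 2 valgrind ./app
```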

And yes, it would be different when statically linking against libmpi.a
(which, if not built with --enable-memchecker, would simply not invoke the
valgrind API).


> The FAQ also states that openmpi from v 1.5 provides a valgrind suppression 
> file. Is this a mistake in the FAQ or is the suppression file not available
> with the latest stable release (1.4)? If not, can the 1.5 file be used with
> 1.4?
That's a good point!
Please see the CMR ticket #2162

Best regards,
Rainer
-- 

Rainer Keller, PhD  Tel: +1 (865) 241-6293
Oak Ridge National Lab  Fax: +1 (865) 241-4811
PO Box 2008 MS 6164   Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008AIM/Skype: rusraink