Re: [OMPI users] MPI debugger
On Sun, 10 Jan 2010 19:29:18 +0000, Ashley Pittman wrote:
> It'll show you parallel stack traces but won't let you single step, for
> example.

Two lightweight options if you want stepping, breakpoints, watchpoints, etc.:

* Use serial debuggers on some interesting processes, for example with

    mpiexec -n 1 xterm -e gdb --args ./trouble args : -n 2 ./trouble args : -n 1 xterm -e gdb --args ./trouble args

  to put an xterm on ranks 0 and 3 of a four-process job (there are lots of
  other ways to get here).

* MPICH2 has a poor man's parallel debugger: mpiexec.mpd -gdb allows you to
  send the same gdb commands to each process and collate the output.

Jed
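A related pattern, sketched here as a generic example rather than anything from the thread above: if the job is already running, a serial debugger can be attached to a single suspect rank by PID (12345 is a placeholder; find the real PID with ps on that rank's node):

    gdb -p 12345          # attach to the already-running rank
    (gdb) bt              # where is it stuck right now?
    (gdb) break MPI_Recv  # stop at the next receive
    (gdb) continue        # resume under debugger control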
[OMPI users] OpenMPI less fast than MPICH
Hi all,

I want to migrate my CFD application from MPICH-1.2.4 (ch_p4 device) to OpenMPI-1.4. Hence, I compared the two libraries compiled with my application, and I noted that OpenMPI is less efficient than MPICH over Ethernet (170 min with MPICH against 200 min with OpenMPI). So, I wonder if someone has more information/explanation.

Here are the configure options of OpenMPI:

    export FC=gfortran
    export F77=$FC
    export CC=gcc
    export PREFIX=/usr/local/bin/openmpi-1.4
    ./configure --prefix=$PREFIX --enable-cxx-exceptions --enable-mpi-f77 \
        --enable-mpi-f90 --enable-mpi-cxx --enable-mpi-cxx-seek --enable-dist \
        --enable-mpi-profile --enable-binaries --enable-cxx-exceptions \
        --enable-mpi-threads --enable-memchecker --with-pic --with-threads \
        --with-valgrind --with-libnuma --with-openib

Although my OpenMPI build supports OpenIB, I did not specify any mca/btl options because the machine does not have access to an InfiniBand interconnect. So, I guess tcp, sm and self are used (or at least something close).

Thank you for your help.
Mathieu.
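One way to double-check which BTLs are actually selected (a sketch; ./cfd_app stands in for the real binary) is to list the compiled-in components with ompi_info and to pin the selection explicitly:

    ompi_info | grep btl                          # show available BTL components
    mpirun --mca btl tcp,sm,self -np 8 ./cfd_app  # restrict to TCP, shared memory, loopback

If the timings do not change, the OpenIB support is indeed not being used and the gap against MPICH lies elsewhere (e.g. in TCP tuning).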
Re: [OMPI users] Problem with checkpointing multihosts, multiprocesses MPI application
On Dec 12, 2009, at 10:03 AM, Kritiraj Sajadah wrote:

> Dear All,
> I am trying to checkpoint an MPI application which has two processes, each
> running on a separate host. I run the application as follows:
>
>     raj@sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile sunhost -mca btl ^openib -mca snapc_base_global_snapshot_dir /tmp m.

Try setting the 'snapc_base_global_snapshot_dir' in your $HOME/.openmpi/mca-params.conf file instead of on the command line. This way it will be properly picked up by the ompi-restart command. See the link below for how to do this:
  http://www.osl.iu.edu/research/ft/ompi-cr/examples.php#uc-ckpt-global

> and I trigger the checkpoint as follows:
>
>     raj@sun32:~$ ompi-checkpoint -v 30010
>
> The following happens, displaying two errors while checkpointing the
> application:
>
>     I am processor no 0 of a total of 2 procs on host sun32
>     I am processor no 1 of a total of 2 procs on host sun06
>     I am processo no 0 of a total of 2 procs on host sun32
>     I am processo no 1 of a total of 2 procs on host sun06
>     [sun32:30010] Error: expected_component: PID information unavailable!
>     [sun32:30010] Error: expected_component: Component Name information unavailable!

The only way this error could be generated when checkpointing (versus restarting) is if the Snapshot Coordinator failed to propagate the CRS component used, so that it could be stored in the metadata. If this continues to happen, try enabling debugging in the snapshot coordinator:

    mpirun -mca snapc_full_verbose 20 ...

>     I am proceor no 1 of a total of 2 procs on host sun06
>     I am proceor no 0 of a total of 2 procs on host sun32
>     bye
>     bye
>
> When I try to restart the application from the checkpointed file, I get the
> following:
>
>     raj@sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
>     --------------------------------------------------------------------------
>     Error: The filename (opal_snapshot_1.ckpt) is invalid because either you
>     have not provided a filename or provided an invalid filename.
>     Please see --help for usage.
>     --------------------------------------------------------------------------
>     I am proceor no 0 of a total of 2 procs on host sun32
>     bye

This usually indicates that either:
 1) The local checkpoint directory (opal_snapshot_1.ckpt) is missing. So the global checkpoint is either corrupted, or the node where rank 1 resided was not able to access the storage location (/tmp in your example).
 2) You moved the ompi_global_snapshot_30010.ckpt directory from /tmp to somewhere else. Currently, manually moving the global checkpoint directory is not supported.

-- Josh

> I would very much appreciate it if you could give me some ideas on how to
> checkpoint and restart an MPI application running on multiple hosts.
>
> Thank you
>
> Regards,
> Raj
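Following Josh's suggestion, the parameter file would look roughly like this (a minimal sketch; /tmp mirrors the command line above):

    # $HOME/.openmpi/mca-params.conf
    snapc_base_global_snapshot_dir = /tmp

With that in place, the mpirun line no longer needs the -mca snapc_base_global_snapshot_dir argument, and ompi-restart picks the location up automatically.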
Re: [OMPI users] problem restarting multiprocess mpi application
On Dec 13, 2009, at 3:57 PM, Kritiraj Sajadah wrote:

> Dear All,
> I am running a simple MPI application which looks as follows:
>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>     #include <unistd.h>
>     #include <mpi.h>
>
>     int main(int argc, char **argv)
>     {
>         int rank, size;
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>         printf("Hello\n");
>         sleep(15);
>         printf("Hello again\n");
>         sleep(15);
>         printf("Final Hello\n");
>         sleep(15);
>         printf("bye \n");
>         MPI_Finalize();
>         return 0;
>     }
>
> When I run my application as follows, it checkpoints correctly, but when I
> try to restart it, it gives the following errors:
>
>     ompi-restart ompi_global_snapshot_380.ckpt
>     Hello again
>     [sun06:00381] *** Process received signal ***
>     [sun06:00381] Signal: Bus error (7)
>     [sun06:00381] Signal code: (2)
>     [sun06:00381] Failing at address: 0xae7cb054
>     [sun06:00381] [ 0] [0xb7f8640c]
>     [sun06:00381] [ 1] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_progress+0x123) [0xb7b95456]
>     [sun06:00381] [ 2] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7bcb093]
>     [sun06:00381] [ 3] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7bcae97]
>     [sun06:00381] [ 4] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_crs_blcr_checkpoint+0x187) [0xb7bca69b]
>     [sun06:00381] [ 5] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_cr_inc_core+0xc3) [0xb7b970bd]
>     [sun06:00381] [ 6] /home/raj/openmpisof/lib/libopen-rte.so.0 [0xb7cab06f]
>     [sun06:00381] [ 7] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_cr_test_if_checkpoint_ready+0x129) [0xb7b96fca]
>     [sun06:00381] [ 8] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7b97698]
>     [sun06:00381] [ 9] /lib/libpthread.so.0 [0xb7ac4f3b]
>     [sun06:00381] [10] /lib/libc.so.6(clone+0x5e) [0xb7a4bbee]
>     [sun06:00381] *** End of error message ***
>     --------------------------------------------------------------------------
>     mpirun noticed that process rank 0 with PID 399 on node sun06 exited on
>     signal 7 (Bus error).
>     --------------------------------------------------------------------------

This could be caused by a variety of things, including a bad BLCR installation. :/ Are you sure that your application was between MPI_Init() and MPI_Finalize() when you checkpointed?

> I am running it as follows:
>
>     mpirun -am ft-enable-cr -np 2 -mca btl ^openib -mca snapc_base_global_snapshot_dir /tmp mpisleepbas.

Try specifying the MCA parameters in your $HOME/.openmpi/mca-params.conf file.

> Once a checkpoint is taken, I have to copy it to the home directory and try
> to restart it.

The manual movement of the checkpoint files is not currently supported. I filed a bug about it if you want to track it:
  https://svn.open-mpi.org/trac/ompi/ticket/2161

> Please note that if I use -np 1, it works fine when I restart it. The
> problem is mainly when the application has more than one process running.

Are the processes on the same machine or on different machines?

-- Josh

> Any help will be very much appreciated
>
> Raj
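To rule out a broken BLCR installation, BLCR can be exercised on its own, outside Open MPI (a rough sketch using BLCR's standard utilities; ./app and PID 4242 are placeholders):

    lsmod | grep blcr        # are the BLCR kernel modules loaded?
    cr_run ./app &           # start a program under BLCR control
    cr_checkpoint 4242       # checkpoint it by PID; writes context.4242
    cr_restart context.4242  # restart from the context file

If this cycle already fails, the problem is in BLCR itself rather than in Open MPI's checkpoint/restart layer.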
Re: [OMPI users] checkpoint opempi-1.3.3+sge62
On Dec 14, 2009, at 12:25 PM, Sergio Díaz wrote:

> Hi Reuti,
>
> Yes, I sent a job with SGE and I checkpointed the mpirun process by hand,
> entering the MPI master node. Then I killed the job with qdel, and after
> that I did the ompi-restart. I will try to integrate it with SGE by
> creating a ckpt environment, but I think that it could be a bit difficult
> because:
>
> 1 - When I do a checkpoint, I can't specify a directory with a name like
>     checkpoint_jobid.
> 2 - I can't specify the scratch directory, and I have to use /tmp instead
>     of SGE's scratch directory.
> 3 - I tried to restart the snapshot and it only works if I use the same
>     machinefile. That is, if the job ran on c3-13 and c3-14, I have to
>     restart the job using a machinefile with these two nodes.

This is usually caused by prelink'ing interfering with BLCR. See the BLCR FAQ for how to disable this option:
  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink

Let me know if that fixes the problem.

Josh

>     [sdiaz@svgd ~]$ ompi-restart -v -machinefile mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt
>     [svgd.cesga.es:28836] Checking for the existence of (/home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt)
>     [svgd.cesga.es:28836] Restarting from file (ompi_global_snapshot_12554.ckpt)
>     [svgd.cesga.es:28836]    Exec in self
>     tiempo 110 Process 1 : compute-3-14.local of 2
>     tiempo 110 Process 0 : compute-3-13.local of 2
>     --------------------------------------------------------------------------
>     mpirun noticed that process rank 1 with PID 8477 on node compute-3-15
>     exited on signal 11 (Segmentation fault).
>     --------------------------------------------------------------------------
>
> To solve problem 1, there is a feature request opened by Josh
> (https://svn.open-mpi.org/trac/ompi/ticket/2098).
> To solve problem 2, there is a thread discussing it ([OMPI users] Changing
> location where checkpoints are saved) and also a bug opened by Josh
> (https://svn.open-mpi.org/trac/ompi/ticket/2139). I think that it could
> work... we will see.
> For problem 3, I didn't have time to look into it. But if Josh or anyone
> has an idea... please tell us :-)
>
> Reuti, did you test it successfully? How did you solve these problems?
>
> Regards,
> Sergio
>
> Reuti wrote:
>> Hi,
>>
>> On 14.12.2009 at 17:05, Sergio Díaz wrote:
>>
>>> I got a successful checkpoint with a fresh installation and without using
>>> the trunk. I can't understand why it is working now, when before I could
>>> not do a successful restart... Maybe there was something wrong in the
>>> OpenMPI installation and then the metadata was created in a wrong way. I
>>> will test it more and also test the trunk.
>>>
>>> Regards,
>>> Sergio
>>>
>>>     [sdiaz@compute-3-13 ~]$ ompi-restart -machinefile mpi_test/lanzar_pi3.sh.po3117822 ompi_global_snapshot_12554.ckpt
>>>     tiempo 110 Process 1 : compute-3-14.local of 2
>>>     tiempo 110 Process 0 : compute-3-13.local of 2
>>>     tiempo 120 Process 1 : compute-3-14.local of 2
>>>     tiempo 120 Process 0 : compute-3-13.local
>>>     ...
>>>
>>>     [sdiaz@compute-3-14 ~]$ ps auxf | grep sdiaz
>>>     sdiaz 26273 0.0 0.0 34676 1668 ? Ss 15:58 0:00 orted --daemonize -mca ess env -mca orte_ess_jobid 1739128832 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "1739128832.0;tcp://192.168.4.148:45551" -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3_bis/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz
>>>     sdiaz 26274 0.1 0.0 15984  504 ? Sl 15:58 0:00  \_ cr_restart /home/cesga/sdiaz/ompi_global_snapshot_12554.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.26047
>>>     sdiaz 26047 1.5 0.0 99460 3624 ? Sl 15:58 0:00      \_ ./pi3
>>>
>>>     [sdiaz@compute-3-13 ~]$ ps auxf | grep sdiaz
>>>     root  12878 0.0 0.0 90260 3000 pts/0 S  15:55 0:00 | \_ su - sdiaz
>>>     sdiaz 12880 0.0 0.0 53432 1512 pts/0 S  15:55 0:00 |     \_ -bash
>>>     sdiaz 13070 0.3 0.0 39988 2500 pts/0 S+ 15:58 0:00 |         \_ mpirun -am ft-enable-cr --default-hostfile mpi_test/lanzar_pi3.sh.po3117822 --app /home/cesga/sdiaz/ompi_global_snap
>>
>> In a Tight Integration into SGE the daemon should get the argument
>> --no-daemonize. Are you restarting, on the command line, a job which ran
>> before under SGE's supervision?
>>
>> -- Reuti
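For reference, the prelink workaround Josh points to in the BLCR FAQ amounts to turning prelinking off and undoing what has already been done (a sketch for a Red Hat style system, run as root; the config path is distribution dependent):

    # stop future prelinking
    sed -i 's/^PRELINKING=yes/PRELINKING=no/' /etc/sysconfig/prelink
    # undo prelinking already applied to installed binaries and libraries
    prelink --undo --all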
Re: [OMPI users] checkpointing multi node and multi process applications
On Dec 19, 2009, at 7:42 AM, Jean Potsam wrote:

> Hi Everyone,
> I am trying to checkpoint an MPI application running on multiple nodes.
> However, I get some error messages when I trigger the checkpointing
> process:
>
>     Error: expected_component: PID information unavailable!
>     Error: expected_component: Component Name information unavailable!
>
> I am using Open MPI 1.3 and BLCR 0.8.1.

Can you try the v1.4 release and see if the problem persists?

> I execute my application as follows:
>
>     mpirun -am ft-enable-cr -np 3 --hostfile hosts gol.
>
> My question: does Open MPI with BLCR support checkpointing of a multi-node
> execution of an MPI application? If so, can you provide me with some
> information on how to achieve this?

Open MPI is able to checkpoint a multi-node application (that's what it was designed to do). There are some examples at the link below:
  http://www.osl.iu.edu/research/ft/ompi-cr/examples.php

-- Josh

> Cheers,
> Jean.
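The basic multi-node cycle from the examples page Josh links boils down to three commands (a sketch; 1234 stands for the PID of mpirun):

    mpirun -am ft-enable-cr -np 3 --hostfile hosts ./gol   # run with checkpointing enabled
    ompi-checkpoint -v 1234                                # from another shell: take a snapshot
    ompi-restart ompi_global_snapshot_1234.ckpt            # later: resume the whole job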
Re: [OMPI users] OpenMPI w valgrind: need to recompile?
Hello Saurabh,

On Wednesday 06 January 2010 11:20:55 am Saurabh T wrote:
> I am building libraries against OpenMPI, and then applications using those
> libraries.
>
> It was unclear from the FAQ at
> http://www.open-mpi.org/faq/?category=debugging#memchecker_how whether the
> libraries need to be recompiled and the application relinked using a
> valgrind-enabled mpicc etc., in order to get valgrind to work.

An Open MPI built with memchecker does not add or change anything in mpicc; there are no further libraries that need to be linked against.

> In other words, can I run a valgrind-disabled openmpi app with a valgrind-
> enabled orterun, or do I have to recompile/relink the whole thing? Is the
> answer different for shared vs static openmpi libraries?

Therefore, in the case of a shared library: yes, you can run an app compiled with a non-memchecker Open MPI, and later alter the PATH/LD_LIBRARY_PATH to use the memchecker/valgrind-enabled Open MPI... (although this is not the common and suggested practice ,-))

And yes, it would be different when statically linking against libmpi.a (which, if built without --enable-memchecker, would just not invoke the valgrind API).

> The FAQ also states that openmpi from v 1.5 provides a valgrind suppression
> file. Is this a mistake in the FAQ or is the suppression file not available
> with the latest stable release (1.4)? If not, can the 1.5 file be used with
> 1.4?

That's a good point! Please see the CMR ticket #2162.

Best regards,
Rainer

--
Rainer Keller, PhD        Tel: +1 (865) 241-6293
Oak Ridge National Lab    Fax: +1 (865) 241-4811
PO Box 2008 MS 6164       Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008  AIM/Skype: rusraink
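For completeness, running under the suppression file mentioned in the FAQ would look roughly like this (assuming an --enable-memchecker build installed under $PREFIX; ./my_app is a stand-in for the real application):

    mpirun -np 2 valgrind \
        --suppressions=$PREFIX/share/openmpi/openmpi-valgrind.supp ./my_app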