[OMPI users] OpenMPI Checkpoint/Restart components

2009-12-06 Thread Andreea Costea
Hi there. Lately I've been reading lots of papers about fault tolerance for MPI applications. All seemed very nice and clear. But as soon as I moved from reading to testing, I was surprised that I could not find any implementations. The best I could find is the possibility of manually check

[OMPI users] OpenMPI checkpoint/restart

2010-01-14 Thread Andreea Costea
Hei there, I have some questions regarding checkpoint/restart: 1. Until recently I thought that ompi-checkpoint and ompi-restart are used to checkpoint a process inside an MPI application. Now I reread this and I realized that actually what it does

[OMPI users] Checkpoint/Restart error

2010-01-14 Thread Andreea Costea
Hi, I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have downloaded today. When I want to checkpoint I am having the following error message: [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 399 HNP with PID 2337 Not found! I tried the same thing with vers

Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Andreea Costea
Hi... still not working. Though I uninstalled OpenMPI with make uninstall and I manually deleted all other files, I still have the same error when checkpointing. Any idea? Thanks, Andreea On Thu, Jan 14, 2010 at 10:38 PM, Joshua Hursey wrote: > On Jan 14, 2010, at 8:20 AM, Andreea Cos

Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Andreea Costea
works, ompi-checkpoint gives the following error: [[35906,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 405 HNP with PID 7899 Not found! I would appreciate any help, Andreea On Fri, Jan 15, 2010 at 1:15 PM, Andreea Costea wrote: > Hi... > still not working. Though I unins

Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Andreea Costea
07 PM, Andreea Costea wrote: > I tried the new version that was uploaded today. I still have that error, > just that now it is at line 405 instead of 399. > > Maybe if I give more details: > - I first had OpenMPI version 1.3.3 with BLCR installed: mpirun, > ompi-checkpoint and ompi-r

Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Andreea Costea
It's almost midnight here, so I left home, but I will try it tomorrow. There were some directories left after "make uninstall". I will give more details tomorrow. Thanks Jeff, Andreea On Fri, Jan 15, 2010 at 11:30 PM, Jeff Squyres wrote: > On Jan 15, 2010, at 8:07 AM, An

Re: [OMPI users] Checkpoint/Restart error

2010-01-18 Thread Andreea Costea
Squyres wrote: > >> On Jan 15, 2010, at 8:07 AM, Andreea Costea wrote: >> >> > - I wanted to update to version 1.4.1 and I uninstalled previous version >> like this: make uninstall, and then manually deleted all the leftover >> files. the directory where I install

Re: [OMPI users] Checkpoint/Restart error

2010-01-19 Thread Andreea Costea
mca_param_files snapc_base_global_snapshot_dir All 3 params differ because of the $HOME. One more thing: I don't have the directory $HOME/.openmpi Ideas? Thanks, Andreea On Tue, Jan 19, 2010 at 12:51 PM, Andreea Costea wrote: > Well... I decided to install a fresh OS to be sure that there is no OpenMPI &

Re: [OMPI users] Checkpoint/Restart error

2010-01-25 Thread Andreea Costea
previously mentioned error on root. Both root and guest show the same output after "param -all -all" except for the $HOME (which only matters for mca_component_path, mca_param_files, snapc_base_global_snapshot_dir) Thanks, Andreea On Tue, Jan 19, 2010 at 9:01 PM, Andreea Costea wrote: >

[OMPI users] OpenMPI Suspend/Resume

2010-02-02 Thread Andreea Costea
Hi. Let's say I have an MPI application that runs on several hosts. I want to suspend the application. I do that by sending the TSTP signal to the mpirun process. Is there any way to measure how long it takes the application to suspend completely? Doing this "time kill -TSTP PID" will measu

[OMPI users] OpenMPI checkpoint/restart on multiple nodes

2010-02-07 Thread Andreea Costea
Hi, Let's say I have an MPI application running on several hosts. Is there any way to checkpoint this application without shared storage between the nodes? I already took a look at the examples here http://www.osl.iu.edu/research/ft/ompi-cr/examples.php, but it seems that in both cases th

Re: [OMPI users] OpenMPI checkpoint/restart on multiple nodes

2010-02-08 Thread Andreea Costea
reea/checkpoints/local snapc_base_global_snapshot_dir=/home/andreea/checkpoints/global and the nodes can connect through ssh without a password. Thanks, Andreea On Mon, Feb 8, 2010 at 12:59 PM, Andreea Costea wrote: > Hi, > > Let's say I have an MPI application running on several hosts. Is