We have the checkpoint/restart working now.  Turns out that the BLCR
kernel mods were installed incorrectly.

Thanks for the help.

-Wayne

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Josh Hursey
Sent: Monday, January 28, 2008 6:57 PM
To: Open MPI Users
Subject: Re: [OMPI users] (no subject)

I'm unable to reproduce this problem. :( I tried both the svn head
(r17288) and the tarball that you were using (openmpi-1.3a1r17175) on a
similar system without problem.

The error you are seeing may be caused by old connectivity information
in the session directory. You may want to make sure that /tmp does not
contain any "openmpi-session*" directories before starting mpirun.
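
For example, something along these lines would list any leftovers before
you start mpirun. This is only a sketch: the function name
list_stale_sessions and its optional directory argument are made up for
illustration, and it assumes the session directories sit directly under
/tmp on your system.

```shell
# Sketch: list leftover Open MPI session directories.
# list_stale_sessions is a hypothetical helper, not part of Open MPI.
list_stale_sessions() {
    dir="${1:-/tmp}"    # directory to search; defaults to /tmp
    # -maxdepth 1 restricts the search to the top level of "$dir"
    find "$dir" -maxdepth 1 -type d -name 'openmpi-session*' 2>/dev/null
}

list_stale_sessions /tmp
```

If it prints anything while no Open MPI job is running, those directories
are candidates for removal. Check what is listed before deleting anything.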

Other than that you may want to try a clean build of Open MPI just to
make sure that you are not seeing anything odd resulting from old Open
MPI install files.

Let me know if that helps.

-- Josh

On Jan 24, 2008, at 12:38 PM, Wong, Wayne wrote:

> I'm having some difficulty getting the Open MPI checkpoint/restart
> fault tolerance working.  I have compiled Open MPI with the
> "--with-ft=cr" flag, but when I attempt to run my test program (ring),
> the ompi-checkpoint command fails.  I have verified that the test
> program works fine without the fault tolerance enabled.  Here are the
> details:
>
>      [me@dev1 ~]$ mpirun -np 4 -am ft-enable-cr ring
>      [me@dev1 ~]$ ps -efa | grep mpirun
>      me     3052  2820  1 08:25 pts/2    00:00:00 mpirun -np 4 -am ft-enable-cr ring
>
>
>      [me@dev1 ~]$ ompi-checkpoint 3052
>      [dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file sds_singleton_module.c at line 50
>      [dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file runtime/orte_init.c at line 311
>      --------------------------------------------------------------------------
>      It looks like orte_init failed for some reason; your parallel process is
>      likely to abort.  There are many reasons that a parallel process can
>      fail during orte_init; some of which are due to configuration or
>      environment problems.  This failure appears to be an internal failure;
>      here's some additional information (which may only be relevant to an
>      Open MPI developer):
>
>        orte_sds_base_set_name failed
>        --> Returned value Unknown error: 5854512 (5854512) instead of ORTE_SUCCESS
>
>      --------------------------------------------------------------------------
> Any help would be appreciated.  Thanks.
> <ompi_info.txt.gz><config.log.gz>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

