We have the checkpoint/restart working now. Turns out that the BLCR kernel mods were installed incorrectly.
Thanks for the help.

-Wayne

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Josh Hursey
Sent: Monday, January 28, 2008 6:57 PM
To: Open MPI Users
Subject: Re: [OMPI users] (no subject)

I'm unable to reproduce this problem. :(

I tried both the svn head (r17288) and the tarball that you were using (openmpi-1.3a1r17175) on a similar system without problem.

The error you are seeing may be caused by old connectivity information in the session directory. You may want to make sure that /tmp does not contain any "openmpi-session*" directories before starting mpirun. Other than that, you may want to try a clean build of Open MPI just to make sure that you are not seeing anything odd resulting from old Open MPI install files.

Let me know if that helps.

-- Josh

On Jan 24, 2008, at 12:38 PM, Wong, Wayne wrote:

> I'm having some difficulty getting the Open MPI checkpoint/restart
> fault tolerance working. I have compiled Open MPI with the
> "--with-ft=cr" flag, but when I attempt to run my test program (ring), the
> ompi-checkpoint command fails. I have verified that the test program
> works fine without the fault tolerance enabled. Here are the details:
>
> [me@dev1 ~]$ mpirun -np 4 -am ft-enable-cr ring
> [me@dev1 ~]$ ps -efa | grep mpirun
> me 3052 2820 1 08:25 pts/2 00:00:00 mpirun -np 4 -am ft-enable-cr ring
>
> [me@dev1 ~]$ ompi-checkpoint 3052
> [dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file sds_singleton_module.c at line 50
> [dev1.acme.local:03060] [NO-NAME] ORTE_ERROR_LOG: Unknown error: 5854512 in file runtime/orte_init.c at line 311
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_sds_base_set_name failed
>   --> Returned value Unknown error: 5854512 (5854512) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
>
> Any help would be appreciated. Thanks.
>
> <ompi_info.txt.gz><config.log.gz>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
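
The two checks that came up in this thread (stale "openmpi-session*" directories under /tmp, and whether the BLCR kernel modules are actually loaded) could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, and the module name `blcr` is an assumption based on a typical BLCR install, so adjust it if your build names its modules differently.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of Open MPI or BLCR.

# List leftover Open MPI session directories directly under a temp root
# (defaults to /tmp). The "openmpi-session*" pattern is the one named in the thread.
list_stale_sessions() {
    root="${1:-/tmp}"
    find "$root" -maxdepth 1 -type d -name 'openmpi-session*' 2>/dev/null
}

# Succeed if a module named "blcr" appears in a module list
# (defaults to /proc/modules). ASSUMPTION: the main BLCR module is called "blcr".
blcr_loaded() {
    modules_file="${1:-/proc/modules}"
    grep -q '^blcr ' "$modules_file" 2>/dev/null
}

# Report what is found; nothing is deleted here.
list_stale_sessions /tmp
if blcr_loaded; then
    echo "blcr module loaded"
else
    echo "blcr module not loaded"
fi
```

The script only reports. Any directories it lists should be removed by hand, and only after confirming that no mpirun is still running for that user.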