MPI from all
> machines on your cluster, before installing the new version? Sometimes
> problems like this come up because of mismatches in Open MPI versions on a
> machine.
>
> -- Josh
>
>
> On Mar 23, 2010, at 5:42 PM, fengguang tian wrote:
>
> I met the same probl
ue, Mar 23, 2010 at 12:55 PM, fengguang tian
>> wrote:
>>
>>>
>>> I use mpirun -np 50 -am ft-enable-cr --mca snapc_base_global_snapshot_dir
>>> --hostfile .mpihostfile
>>> to store the global checkpoint snapshot into the shared
>>> di
I ran into the same problem described at this link:
http://www.open-mpi.org/community/lists/users/2009/12/11374.php
The solution given there is to use Open MPI v1.4 instead of v1.3.
However, I am using Open MPI v1.7a1r22794 and still hit the same problem.
Here is what I have done:
my cluster composed o
.ckpt/0/opal_snapshot_4.ckpt), mkdir
failed [1]
[nimbus1:12630] Error: No metadata filename specified!
Why is that?
cheers
fengguang
On Tue, Mar 23, 2010 at 10:37 AM, Fernando Lemos wrote:
> On Tue, Mar 23, 2010 at 12:24 PM, fengguang tian
> wrote:
> > Hi
> >
> > I am usin
invalid filename.
Please see --help for usage.
--
cheers
fengguang
On Tue, Mar 23, 2010 at 10:34 AM, Fernando Lemos wrote:
> On Tue, Mar 23, 2010 at 12:27 PM, fengguang tian
> wrote:
> > I have created the
OK, thank you. I will try moving the checkpoint file into the shared
directory.
Regards
fengguang
On Tue, Mar 23, 2010 at 10:34 AM, Fernando Lemos wrote:
> On Tue, Mar 23, 2010 at 12:27 PM, fengguang tian
> wrote:
> > I have created the shared file system. but I created a /mi
I have created the shared file system, but I created /mirror in the root
directory, not under the $HOME directory. Is that the problem? Thank you.
cheers
fengguang
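
One way to rule out a permissions or mount problem with a directory such as
/mirror is to check, on each node, that the directory exists and that the MPI
user can create files in it. The following is only a minimal sketch in POSIX
shell: the function name check_snapshot_dir and the probe file name are
hypothetical and are not part of Open MPI or BLCR.

```shell
# check_snapshot_dir DIR
# Succeeds only if DIR exists and the current user can create a file in it.
# (Hypothetical helper for illustration; not an Open MPI or BLCR command.)
check_snapshot_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    # Try to create, then remove, a small probe file inside DIR.
    probe="$dir/.ckpt_probe.$$"
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "writable: $dir"
        return 0
    fi
    echo "not writable: $dir"
    return 1
}
```

Running `check_snapshot_dir /mirror` as the MPI user on every machine shows
whether each node sees the directory and can write to it. It does not show
whether the nodes are looking at the same shared file system.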
On Tue, Mar 23, 2010 at 10:23 AM, Fernando Lemos wrote:
> On Mon, Mar 22, 2010 at 8:20 PM, fengguang tian
> wrote:
> > I set
Hi,
I am using Open MPI and BLCR on a cluster of 3 machines. Checkpoint and
restart work fine on a single machine, but when checkpointing in the
cluster environment, ompi-checkpoint hangs.
For example:
My cluster consists of 3 machines that share a directory over NFS.
in master node
I set up a cluster of 18 nodes using Open MPI and the BLCR library, and the
MPI program runs well on the cluster.
But how do I checkpoint the MPI program on this cluster?
For example, here is what I do for a test:
mpiu@nimbus: /mirror$ mpirun -np 50 --hostfile .mpihostfile -am ft-enable-cr hellompi
th