On Dec 14, 2009, at 12:25 PM, Sergio Díaz wrote:
Hi Reuti,
Yes, I submitted a job with SGE and checkpointed the mpirun process by
hand after logging in to the MPI master node. Then I killed the job with
qdel and after that I ran ompi-restart.
I will try to integrate it with SGE by creating a ckpt e
Hi,
Thanks Reuti. These links were very useful when I did the integration of
BLCR with SGE. I will review them to check if there is more useful
information.
Regards,
Sergio
Reuti wrote:
Hi,
no, I never tried Open MPI's checkpointing. But there are two Howtos
from which you may get som
Hi,
no, I never tried Open MPI's checkpointing. But there are two Howtos
from which you may get some ideas to integrate it with SGE:
http://gridengine.sunsource.net/howto/checkpointing.html
http://gridengine.sunsource.net/howto/APSTC-TB-2004-005.pdf (but Open
MPI's checkpointing seems more
Hi Reuti,
Yes, I submitted a job with SGE and checkpointed the mpirun process by
hand after logging in to the MPI master node. Then I killed the job with qdel
and after that I ran ompi-restart.
I will try to integrate it with SGE by creating a ckpt environment, but I think
that it could be a bit difficu
Hi,
On 14.12.2009 at 17:05, Sergio Díaz wrote:
I got a successful checkpoint with a fresh installation and without
using the trunk. I can't understand why it is working now and before
I could do a successful restart... Maybe there was something wrong
in the openmpi installation and then the
Hi Josh,
I got a successful checkpoint with a fresh installation and without using
the trunk. I can't understand why it is working now and before I could
do a successful restart... Maybe there was something wrong in the
openmpi installation and then the metadata was created in a wrong way.
I wi
Hi Josh
Here is the file.
I will try to apply the trunk, but I think that I broke my openmpi
installation doing "something" and I don't know what :-( . I was
modifying the mca parameters...
When I submit a job, the orted daemon started on the SLAVE host is
launched in a loop till they s
On Nov 12, 2009, at 10:54 AM, Sergio Díaz wrote:
Hi Josh,
You were right. The main problem was the /tmp. SGE uses a scratch
directory in which the jobs have temporary files. Setting TMPDIR to
/tmp, checkpoint works!
However, when I try to restart it... I got the following error (see
ERRO
Hi Josh,
You were right. The main problem was the /tmp. SGE uses a scratch
directory in which the jobs have temporary files. Setting TMPDIR to
/tmp, checkpoint works!
However, when I try to restart it... I got the following error (see
ERROR1). Option -v adds these lines (see ERROR2).
I was
On Nov 9, 2009, at 5:33 AM, Sergio Díaz wrote:
> Hi Josh,
>
> The OpenMPI version is 1.3.3.
>
> The command ompi-ps doesn't work.
>
> [root@compute-3-18 ~]# ompi-ps -j 2726959 -p 16241
> [root@compute-3-18 ~]# ompi-ps -v -j 2726959 -p 16241
> [compute-3-18.local:16254] orte_ps: Acquiring list
Hi Josh,
The OpenMPI version is 1.3.3.
The command ompi-ps doesn't work.
[root@compute-3-18 ~]# ompi-ps -j 2726959 -p 16241
[root@compute-3-18 ~]# ompi-ps -v -j 2726959 -p 16241
[compute-3-18.local:16254] orte_ps: Acquiring list of HNPs and setting
contact info into RML...
[root@compute-3-18
On Oct 28, 2009, at 7:41 AM, Sergio Díaz wrote:
Hello,
I have managed to checkpoint a simple program without SGE. Now,
I'm trying to integrate openmpi+sge but I have some
problems... When I try to checkpoint the mpirun PID, I get an
error similar to the error gotten wh
I am having the same problem when I want to checkpoint manually: "HNP with PID
Not found!", though I am sure I put the right PID
--- On Mon, 11/2/09, Sergio Díaz wrote:
From: Sergio Díaz
Subject: Re: [OMPI users] checkpoint opempi-1.3.3+sge62
To: "Open MPI Users"
Hi again,
I found a C program to test ompi-checkpoint/restart and it works fine.
The program was written by Alan Woodland and shared on the following
mailing list: debian-bugs-d...@lists.debian.org
This program starts a countdown from 10 to 0 and, when the countdown reaches
6, does a checkpoint, k
Hello,
I have managed to checkpoint a simple program without SGE. Now, I'm
trying to integrate openmpi+sge but I have some problems...
When I try to checkpoint the mpirun PID, I get an error similar to
the one I get when the PID doesn't exist. See the example below.
Any idea