Hi
I was trying out the staging option for checkpoints, where I save the
checkpoint image in the local file system and have the image transferred to
the global file system in the background. As part of the background process I
see that the "scp" command is launched to transfer the images from the local
file sys
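(For context, a rough sketch of the kind of invocation I have in mind. The
parameter names snapc_base_store_in_place, crs_base_snapshot_dir, and
snapc_base_global_snapshot_dir are from my recollection of the C/R settings
and may not match your build, so treat them as assumptions; the paths and
process count are placeholders:

  $ mpirun -am ft-enable-cr \
      -mca snapc_base_store_in_place 0 \
      -mca crs_base_snapshot_dir /local/scratch/ckpt \
      -mca snapc_base_global_snapshot_dir /global/fs/ckpt \
      -np 8 ./my_app
)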
Hi
If I replace MPI_Bcast() with paired MPI_Send() and MPI_Recv() calls,
what kind of impact does it have on the performance of the program? Are
there any benchmarks of MPI_Bcast() vs. paired MPI_Send() and
MPI_Recv()?
Thanks
Ananda
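(For illustration, a minimal sketch of the paired-call replacement being
asked about; this is my own example, not code from the thread. The root
issues one MPI_Send per non-root rank, so the cost grows linearly with the
number of processes, whereas MPI_Bcast is free to use a tree or pipeline
algorithm internally:

  /* Naive "flat" broadcast built from paired MPI_Send/MPI_Recv calls. */
  #include <mpi.h>

  static void flat_bcast(void *buf, int count, MPI_Datatype type,
                         int root, MPI_Comm comm)
  {
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);

      if (rank == root) {
          /* Root sends the buffer to every other rank, one at a time. */
          for (int r = 0; r < size; r++) {
              if (r != root) {
                  MPI_Send(buf, count, type, r, 0, comm);
              }
          }
      } else {
          /* Every other rank posts one matching receive from the root. */
          MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
      }
  }

On small process counts and small messages the difference may be hard to
measure; at larger scale the collective is normally the clear winner.)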
Josh
I have a few more observations that I want to share with you.
I modified the earlier C program a little by making two MPI_Bcast() calls
inside a while loop for 10 seconds. The issue of MPI_Bcast() failing with
an ERR_TRUNCATE error message resurfaces when I take a checkpoint of this program.
Int
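(For reference, a minimal sketch of the kind of loop described above; this
is my reconstruction, not the actual test program:

  /* Two MPI_Bcast calls inside a loop that runs for roughly 10 seconds. */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, a = 0, b = 0;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      double start = MPI_Wtime();
      while (MPI_Wtime() - start < 10.0) {
          if (rank == 0) { a++; b++; }
          /* A checkpoint taken while this loop is running is what
             triggered the reported ERR_TRUNCATE failure. */
          MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
          MPI_Bcast(&b, 1, MPI_INT, 0, MPI_COMM_WORLD);
      }

      MPI_Finalize();
      return 0;
  }
)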
Josh
Thanks for addressing the issue. I will try the new version that has
your fix and let you know.
BTW, I have been in touch with the mpi4py team as well to debug this issue.
According to the mpi4py team, MPI_Bcast() is implemented with two collective
calls: the first is an MPI_Bcast() of a single intege
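(If it helps to picture it, a rough C sketch of that two-step pattern, based
only on the description above; the function and variable names are mine:

  /* Two-phase broadcast of a variable-length buffer:
   * step 1 broadcasts the length as a single int,
   * step 2 broadcasts the payload itself. */
  #include <mpi.h>
  #include <stdlib.h>

  static void bcast_blob(char **buf, int *len, int root, MPI_Comm comm)
  {
      int rank;
      MPI_Comm_rank(comm, &rank);

      MPI_Bcast(len, 1, MPI_INT, root, comm);       /* first collective: size */
      if (rank != root) {
          *buf = malloc((size_t)*len);              /* receivers allocate space */
      }
      MPI_Bcast(*buf, *len, MPI_CHAR, root, comm);  /* second collective: data */
  }

If a checkpoint/restart slips in between the two collectives, one can imagine
the ranks disagreeing about the pending message size, which would presumably
surface as the ERR_TRUNCATE error mentioned earlier.)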
Josh
I have one more update on my observations while analyzing this issue.
Just to refresh, I am using the openmpi trunk (r23596) with
mpi4py-1.2.1 and BLCR 0.8.2. When I checkpoint the python script written
using mpi4py, the program doesn't progress after the checkpoint is taken
successfully
Josh
I tried running the mpi4py program with the latest trunk version of
openmpi. I compiled openmpi-1.7a1r23596 from the trunk and recompiled
mpi4py to use this library. Unfortunately, I see the same behavior as I
saw with openmpi 1.4.2, i.e., the checkpoint will be successful but the
program does
OK, I will do that.
But did you try this program on a system where the latest trunk is
installed? Were you successful in checkpointing?
- Ananda
Josh
I have stack traces of all 8 python processes, taken when I observed the hang after
successful completion of the checkpoint. They are in the attached document. Please
see if these stack traces provide any clues.
Thanks
Ananda
Josh
I am having problems compiling the sources from the latest trunk. It
complains that libgomp.spec is missing even though that file exists on my
system. I will see if I have to change any other environment variables
to have a successful compilation. I will keep you posted.
BTW, were you successful
Josh
Attached is the python program that reproduces the hang that
I described. The initial part of the file describes the prerequisite
modules and the steps to reproduce the problem. Please let me know if
you have any questions about reproducing the hang.
Please note that, if I add the foll
Hi
I have integrated mpi4py with openmpi 1.4.2 built with BLCR
0.8.2. When I run ompi-checkpoint on a program written using mpi4py, I
see that the program sometimes doesn't resume after successful checkpoint
creation. This doesn't always occur, meaning the program resumes after
successful
That's correct. I have prefixed them with OMPI_MCA_ when defining them
in my environment. Despite that, I still see some of these files being
created under the default directory /tmp, which is different from what I
had set.
Thanks
Ananda
Hi
I am using open mpi v1.3.4 with BLCR 0.8.2. I have been testing my
openmpi-based program on a 3-node cluster (each node is an Intel Nehalem
based dual quad core) and I have successfully checkpointed and restarted
the program multiple times.
Recently I moved to a 15 node
Ralph
Defining these parameters in my environment also did not resolve the
problem. Whenever I restart my program, the temporary files are getting
stored in the default /tmp directory instead of the directory I had
defined.
Thanks
Ananda
Ralph
When you say manually, do you mean setting these parameters on the
command line when calling mpirun, ompi-restart, and ompi-checkpoint? Or
is there another way to set these parameters?
Thanks
Ananda
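(For what it's worth, the command-line form I was asking about would look
roughly like this. I am assuming here that ompi-checkpoint and ompi-restart
accept -mca options the same way mpirun does; the application name, process
count, PID, and snapshot handle are placeholders:

  $ mpirun -am ft-enable-cr -mca opal_cr_tmp_dir /home/ananda/OPAL -np 8 ./my_app
  $ ompi-checkpoint -mca opal_cr_tmp_dir /home/ananda/OPAL <pid_of_mpirun>
  $ ompi-restart -mca opal_cr_tmp_dir /home/ananda/OPAL <snapshot_handle>
)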
Ralph
I have these parameters set in ~/.openmpi/mca-params.conf file
$ cat ~/.openmpi/mca-params.conf
orte_tmpdir_base = /home/ananda/ORTE
opal_cr_tmp_dir = /home/ananda/OPAL
$
Should I be setting OMPI_MCA_opal_cr_tmp_dir?
FYI, I am using openmpi 1.3.4 with blcr 0.8.2
Thanks
Ananda
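(In case it is useful, the environment-variable form corresponding to those
two entries simply prefixes the same parameter names with OMPI_MCA_:

  $ export OMPI_MCA_orte_tmpdir_base=/home/ananda/ORTE
  $ export OMPI_MCA_opal_cr_tmp_dir=/home/ananda/OPAL
)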
Thanks Ralph.
Another question. Even though I am setting opal_cr_tmp_dir to a
directory other than /tmp when calling the ompi-restart command, this
setting is not getting passed to the mpirun command that gets generated
by ompi-restart. How do I overcome this constraint?
Thanks
Ananda
I am setting the MCA parameter "opal_cr_tmp_dir" to a directory other
than /tmp when calling the "mpirun", "ompi-restart", and "ompi-checkpoint"
commands so that I don't fill up the /tmp filesystem. But I see that the
openmpi-sessions* directory is still getting created under /tmp. How do
I overcome this prob
Hi
I am using open-mpi 1.3.4 with BLCR. Sometimes I run into a
strange problem with the ompi-checkpoint command. Even though I see that all
MPI processes (equal to the np argument) are running, the ompi-checkpoint
command fails at times. I have always seen this failure when the MPI
processes spawned
The description of the MCA parameter "opal_cr_use_thread" is very short at
this URL: http://osl.iu.edu/research/ft/ompi-cr/api.php
Can someone explain the usefulness of enabling this parameter versus
disabling it? In other words, what are the pros and cons of disabling it?
I found that this gets enabled automa
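(My understanding, stated tentatively: the parameter controls a dedicated
checkpoint/restart thread that polls for incoming checkpoint requests even
while the application is busy outside the MPI library, at the cost of some
background CPU overhead; it appears to be compiled in only when Open MPI is
configured with thread support for fault tolerance. Assuming it can be
toggled like any other MCA parameter, something like:

  $ mpirun -am ft-enable-cr -mca opal_cr_use_thread 0 -np 8 ./my_app
  $ mpirun -am ft-enable-cr -mca opal_cr_use_thread 1 -np 8 ./my_app

where ./my_app and the process count are placeholders.)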
Hi
If I run my compute-intensive openmpi-based program using a regular
invocation of mpirun (i.e., mpirun -host <hosts> -np <nprocs> <executable>),
it completes in a few seconds, but if I run the same program
with the "-am ft-enable-cr" option, the program takes 10x the time to complete.
If I enable hyperthreading on my cluster
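(For concreteness, the two invocations being compared look roughly like this,
with the host list, process count, and program name as placeholders:

  $ mpirun -host <hosts> -np <nprocs> ./my_app
  $ mpirun -am ft-enable-cr -host <hosts> -np <nprocs> ./my_app
)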
When I checkpoint my openmpi application using ompi-checkpoint, I see
that the top command suddenly shows some really large numbers in the "CPU %"
field, such as 150%, 200%, etc. After some time these numbers come back
down to normal values under 100%. This happens right around the time the
checkpoint is comp
I am observing a very strange performance issue with my openmpi program.
I have a compute-intensive openmpi-based application that keeps the data
in memory, processes the data, and then dumps it to a GPFS parallel file
system. The GPFS parallel file system server is connected to a QDR
infiniband switch fro