Thank you very much!
The option -mca orte_heartbeat_rate N is very useful for detecting failures such as a
host or network failure, or the orted daemon being killed, while an MPI job is running.
I have another question:
I use ssh for Open MPI remote connections, but sometimes a host doesn't answer the ssh
login request,
Outstanding. I'll have two.
Damien
George Bosilca wrote:
The Open MPI Team, representing a consortium of bailed-out banks, car
manufacturers, and insurance companies, is pleased to announce the
release of the "unbreakable" / bug-free version Open MPI 2009,
(expected to be available by mid-2011). This release is essentially a
complete rewrite of Open MP
On Mar 31, 2009, at 4:21 PM, Gus Correa wrote:
Please, correct my argument below if I am wrong.
I am not sure yet if the problem is caused by libtool,
because somehow it was not present in OpenMPI 1.2.8.
Just as a comparison, the libtool commands on 1.2.8 and 1.3 are very
similar, although 1.2.
On Wed, Apr 1, 2009 at 1:13 AM, Ralph Castain wrote:
> So I gather that by "direct" you mean that you don't get an allocation from
> Maui before running the job, but for the other you do? Otherwise, OMPI
> should detect that it is running under Torque and automatically use the
> Torque launche
On Apr 1, 2009, at 12:42 PM, Dave Love wrote:
Josh Hursey writes:
The configure flag that you are looking for is:
--with-ft=cr
Is there a good reason why --with-blcr doesn't imply it?
Not really. Though it is most likely difficult to make it happen given
the configure logic in Open MPI
Rolf Vandevaart writes:
> No, orte_leave_session_attached is needed to avoid the errno=2 errors
> from the sm btl. (It is fixed in 1.3.2 and trunk)
[It does cause other trouble, but I forget what the exact behaviour was
when I lost it as a default.]
>> Yes, but there's a problem with the recomm
Josh Hursey writes:
> The configure flag that you are looking for is:
> --with-ft=cr
Is there a good reason why --with-blcr doesn't imply it?
> You may also want to consider using the thread options too for
> improved C/R response:
> --enable-mpi-threads --enable-ft-thread
Incidentally, the
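For reference, a checkpoint/restart-capable build combining those flags might be
configured roughly like this (just a sketch; the BLCR install path is a placeholder
for wherever BLCR actually lives on your system):

  ./configure --with-ft=cr --with-blcr=/usr/local/blcr \
              --enable-mpi-threads --enable-ft-thread
  make all install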
Thanks.
$ cat hpl-8cpu-test.sge
#!/bin/bash
#
#$ -N HPL_8cpu_GB
#$ -pe orte 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#
/opt/openmpi-gcc/bin/mpirun --display-allocation --display-map -v -np
$NSLOTS --host node0001,node0002 hostname
$ cat HPL_8cpu_GB.o46
== ALLOCATED NODES
Thanks. I've tried your suggestion.
$ cat hpl-8cpu-test.sge
#!/bin/bash
#
#$ -N HPL_8cpu_GB
#$ -pe orte 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#
/opt/openmpi-gcc/bin/mpirun -mca ras_gridengine_verbose 100 -v -np $NSLOTS
--host node0001,node0002 hostname
It allocated 2 nodes to run; however, all
Hi guys, I'll try to repost my question...
I have a problem with the latest stable build and the latest nightly snapshot.
When I run a job directly with mpirun there is no problem.
If I try to submit it with LSF:
bsub -a openmpi -m grid01 mpirun.lsf /mnt/ewd/mpi/fibonacci/fibonacci_mpi
I get the following error:
mpir
Ick is the proper response. :-)
The old 1.2 series would attempt to spawn a local orted on each of
those nodes, and that is what is failing. Best guess is that it is
because pbsdsh doesn't fully replicate a key part of the environment
that is expected.
One thing you could try is do this w
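One way to see which parts of the environment pbsdsh actually hands to the remote
shell (my own sketch, not necessarily what Ralph was about to suggest) is:

  pbsdsh bash -c 'hostname; env | grep -E "^(PBS|PATH=|LD_LIBRARY_PATH=)" | sort'

and then compare that against the environment of an interactive shell on the same node.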
OK, this is weird, and the correct answer is probably "don't do that".
Anyway:
A user wants to run many, many small jobs, faster than our scheduler
+ Torque can start them, so he uses pbsdsh to start them in parallel, under tm.
pbsdsh bash -c 'cd $PBS_O_WORKDIR/$PBS_VNODENUM; mpirun -np 1
application'
Rolf has correctly reminded me that display-allocation occurs prior to
host filtering, so you will see all of the allocated nodes. You'll see
the impact of the host specifications in display-map,
Sorry for the confusion - thanks to Rolf for pointing it out.
Ralph
On Apr 1, 2009, at 7:40 AM,
Hi Ralph,
unfortunately, on this machine I can't upgrade OpenMPI at the moment.
Is there a way to limit or reduce the probability of this error?
2009/4/1 Ralph Castain :
> Hi Gabriele
>
> I don't think this is a timeout issue. OMPI 1.2.x doesn't scale very well to
> that size due to a requireme
Hi Gabriele
I don't think this is a timeout issue. OMPI 1.2.x doesn't scale very
well to that size due to a requirement that the underlying out-of-band
system fully connect at the TCP level. Thus, every process in your job
will be opening 2002 sockets (one to every other process, one to the
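To put rough numbers on that (my arithmetic, based on the ~2000 processes mentioned
elsewhere in this thread, not figures from Ralph): each process holds on the order of
1999 peer sockets plus a couple more, i.e. the ~2002 descriptors quoted, and the job
as a whole needs about 2000 * 1999 / 2, roughly two million, TCP connections. That is
also why a default per-process descriptor limit (check with "ulimit -n"; 1024 is a
common default) tends to get in the way at this scale; that last point is my own note,
not something stated in this thread.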
As an FYI: you can debug allocation issues more easily by:
mpirun --display-allocation --do-not-launch -n 1 foo
This will read the allocation, do whatever host filtering you specify
with -host and -hostfile options, report out the result, and then
terminate without trying to launch anything.
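For instance, combining the flags already mentioned in this thread (a sketch only;
hostname stands in for the real binary), something like

  mpirun --display-allocation --display-map --do-not-launch \
         -host node0001,node0002 -np $NSLOTS hostname

should print the full allocation, then the map as filtered by -host, and exit
without launching anything.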
There is indeed a heartbeat mechanism you can use - it is "off" by
default. You can set it to check every N seconds with:
-mca orte_heartbeat_rate N
on your command line. Or if you want it to always run, add
"orte_heartbeat_rate = N" to your default MCA param file. OMPI will
declare the or
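For example (a sketch; 10 is an arbitrary interval, my_app is a placeholder, and
~/.openmpi/mca-params.conf is the usual per-user default param file):

  mpirun -mca orte_heartbeat_rate 10 -np 4 ./my_app

or, to make it the default for every run:

  echo "orte_heartbeat_rate = 10" >> ~/.openmpi/mca-params.conf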
It turns out that the --host and --hostfile options act as a filter on
which nodes to run on when you are running under SGE. So, listing them
several times does not affect where the processes land. However, this
still does not explain why you are seeing what you are seeing. One
thing you can
I mean I killed the orted daemon process while the MPI job was running, but the MPI
job hung and couldn't notice that one of its ranks had failed.
> Date: Wed, 1 Apr 2009 19:09:34 +0800
> From: ml.jgmben...@mailsnare.net
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Beginner's question: how to av
Hi all,
I have compiled OpenMPI 1.2.7 with the Intel compilers (icc and ifort) on
a cluster running CentOS 4.7. The build went fine, but when I try to launch an
execution, mpirun can't find some libraries.
When I checked the linked libraries on the nodes, the output was:
[marce@nodo1 ~]$ ldd /home/aplicaciones/ope
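The usual suspect in this situation (a guess on my part, since the ldd output is cut
off here) is that the Intel runtime libraries, libimf.so and friends, are not on
LD_LIBRARY_PATH for non-interactive shells on the compute nodes. A common fix, with
the Intel paths below standing in for wherever the compiler runtime actually lives:

  # in ~/.bashrc on every node (before any "exit if not interactive" test)
  export LD_LIBRARY_PATH=/opt/intel/cc/lib:/opt/intel/fc/lib:$LD_LIBRARY_PATH

Invoking mpirun by its full path, or passing --prefix <openmpi install dir>, also
helps Open MPI find its own libraries on the remote nodes.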
Is there a firewall somewhere?
Jerome
Guanyinzhu wrote:
Hi!
I'm using OpenMPI 1.3 on ten nodes connected with Gigabit Ethernet on
Redhat Linux x86_64.
I ran a test like this: I just killed the orted process, and the job hung
for a long time (it hung for 2~3 hours before I killed the job).
I h
Dear OpenMPI developers,
I have a strange problem while running my application (2000
processors). I'm using openmpi 1.2.22 over Infiniband. The following is
the mca-params.conf:
btl = ^tcp
btl_tcp_if_exclude = eth0,ib0,ib1
oob_tcp_include = eth1,lo,eth0
btl_openib_warn_default_gid_prefix = 0
btl
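A quick gloss on those lines, in case it helps (my reading of the standard MCA
parameter syntax, not something stated in this thread):

  btl = ^tcp                          # "^" excludes components: use every BTL except tcp
  btl_tcp_if_exclude = eth0,ib0,ib1   # interfaces the tcp BTL would skip (moot once tcp is excluded)
  oob_tcp_include = eth1,lo,eth0      # interfaces the out-of-band TCP layer may use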
Hi!
I'm using OpenMPI 1.3 on ten nodes connected with Gigabit Ethernet on Redhat
Linux x86_64.
I ran a test like this: I just killed the orted process, and the job hung for a
long time (it hung for 2~3 hours before I killed the job).
I have the following questions:
when network failed or
Hi Josh,
Yep, adding that "--with-ft=cr" flag did the trick. Thanks.
Cheers,
m
> From: jjhur...@open-mpi.org
> To: us...@open-mpi.org
> Date: Tue, 31 Mar 2009 15:48:05 -0400
> Subject: Re: [OMPI users] OpenMPI 1.3.1 + BLCR build problem
>
> I think that the missing configure option might be th
The difference you are seeing here indicates that the "direct" run is
using the rsh launcher, while the other run is using the Torque
launcher.
So I gather that by "direct" you mean that you don't get an allocation
from Maui before running the job, but for the other you do? Otherwise,
OMP
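If you want to see, or force, which launcher gets picked, something along these lines
should work on a 1.3-series build, where the launcher framework is called plm (it was
pls in 1.2); this is a sketch, not a command from this thread:

  mpirun -mca plm_base_verbose 5 -np 1 hostname   # report which launcher is selected
  mpirun -mca plm tm -np 1 hostname               # insist on the Torque/TM launcher
  mpirun -mca plm rsh -np 1 hostname              # insist on the rsh/ssh launcher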