Thank you very much!
The option -mca orte_heartbeat_rate N is very useful for detecting failures such as a
host or network failure, or the orted daemon being killed, while an MPI job is running.
I have another question:
I use ssh for Open MPI remote connections, but sometimes a host doesn't answer the ssh
login request,
Outstanding. I'll have two.
Damien
George Bosilca wrote:
The Open MPI Team, representing a consortium of bailed-out banks, car
manufacturers, and insurance companies, is pleased to announce the
release of the "unbreakable" / bug-free version Open MPI 2009,
(expected to be available by mid-2011). This release is essentially a
complete rewrite of Open MP
On Mar 31, 2009, at 4:21 PM, Gus Correa wrote:
Please, correct my argument below if I am wrong.
I am not sure yet if the problem is caused by libtool,
because somehow it was not present in OpenMPI 1.2.8.
Just as a comparison, the libtool commands on 1.2.8 and 1.3 are very
similar, although 1.2.
On Wed, Apr 1, 2009 at 1:13 AM, Ralph Castain wrote:
> So I gather that by "direct" you mean that you don't get an allocation from
> Maui before running the job, but for the other you do? Otherwise, OMPI
> should detect that it is running under Torque and automatically use the
> Torque launche
On Apr 1, 2009, at 12:42 PM, Dave Love wrote:
Josh Hursey writes:
The configure flag that you are looking for is:
--with-ft=cr
Is there a good reason why --with-blcr doesn't imply it?
Not really. Though it is most likely difficult to make it happen given
the configure logic in Open MPI
Rolf Vandevaart writes:
> No, orte_leave_session_attached is needed to avoid the errno=2 errors
> from the sm btl. (It is fixed in 1.3.2 and trunk)
[It does cause other trouble, but I forget what the exact behaviour was
when I lost it as a default.]
>> Yes, but there's a problem with the recomm
Josh Hursey writes:
> The configure flag that you are looking for is:
> --with-ft=cr
Is there a good reason why --with-blcr doesn't imply it?
> You may also want to consider using the thread options too for
> improved C/R response:
> --enable-mpi-threads --enable-ft-thread
Incidentally, the
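For reference, a checkpoint/restart-capable build combining those flags might be
configured roughly like this (just a sketch; the BLCR install path is a placeholder
for wherever BLCR actually lives on your system):

  ./configure --with-ft=cr --with-blcr=/usr/local/blcr \
              --enable-mpi-threads --enable-ft-thread
  make all install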
Thanks.
$ cat hpl-8cpu-test.sge
#!/bin/bash
#
#$ -N HPL_8cpu_GB
#$ -pe orte 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#
/opt/openmpi-gcc/bin/mpirun --display-allocation --display-map -v -np
$NSLOTS --host node0001,node0002 hostname
$ cat HPL_8cpu_GB.o46
== ALLOCATED NODES
Thanks. I've tried your suggestion.
$ cat hpl-8cpu-test.sge
#!/bin/bash
#
#$ -N HPL_8cpu_GB
#$ -pe orte 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#
/opt/openmpi-gcc/bin/mpirun -mca ras_gridengine_verbose 100 -v -np $NSLOTS
--host node0001,node0002 hostname
It allocated 2 nodes to run; however, all
Hi guys, I'll try to repost my question...
I have a problem with the latest stable build and the latest nightly snapshot.
When I run a job directly with mpirun there is no problem.
If I try to submit it with LSF:
bsub -a openmpi -m grid01 mpirun.lsf /mnt/ewd/mpi/fibonacci/fibonacci_mpi
I get the following error:
mpir
Ick is the proper response. :-)
The old 1.2 series would attempt to spawn a local orted on each of
those nodes, and that is what is failing. Best guess is that it is
because pbsdsh doesn't fully replicate a key part of the environment
that is expected.
One thing you could try is do this w
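One way to see which parts of the environment pbsdsh actually hands to the remote
shell (my own sketch, not necessarily what Ralph was about to suggest) is:

  pbsdsh bash -c 'hostname; env | grep -E "^(PBS|PATH=|LD_LIBRARY_PATH=)" | sort'

and then compare that against the environment of an interactive shell on the same node.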
OK, this is weird, and the correct answer is probably "don't do that".
Anyway:
A user wants to run many, many small jobs, faster than our scheduler
+ Torque can start them, so he uses pbsdsh to start them in parallel, under tm.
pbsdsh bash -c 'cd $PBS_O_WORKDIR/$PBS_VNODENUM; mpirun -np 1
application'
Rolf has correctly reminded me that display-allocation occurs prior to
host filtering, so you will see all of the allocated nodes. You'll see
the impact of the host specifications in display-map,
Sorry for the confusion - thanks to Rolf for pointing it out.
Ralph
On Apr 1, 2009, at 7:40 AM,
Hi Ralph,
unfortunately, on this machine I can't upgrade OpenMPI at the moment.
Is there a way to limit or reduce the probability of this error?
2009/4/1 Ralph Castain :
> Hi Gabriele
>
> I don't think this is a timeout issue. OMPI 1.2.x doesn't scale very well to
> that size due to a requireme
Hi Gabriele
I don't think this is a timeout issue. OMPI 1.2.x doesn't scale very
well to that size due to a requirement that the underlying out-of-band
system fully connect at the TCP level. Thus, every process in your job
will be opening 2002 sockets (one to every other process, one to the
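To put rough numbers on that (my arithmetic, based on the ~2000 processes mentioned
elsewhere in this thread, not figures from Ralph): each process holds on the order of
1999 peer sockets plus a couple more, i.e. the ~2002 descriptors quoted, and the job
as a whole needs about 2000 * 1999 / 2, roughly two million, TCP connections. That is
also why a default per-process descriptor limit (check with "ulimit -n"; 1024 is a
common default) tends to get in the way at this scale; that last point is my own note,
not something stated in this thread.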
As an FYI: you can debug allocation issues more easily by:
mpirun --display-allocation --do-not-launch -n 1 foo
This will read the allocation, do whatever host filtering you specify
with -host and -hostfile options, report out the result, and then
terminate without trying to launch anything.
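For instance, combining the flags already mentioned in this thread (a sketch only;
hostname stands in for the real binary), something like

  mpirun --display-allocation --display-map --do-not-launch \
         -host node0001,node0002 -np $NSLOTS hostname

should print the full allocation, then the map as filtered by -host, and exit
without launching anything.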
There is indeed a heartbeat mechanism you can use - it is "off" by
default. You can set it to check every N seconds with:
-mca orte_heartbeat_rate N
on your command line. Or if you want it to always run, add
"orte_heartbeat_rate = N" to your default MCA param file. OMPI will
declare the or
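For example (a sketch; 10 is an arbitrary interval, my_app is a placeholder, and
~/.openmpi/mca-params.conf is the usual per-user default param file):

  mpirun -mca orte_heartbeat_rate 10 -np 4 ./my_app

or, to make it the default for every run:

  echo "orte_heartbeat_rate = 10" >> ~/.openmpi/mca-params.conf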
It turns out that the --host and --hostfile options act as a filter on
which nodes to run on when you are running under SGE. So, listing them
several times does not affect where the processes land. However, this
still does not explain why you are seeing what you are seeing. One
thing you can
I mean I killed the orted daemon process while the MPI job was running, but the MPI
job hung and couldn't notice that one of its ranks had failed.
> Date: Wed, 1 Apr 2009 19:09:34 +0800
> From: ml.jgmben...@mailsnare.net
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Beginner's question: how to av
Hi all,
I have compiled OpenMPI 1.2.7 with the Intel compilers (icc and ifort) on
a cluster running CentOS 4.7. The build went fine, but when I try to launch an
execution, mpirun can't find some libraries.
When I checked the linked libraries on the nodes, the output was:
[marce@nodo1 ~]$ ldd /home/aplicaciones/ope
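The usual suspect in this situation (a guess on my part, since the ldd output is cut
off here) is that the Intel runtime libraries, libimf.so and friends, are not on
LD_LIBRARY_PATH for non-interactive shells on the compute nodes. A common fix, with
the Intel paths below standing in for wherever the compiler runtime actually lives:

  # in ~/.bashrc on every node (before any "exit if not interactive" test)
  export LD_LIBRARY_PATH=/opt/intel/cc/lib:/opt/intel/fc/lib:$LD_LIBRARY_PATH

Invoking mpirun by its full path, or passing --prefix <openmpi install dir>, also
helps Open MPI find its own libraries on the remote nodes.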
Is there a firewall somewhere?
Jerome
Guanyinzhu wrote:
Hi!
I'm using OpenMPI 1.3 on ten nodes connected with Gigabit Ethernet on
Redhat Linux x86_64.
I ran a test like this: I just killed the orted process, and the job hung
for a long time (it hung for 2~3 hours before I killed the job).
I h
Dear OpenMPI developers,
I have a strange problem while running my application (2000
processors). I'm using openmpi 1.2.22 over Infiniband. The following is
the mca-params.conf:
btl = ^tcp
btl_tcp_if_exclude = eth0,ib0,ib1
oob_tcp_include = eth1,lo,eth0
btl_openib_warn_default_gid_prefix = 0
btl
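A quick gloss on those lines, in case it helps (my reading of the standard MCA
parameter syntax, not something stated in this thread):

  btl = ^tcp                          # "^" excludes components: use every BTL except tcp
  btl_tcp_if_exclude = eth0,ib0,ib1   # interfaces the tcp BTL would skip (moot once tcp is excluded)
  oob_tcp_include = eth1,lo,eth0      # interfaces the out-of-band TCP layer may use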
Hi!
I'm using OpenMPI 1.3 on ten nodes connected with Gigabit Ethernet on Redhat
Linux x86_64.
I ran a test like this: I just killed the orted process, and the job hung for a
long time (it hung for 2~3 hours before I killed the job).
I have the following questions:
when network failed or
Hi Josh,
Yep, adding that "--with-ft=cr" flag did the trick. Thanks.
Cheers,
m
> From: jjhur...@open-mpi.org
> To: us...@open-mpi.org
> Date: Tue, 31 Mar 2009 15:48:05 -0400
> Subject: Re: [OMPI users] OpenMPI 1.3.1 + BLCR build problem
>
> I think that the missing configure option might be th
The difference you are seeing here indicates that the "direct" run is
using the rsh launcher, while the other run is using the Torque
launcher.
So I gather that by "direct" you mean that you don't get an allocation
from Maui before running the job, but for the other you do? Otherwise,
OMP
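If you want to see, or force, which launcher gets picked, something along these lines
should work on a 1.3-series build, where the launcher framework is called plm (it was
pls in 1.2); this is a sketch, not a command from this thread:

  mpirun -mca plm_base_verbose 5 -np 1 hostname   # report which launcher is selected
  mpirun -mca plm tm -np 1 hostname               # insist on the Torque/TM launcher
  mpirun -mca plm rsh -np 1 hostname              # insist on the rsh/ssh launcher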