On Mar 5, 2009, at 7:05 PM, Shinta Bonnefoy wrote:
Thanks, the option --mca btl ^openib works fine!
Half of the cluster has Infiniband/OpenFabrics (from node49 to
node96)
and the other half (nodes from 01 to 48) doesn't.
Ah... this explains things. I wonder if we have not tes
If you have a contact with Lahey support, it would be great to contact
them. Perhaps somehow the support in Libtool 2.2.6a wasn't complete...?
On Mar 5, 2009, at 7:28 PM, Tiago Silva wrote:
Yes, I am using 8.1a
lfc --version
Lahey/Fujitsu Linux64 Fortran Compiler Release L8.10a
Tiago Silva
Tiago Silva wrote:
Thanks,
I am reporting what I found out for the benefit of other Lahey users out
there. I have been told by people at Lahey that libtool has been updated
to support their compiler.
http://www.linux-archive.org/archlinux-development/156171-libtool-2-2-6a-1-a.html
Hi Jeff,
Thanks, the option --mca btl ^openib works fine!
Half of the cluster has Infiniband/OpenFabrics (from node49 to node96)
and the other half (nodes from 01 to 48) doesn't.
I just wanted to make openmpi run over ethernet/tcp first.
I will try to make it run using OpenFabrics but I gue
Many thanks for your help; it was not clear to me whether it was OPAL,
my application, or the standard C libs that were causing the segfault. It
is already good news that the problem is not at the level of Open MPI,
since this would have meant upgrading that library. My first reaction
would be to
Absolutely :) The last few entries on the stack are from OPAL (one of
the Open MPI libraries), which traps the segfault. Everything else
indicates where the segfault happened. What I can tell from this stack
trace is the following: the problem started in your function
wait_thread, which called
On Mar 5, 2009, at 1:29 PM, Jeff Squyres wrote:
On Mar 5, 2009, at 1:54 AM, Sangamesh B wrote:
The fortran application I'm using here is the CPMD-3.11.
I don't think the processor is Nehalem:
Intel(R) Xeon(R) CPU X5472 @ 3.00GHz
Installation procedure was same on both the cluste
We have an application that runs for a very long time with 16 processes
(the time is on the order of a few months; we do have checkpoints, but this
won't be the issue). It has happened twice that it fails with the error
message appended below after running undisturbed for 20-25 days. It has
happened twi
The fw version 2.3.0 is too old. I recommend upgrading to the
latest version (2.6.0) from the
Mellanox website:
http://www.mellanox.com/content/pages.php?pg=firmware_table_ConnectXIB
Thanks,
Pasha
Jeff Layton wrote:
Oops. I ran it on the head node and not the compute node. Here is the
output
On Mar 5, 2009, at 5:14 PM, Tiago Silva wrote:
I am reporting what I found out for the benefit of other Lahey users out
there. I have been told by people at Lahey that libtool has been updated
to support their compiler.
http://www.linux-archive.org/archlinux-development/156171-libtool-2-2-6a
Is gamess calling fork(), perchance? Perhaps through a system() or
popen() call?
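For reference, the pattern being asked about looks roughly like the sketch below: an MPI process that shells out via system(), which calls fork() underneath. This is purely illustrative (the command and program are made up, not GAMESS's code); fork()/system()/popen() after MPI_Init() is what tends to cause trouble over the OpenFabrics stack.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, rc;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* system() forks a child shell; with the openib BTL, this kind of
     * fork() after MPI_Init() is exactly what can cause trouble. */
    if (rank == 0) {
        rc = system("hostname");    /* any external command */
        printf("rank 0: system() returned %d\n", rc);
    }

    MPI_Finalize();
    return 0;
}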
On Mar 5, 2009, at 3:50 AM, Thomas Exner wrote:
Dear Jeff:
Thank you very much for your reply. Unfortunately, the overloading is
not the problem. The phenomenon also appears if we use only two
processes on the
Whoops; we shouldn't be seg faulting. :-\
The warning is exactly what it implies -- it found the OpenFabrics
network stack but no functioning OpenFabrics-capable hardware. You can
disable it (and the segv) by preventing the openib BTL from running:
mpirun --mca btl ^openib
But what
Oops. I ran it on the head node and not the compute node. Here is the
output from a compute node:
hca_id: mlx4_0
fw_ver: 2.3.000
node_guid: 0018:8b90:97fe:1b6d
sys_image_guid: 0018:8b90:97fe:1b70
vendor_id:
Thanks,
I am reporting what I found out for the benefit of other Lahey users out
there. I have been told by people at Lahey that libtool has been updated
to support their compiler.
http://www.linux-archive.org/archlinux-development/156171-libtool-2-2-6a-1-a.html
Unfortunately this seems to be
Do you have the same HCA adapter type on all of your machines?
In the error log I see an mlx4 error message, and mlx4 is the ConnectX
driver, but ibv_devinfo shows an older HCA.
Pasha
Jeff Layton wrote:
Pasha,
Here you go... :) Thanks for looking at this.
Jeff
hca_id: mthca0
fw_ver:
Thanks Pasha!
ibdiagnet reports the following:
-I---
-I- IPoIB Subnets Check
-I---
-I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Port localhost/P1 lid=0x00e2 guid=
On Thu, Mar 05, 2009 at 10:27:27AM +0200, Pavel Shamis (Pasha) wrote:
>
> >Time to dig up diagnostics tools and look at port statistics.
> >
> You may use the ibdiagnet tool for network debugging -
> http://linux.die.net/man/1/ibdiagnet. This tool is part of OFED.
>
> Pasha.
> __
Pasha,
Here you go... :) Thanks for looking at this.
Jeff
hca_id: mthca0
fw_ver: 4.8.200
node_guid: 0003:ba00:0100:38ac
sys_image_guid: 0003:ba00:0100:38af
vendor_id: 0x02c9
vend
On Mar 5, 2009, at 1:54 AM, Sangamesh B wrote:
The fortran application I'm using here is the CPMD-3.11.
I don't think the processor is Nehalem:
Intel(R) Xeon(R) CPU X5472 @ 3.00GHz
The installation procedure was the same on both clusters. I've not set
mpi_affinity.
This is a memory
Joe Landman wrote:
Ralph Castain wrote:
Ummm...not to put gasoline on the fire, but...if the data exchange is
blocking, why do you need to call a barrier op first? Just use an
appropriate blocking data exchange call (collective or whatever) and
it will "barrier" anyway.
Since I don't run t
Hi All
Joe Landman wrote:
Ralph Castain wrote:
Ummm...not to put gasoline on the fire, but...if the data exchange is
blocking, why do you need to call a barrier op first? Just use an
appropriate blocking data exchange call (collective or whatever) and
it will "barrier" anyway.
Since I don
First, you can add --launch-agent rsh to the command line and that
will have OMPI use rsh.
It sounds like your remote nodes may not be seeing your OMPI install
directory. Several ways you can resolve that - here are a couple:
1. add the install directory to your LD_LIBRARY_PATH in your .csh
Could you tell us what version of Open MPI you are using, a little
about your system (I would assume you are using ssh?), and how this
was configured?
Thanks
Ralph
On Mar 5, 2009, at 9:31 AM, justin oppenheim wrote:
Hi:
When I execute something like
mpirun -machinefile machinefile my_mp
Jeff Squyres wrote:
If you're exchanging data at the end of an iteration, then you
effectively have a synchronization anyway -- no need for an extra
barrier synchronization.
Ralph Castain wrote:
Ummm...not to put gasoline on the fire, but...if the data exchange
is blocking, why do you n
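The point being made in this thread can be seen in a minimal sketch like the one below (illustrative only, not anyone's actual application): the blocking exchange at the end of each iteration already couples the ranks, so an extra MPI_Barrier adds nothing.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, iter;
    double local, from_left;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = (double)rank;
    for (iter = 0; iter < 10; ++iter) {
        /* ... local computation for this iteration would go here ... */

        /* Blocking ring exchange: send to the right neighbour, receive
         * from the left.  MPI_Sendrecv returns only when both halves
         * complete, so no explicit MPI_Barrier is needed here. */
        MPI_Sendrecv(&local, 1, MPI_DOUBLE, (rank + 1) % size, 0,
                     &from_left, 1, MPI_DOUBLE, (rank - 1 + size) % size, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        local = from_left;
    }

    printf("rank %d finished with value %g\n", rank, local);
    MPI_Finalize();
    return 0;
}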
Hi:
When I execute something like
mpirun -machinefile machinefile my_mpi_executable
I get something like this
my_mpi_executable symbol lookup error: remote_openmpi/lib/libmpi_cxx.so.0:
undefined symbol: ompi_registered_datareps
where both my_mpi_executable and remote_openmpi are installed o
Ralph Castain wrote:
Ummm...not to put gasoline on the fire, but...if the data exchange is
blocking, why do you need to call a barrier op first? Just use an
appropriate blocking data exchange call (collective or whatever) and it
will "barrier" anyway.
Since I don't run these codes, I would
Bah, I should have been more precise in this:
not just any old tests/benchmarks but
recommended, reliable tests/benchmarks?
Yury Tarasievich wrote:
Are there any recommended tests/benchmarks for heterogeneous
installations? I'd like to have something measuring the throughput of
lengthy comp
On Mar 5, 2009, at 8:50 AM, Joe Landman wrote:
Jeff Squyres wrote:
On Mar 5, 2009, at 10:33 AM, Gerry Creager wrote:
We've been playing with it in a coupled atmosphere-ocean model to
allow
the two to synchronize and exchange data. The models have differing
levels of physics complexity and
Jeff
I will perhaps remember your statement as part of a religious scripture!
Request to you and everyone else: if you know of a good book and/or
online tutorial on 'how to write large parallel scientific programs',
I am sure it would be of immense use to everyone on this list.
Best regards
Thank you, Jeff and Ganesh.
My current research is trying to rewrite some collective MPI
operations to work with our system. Barrier is my first step; maybe I
will add bcast and reduce in the future. I understand that some
applications use too many unnecessary barriers. But here what I
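As an illustration of the kind of starting point such a rewrite might use, here is a naive barrier built only from point-to-point calls. This is a sketch, not Open MPI's actual barrier algorithm (production implementations use tree or dissemination schemes), and my_barrier is just a hypothetical name.

#include <mpi.h>

static void my_barrier(MPI_Comm comm)
{
    int rank, size, i;
    char token = 0;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {
        /* Gather phase: wait until every other rank has checked in. */
        for (i = 1; i < size; ++i)
            MPI_Recv(&token, 1, MPI_CHAR, i, 0, comm, MPI_STATUS_IGNORE);
        /* Release phase: tell every rank it may proceed. */
        for (i = 1; i < size; ++i)
            MPI_Send(&token, 1, MPI_CHAR, i, 1, comm);
    } else {
        MPI_Send(&token, 1, MPI_CHAR, 0, 0, comm);
        MPI_Recv(&token, 1, MPI_CHAR, 0, 1, comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    my_barrier(MPI_COMM_WORLD);   /* behaves like MPI_Barrier, just much slower */
    MPI_Finalize();
    return 0;
}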
Jeff Squyres wrote:
On Mar 5, 2009, at 10:33 AM, Gerry Creager wrote:
We've been playing with it in a coupled atmosphere-ocean model to allow
the two to synchronize and exchange data. The models have differing
levels of physics complexity and the time step requirements are
significantly differ
On Mar 5, 2009, at 10:33 AM, Gerry Creager wrote:
We've been playing with it in a coupled atmosphere-ocean model to
allow
the two to synchronize and exchange data. The models have differing
levels of physics complexity and the time step requirements are
significantly different. To sync them
We've been playing with it in a coupled atmosphere-ocean model to allow
the two to synchronize and exchange data. The models have differing
levels of physics complexity and the time step requirements are
significantly different. To sync them up we have to know where the
timesteps are identica
We have a paper on the very topic that Jeff just mentioned:
Subodh Sharma, Sarvani Vakkalanka, Ganesh Gopalakrishnan, Robert M.
Kirby, Rajeev Thakur, and William Gropp, "A Formal Approach to Detect
Functionally Irrelevant Barriers in MPI Programs," Recent Advances in
Parallel Virtual Machine and Message Passing Interface
On Mar 5, 2009, at 9:29 AM, Shanyuan Gao wrote:
I am doing some research on MPI barrier operations, and I am ready
to do some performance tests.
I wonder if there are any applications that use barriers a lot.
Please let me know if there
are any. Any comments are welcome. Thanks!
I don't
Hi,
I am doing some research on MPI barrier operations, and I am ready
to do some performance tests.
I wonder if there are any applications that use barriers a lot.
Please let me know if there
are any. Any comments are welcome. Thanks!
Shan
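For a first performance number, a minimal timing loop like the sketch below is often enough; established suites such as the Intel MPI Benchmarks or the OSU micro-benchmarks measure barrier latency more carefully. The warm-up and iteration counts here are arbitrary choices.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int warmup = 100, iters = 10000;
    int rank, i;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < warmup; ++i)          /* warm up connections first */
        MPI_Barrier(MPI_COMM_WORLD);

    t0 = MPI_Wtime();
    for (i = 0; i < iters; ++i)
        MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("average MPI_Barrier time: %g us\n", (t1 - t0) / iters * 1.0e6);

    MPI_Finalize();
    return 0;
}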
Are there any recommended tests/benchmarks for heterogeneous
installations? I'd like to have something measuring the throughput of
lengthy computations, which would be executed on the installation with
heterogeneous nodes.
Thanks.
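There is no single standard answer in this thread, but as a rough sketch of a per-rank compute-rate check for a heterogeneous cluster, something like the following could serve as a starting point. The loop length and arithmetic are arbitrary, and real suites (e.g. the NAS Parallel Benchmarks) are far more rigorous.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    long n = 100000000L;              /* arbitrary fixed amount of work */
    volatile double x = 1.0;          /* volatile keeps the loop from being optimized away */
    double t0, elapsed, *all = NULL;
    long j;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    t0 = MPI_Wtime();
    for (j = 0; j < n; ++j)           /* same reference work on every rank */
        x = x * 1.0000001 + 0.0000001;
    elapsed = MPI_Wtime() - t0;

    if (rank == 0)
        all = malloc(size * sizeof(double));
    MPI_Gather(&elapsed, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (i = 0; i < size; ++i)
            printf("rank %d: %.2f s for the reference loop\n", i, all[i]);
        free(all);
    }

    MPI_Finalize();
    return 0;
}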
Dear Jeff:
Thank you very much for your reply. Unfortunately, the overloading is
not the problem. The phenomenon also appears if we use only two
processes on the 8-core machines. When I run the jobs over two nodes, one
is doing nothing anymore after a couple of minutes. The strange thing
is that t
Jeff,
Can you please provide more information about your HCA type (ibv_devinfo -v).
Do you see this error immediately during startup, or do you get it during
your run?
Thanks,
Pasha
Jeff Layton wrote:
Evening everyone,
I'm running a CFD code on IB and I've encountered an error I'm not
sure about
Time to dig up diagnostics tools and look at port statistics.
You may use the ibdiagnet tool for network debugging -
http://linux.die.net/man/1/ibdiagnet. This tool is part of OFED.
Pasha.
The fortran application I'm using here is the CPMD-3.11.
I don't think the processor is Nehalem:
Intel(R) Xeon(R) CPU X5472 @ 3.00GHz
The installation procedure was the same on both clusters. I've not set mpi_affinity.
This is a memory-intensive application, but this job was not using
th
Hi,
I am the admin of a small cluster (server running under SLES 10.1 and
nodes on OSS 10.3), and I have just installed Open MPI 1.3 on it.
I'm trying to get a simple program (like hello world) running, but it
fails all the time on one of the nodes and never on the others.
I don't think it's related