Re: [OMPI users] Runtime error only on one node.

2009-03-05 Thread Jeff Squyres
On Mar 5, 2009, at 7:05 PM, Shinta Bonnefoy wrote: Thanks, the option --mca btl ^openib works fine! Half of the cluster has InfiniBand/OpenFabrics (from node49 to node96) and the other half (nodes from 01 to 48) doesn't. Ah... this explains things. I wonder if we have not tes

Re: [OMPI users] Lahey 64 bit and openmpi 1.3?

2009-03-05 Thread Jeff Squyres
If you have a contact with Lahey support, it would be great to contact them. Perhaps somehow the support in Libtool 2.2.6a wasn't complete...? On Mar 5, 2009, at 7:28 PM, Tiago Silva wrote: Yes, I am using 8.1a lfc --version Lahey/Fujitsu Linux64 Fortran Compiler Release L8.10a Tiago Silv

Re: [OMPI users] Lahey 64 bit and openmpi 1.3?

2009-03-05 Thread Tiago Silva
Yes, I am using 8.1a lfc --version Lahey/Fujitsu Linux64 Fortran Compiler Release L8.10a Tiago Silva wrote: Thanks, I am reporting what I found out for the benefit of other lahey users out there. I have been told by people at Lahey that libtool has been updated to support their compiler. ht

Re: [OMPI users] Runtime error only on one node.

2009-03-05 Thread Shinta Bonnefoy
Hi Jeff, Thanks, the option --mca btl ^openib works fine! Half of the cluster has InfiniBand/OpenFabrics (from node49 to node96) and the other half (nodes from 01 to 48) doesn't. I just wanted to make openmpi run over ethernet/tcp first. I will try to make it run using OpenFabrics but I gue

Re: [OMPI users] "casual" error

2009-03-05 Thread Biagio Lucini
Many thanks for your help, it was not clear to me whether it was opal, my application or the standard C libs that were causing the segfault. It is already good news that the problem is not at the level of OpenMPI, since this would have meant upgrading that library. My first reaction would be to

Re: [OMPI users] "casual" error

2009-03-05 Thread George Bosilca
Absolutely :) The last few entries on the stack are from OPAL (one of the Open MPI libraries) that trap the segfault. Everything else indicates where the segfault happened. What I can tell from this stack trace is the following: the problem started in your function wait_thread which called

Re: [OMPI users] Low performance of Open MPI-1.3 over Gigabit

2009-03-05 Thread Ralph Castain
On Mar 5, 2009, at 1:29 PM, Jeff Squyres wrote: On Mar 5, 2009, at 1:54 AM, Sangamesh B wrote: The fortran application I'm using here is the CPMD-3.11. I don't think the processor is Nehalem: Intel(R) Xeon(R) CPU X5472 @ 3.00GHz Installation procedure was same on both the cluste

[OMPI users] "casual" error

2009-03-05 Thread Biagio Lucini
We have an application that runs for a very long time with 16 processes (the time is of the order of a few months; we do have checkpoints, but this won't be the issue). It has happened twice that it fails with the error message appended below after running undisturbed for 20-25 days. It has happened twi

Re: [OMPI users] mlx4 error - looking for guidance

2009-03-05 Thread Pavel Shamis (Pasha)
The fw version 2.3.0 is too old. I recommend upgrading to the latest version (2.6.0) from the Mellanox website http://www.mellanox.com/content/pages.php?pg=firmware_table_ConnectXIB Thanks, Pasha Jeff Layton wrote: Oops. I ran it on the head node and not the compute node. Here is the output

Re: [OMPI users] Lahey 64 bit and openmpi 1.3?

2009-03-05 Thread Jeff Squyres
On Mar 5, 2009, at 5:14 PM, Tiago Silva wrote: I am reporting what I found out for the benefit of other lahey users out there. I have been told by people at Lahey that libtool has been updated to support their compiler. http://www.linux-archive.org/archlinux-development/156171-libtool-2-2-6a

Re: [OMPI users] Gamess with openmpi

2009-03-05 Thread Jeff Squyres
Is gamess calling fork(), perchance? Perhaps through a system() or popen() call? On Mar 5, 2009, at 3:50 AM, Thomas Exner wrote: Dear Jeff: Thank you very much for your reply. Unfortunately, the overloading is not the problem. The phenomenon also appears if we use only two processes on the

Re: [OMPI users] Runtime error only on one node.

2009-03-05 Thread Jeff Squyres
Whoops; we shouldn't be seg faulting. :-\ The warning is exactly what it implies -- it found the OpenFabrics network stack but no functioning OpenFabrics-capable hardware. You can disable it (and the segv) by disabling the openfabrics BTL from running: mpirun --mca btl ^openib But what
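The workaround quoted above can be wrapped in a small launcher helper. This is a minimal sketch, assuming a Python wrapper script is acceptable; `build_mpirun_command` and `./hello_world` are hypothetical names, and the only option taken from the thread is `--mca btl ^openib`.

```python
def build_mpirun_command(program, exclude_btls=("openib",), extra_args=()):
    """Build an mpirun argument list that excludes the given BTL components.

    Hypothetical helper for illustration only; the option it emits is the
    `--mca btl ^openib` form quoted in the thread.
    """
    cmd = ["mpirun"]
    if exclude_btls:
        # A leading '^' asks Open MPI to exclude the listed components.
        cmd += ["--mca", "btl", "^" + ",".join(exclude_btls)]
    cmd += list(extra_args)
    cmd.append(program)
    return cmd


if __name__ == "__main__":
    # './hello_world' is a placeholder program name.
    print(" ".join(build_mpirun_command("./hello_world")))
```

Passing the resulting list to `subprocess.run` (rather than a single shell string) avoids any shell interpretation of the `^` character.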

Re: [OMPI users] mlx4 error - looking for guidance

2009-03-05 Thread Jeff Layton
Oops. I ran it on the head node and not the compute node. Here is the output from a compute node: hca_id: mlx4_0 fw_ver: 2.3.000 node_guid: 0018:8b90:97fe:1b6d sys_image_guid: 0018:8b90:97fe:1b70 vendor_id:

Re: [OMPI users] Lahey 64 bit and openmpi 1.3?

2009-03-05 Thread Tiago Silva
Thanks, I am reporting what I found out for the benefit of other lahey users out there. I have been told by people at Lahey that libtool has been updated to support their compiler. http://www.linux-archive.org/archlinux-development/156171-libtool-2-2-6a-1-a.html Unfortunately this seems to be

Re: [OMPI users] mlx4 error - looking for guidance

2009-03-05 Thread Pavel Shamis (Pasha)
Do you have the same HCA adapter type on all of your machines? In the error log I see an mlx4 error message, and mlx4 is the ConnectX driver, but ibv_devinfo shows some older HCA. Pasha Jeff Layton wrote: Pasha, Here you go... :) Thanks for looking at this. Jeff hca_id: mthca0 fw_ver:

Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-05 Thread Pavel Shamis (Pasha)
Thanks Pasha! ibdiagnet reports the following: -I--- -I- IPoIB Subnets Check -I--- -I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00 -W- Port localhost/P1 lid=0x00e2 guid=

Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-05 Thread Jan Lindheim
On Thu, Mar 05, 2009 at 10:27:27AM +0200, Pavel Shamis (Pasha) wrote: Time to dig up diagnostics tools and look at port statistics. You may use ibdiagnet tool for the network debug - http://linux.die.net/man/1/ibdiagnet. This tool is part of OFED. Pasha.

Re: [OMPI users] mlx4 error - looking for guidance

2009-03-05 Thread Jeff Layton
Pasha, Here you go... :) Thanks for looking at this. Jeff hca_id: mthca0 fw_ver: 4.8.200 node_guid: 0003:ba00:0100:38ac sys_image_guid: 0003:ba00:0100:38af vendor_id: 0x02c9 vend

Re: [OMPI users] Low performance of Open MPI-1.3 over Gigabit

2009-03-05 Thread Jeff Squyres
On Mar 5, 2009, at 1:54 AM, Sangamesh B wrote: The fortran application I'm using here is the CPMD-3.11. I don't think the processor is Nehalem: Intel(R) Xeon(R) CPU X5472 @ 3.00GHz Installation procedure was same on both the clusters. I've not set mpi_affinity. This is a memory

Re: [OMPI users] Any scientific application heavily using MPI_Barrier?

2009-03-05 Thread Gerry Creager
Joe Landman wrote: Ralph Castain wrote: Ummm...not to put gasoline on the fire, but...if the data exchange is blocking, why do you need to call a barrier op first? Just use an appropriate blocking data exchange call (collective or whatever) and it will "barrier" anyway. Since I don't run t

Re: [OMPI users] Any scientific application heavily using MPI_Barrier?

2009-03-05 Thread Gus Correa
Hi All Joe Landman wrote: Ralph Castain wrote: Ummm...not to put gasoline on the fire, but...if the data exchange is blocking, why do you need to call a barrier op first? Just use an appropriate blocking data exchange call (collective or whatever) and it will "barrier" anyway. Since I don

Re: [OMPI users] Run-time problem

2009-03-05 Thread Ralph Castain
First, you can add --launch-agent rsh to the command line and that will have OMPI use rsh. It sounds like your remote nodes may not be seeing your OMPI install directory. Several ways you can resolve that - here are a couple: 1. add the install directory to your LD_LIBRARY_PATH in your .csh
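The first suggestion above (making the Open MPI libraries visible through `LD_LIBRARY_PATH`) can be illustrated with a short sketch. It assumes the shared libraries live in a `lib` subdirectory of the install directory; `prepend_ompi_lib` and the prefix `/opt/openmpi` are hypothetical, not taken from the thread.

```python
import os


def prepend_ompi_lib(env, prefix):
    """Return a copy of `env` with `<prefix>/lib` first in LD_LIBRARY_PATH.

    Assumption: the Open MPI shared libraries are under `<prefix>/lib`.
    """
    lib_dir = os.path.join(prefix, "lib")
    new_env = dict(env)
    current = new_env.get("LD_LIBRARY_PATH", "")
    # Drop empty entries and any earlier copy of lib_dir to avoid duplicates.
    parts = [p for p in current.split(os.pathsep) if p and p != lib_dir]
    new_env["LD_LIBRARY_PATH"] = os.pathsep.join([lib_dir] + parts)
    return new_env


if __name__ == "__main__":
    # '/opt/openmpi' is a placeholder install prefix.
    env = prepend_ompi_lib(os.environ, "/opt/openmpi")
    print(env["LD_LIBRARY_PATH"])
```

An environment built this way only affects processes started with it locally; remote nodes still need the variable set in their own login environment, which is what the shell startup file approach addresses.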

Re: [OMPI users] Run-time problem

2009-03-05 Thread Ralph Castain
Could you tell us what version of Open MPI you are using, a little about your system (I would assume you are using ssh?), and how this was configured? Thanks Ralph On Mar 5, 2009, at 9:31 AM, justin oppenheim wrote: Hi: When I execute something like mpirun -machinefile machinefile my_mp

Re: [OMPI users] Any scientific application heavily using MPI_Barrier?

2009-03-05 Thread Eugene Loh
Jeff Squyres wrote: If you're exchanging data at the end of an iteration, then you effectively have a synchronization anyway -- no need for an extra barrier synchronization. Ralph Castain wrote: Ummm...not to put gasoline on the fire, but...if the data exchange is blocking, why do you n
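The point made in this exchange, that a blocking data exchange already synchronizes the participants, can be illustrated without MPI. The sketch below is only an analogy: two Python threads stand in for two ranks and `queue.Queue` stands in for the message channel; `run_exchange` is a hypothetical name, and real MPI send semantics (for example eager sends) differ in detail.

```python
import queue
import threading


def run_exchange(iterations=3):
    """Two simulated ranks swap one value per iteration, with no barrier."""
    inboxes = [queue.Queue(), queue.Queue()]
    received_steps = [[], []]

    def rank(me):
        peer = 1 - me
        for step in range(iterations):
            local_result = (me, step)        # stand-in for one iteration of work
            inboxes[peer].put(local_result)  # "send" to the peer
            # Blocking "receive": this rank cannot start step + 1 until the
            # peer has sent its data for the current step.
            _sender, peer_step = inboxes[me].get()
            received_steps[me].append(peer_step)

    threads = [threading.Thread(target=rank, args=(r,)) for r in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received_steps


if __name__ == "__main__":
    print(run_exchange())
```

Because each receive blocks until the peer's data for that step arrives, neither simulated rank can run more than one step ahead of the other, which is the effect an explicit barrier before the exchange would otherwise be expected to provide.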

[OMPI users] Run-time problem

2009-03-05 Thread justin oppenheim
Hi: When I execute something like mpirun -machinefile machinefile my_mpi_executable I get something like this my_mpi_executable symbol lookup error: remote_openmpi/lib/libmpi_cxx.so.0: undefined symbol: ompi_registered_datareps where both my_mpi_executable and remote_openmpi are installed o

Re: [OMPI users] Any scientific application heavily using MPI_Barrier?

2009-03-05 Thread Joe Landman
Ralph Castain wrote: Ummm...not to put gasoline on the fire, but...if the data exchange is blocking, why do you need to call a barrier op first? Just use an appropriate blocking data exchange call (collective or whatever) and it will "barrier" anyway. Since I don't run these codes, I would

Re: [OMPI users] tests for heterogenous installations?

2009-03-05 Thread Yury Tarasievich
Bah, I should have been more precise in this: not just any old tests/benchmarks but recommended, reliable tests/benchmarks? Yury Tarasievich wrote: Are there any recommended tests/benchmarks for the heterogenous installations? I'd like to have something measuring the throughput of lengthy comp

Re: [OMPI users] Any scientific application heavily using MPI_Barrier?

2009-03-05 Thread Ralph Castain
On Mar 5, 2009, at 8:50 AM, Joe Landman wrote: Jeff Squyres wrote: On Mar 5, 2009, at 10:33 AM, Gerry Creager wrote: We've been playing with it in a coupled atmosphere-ocean model to allow the two to synchronize and exchange data. The models have differing levels of physics complexity and

Re: [OMPI users] Any scientific application heavily using MPI_Barrier?

2009-03-05 Thread Durga Choudhury
Jeff I would perhaps remember your statement like part of a religious scripture! Request to you and everyone else: if you know of a good book and/or online tutorial on 'how to write large parallel scientific programs', I am sure it would be of immense use to everyone in this list. Best regards

Re: [OMPI users] Any scientific application heavily using MPI_Barrier?

2009-03-05 Thread Ganesh
Thank you, Jeff and Ganesh. My current research is trying to rewrite some collective MPI operations to work with our system. Barrier is my first step, maybe I will have bcast and reduce in the future. I understand that some applications used too many unnecessary barriers. But here what I

Re: [OMPI users] Any scientific application heavily using MPI_Barrier?

2009-03-05 Thread Shanyuan Gao
Thank you, Jeff and Ganesh. My current research is trying to rewrite some collective MPI operations to work with our system. Barrier is my first step, maybe I will have bcast and reduce in the future. I understand that some applications used too many unnecessary barriers. But here what I

Re: [OMPI users] Any scientific application heavily using MPI_Barrier?

2009-03-05 Thread Joe Landman
Jeff Squyres wrote: On Mar 5, 2009, at 10:33 AM, Gerry Creager wrote: We've been playing with it in a coupled atmosphere-ocean model to allow the two to synchronize and exchange data. The models have differing levels of physics complexity and the time step requirements are significantly differ

Re: [OMPI users] Any scientific application heavily using MPI_Barrier?

2009-03-05 Thread Jeff Squyres
On Mar 5, 2009, at 10:33 AM, Gerry Creager wrote: We've been playing with it in a coupled atmosphere-ocean model to allow the two to synchronize and exchange data. The models have differing levels of physics complexity and the time step requirements are significantly different. To sync them

Re: [OMPI users] Any scientific application heavily using MPI_Barrier?

2009-03-05 Thread Gerry Creager
We've been playing with it in a coupled atmosphere-ocean model to allow the two to synchronize and exchange data. The models have differing levels of physics complexity and the time step requirements are significantly different. To sync them up we have to know where the timesteps are identica

Re: [OMPI users] Any scientific application heavily using MPI_Barrier?

2009-03-05 Thread Ganesh
We have a paper on the very topic that Jeff just mentioned: Subodh Sharma, Sarvani Vakkalanka, Ganesh Gopalakrishnan, Robert M. Kirby, Rajeev Thakur, and William Gropp, "A Formal Approach to Detect Functionally Irrelevant Barriers in MPI Programs," Recent Advances in Parallel Virtual Machi

Re: [OMPI users] Any scientific application heavily using MPI_Barrier?

2009-03-05 Thread Jeff Squyres
On Mar 5, 2009, at 9:29 AM, Shanyuan Gao wrote: I am doing some research on MPI barrier operations, and I am ready to do some performance tests. I wonder if there are any applications that use barriers a lot. Please let me know if there are any. Any comments are welcome. Thanks! I don't

[OMPI users] Any scientific application heavily using MPI_Barrier?

2009-03-05 Thread Shanyuan Gao
Hi, I am doing some research on MPI barrier operations, and I am ready to do some performance tests. I wonder if there are any applications that use barriers a lot. Please let me know if there are any. Any comments are welcome. Thanks! Shan

[OMPI users] tests for heterogenous installations?

2009-03-05 Thread Yury Tarasievich
Are there any recommended tests/benchmarks for the heterogenous installations? I'd like to have something measuring the throughput of lengthy computations, which would be executed on the installation with the heterogenous nodes. Thanks.

Re: [OMPI users] Gamess with openmpi

2009-03-05 Thread Thomas Exner
Dear Jeff: Thank you very much for your reply. Unfortunately, the overloading is not the problem. The phenomenon also appears if we use only two processes on the 8core machines. When I run the jobs over two nodes, one is doing nothing anymore after a couple of minutes. The strange thing is that t

Re: [OMPI users] mlx4 error - looking for guidance

2009-03-05 Thread Pavel Shamis (Pasha)
Jeff, Can you please provide more information about your HCA type (ibv_devinfo -v). Do you see this error immediately during startup, or do you get it during your run? Thanks, Pasha Jeff Layton wrote: Evening everyone, I'm running a CFD code on IB and I've encountered an error I'm not sure about

Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-05 Thread Pavel Shamis (Pasha)
Time to dig up diagnostics tools and look at port statistics. You may use ibdiagnet tool for the network debug - http://linux.die.net/man/1/ibdiagnet. This tool is part of OFED. Pasha.

Re: [OMPI users] Low performance of Open MPI-1.3 over Gigabit

2009-03-05 Thread Sangamesh B
The fortran application I'm using here is the CPMD-3.11. I don't think the processor is Nehalem: Intel(R) Xeon(R) CPU X5472 @ 3.00GHz Installation procedure was same on both the clusters. I've not set mpi_affinity. This is a memory intensive application, but this job was not using th

[OMPI users] Runtime error only on one node.

2009-03-05 Thread Shinta Bonnefoy
Hi, I am the admin of a small cluster (server running under SLES 10.1 and nodes on OSS 10.3), and I have just installed openmpi 1.3 on it. I'm trying to get a simple program (like hello world) running, but it fails all the time on one of the nodes and never on the others. I don't think it's related