Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads

2009-06-11 Thread François Trahay
Well, according to George Bosilca (http://www.open-mpi.org/community/lists/users/2005/02/0005.php), threads are supported in OpenMPI. The program I am trying to run works with the TCP stack, and the MX driver is thread-safe, so I guess the problem comes from the MX BTL or MTL. Francois Scott Atchley wr

[OMPI users] After upgrading to 1.3.2 some nodes hang on MPI-Applications

2009-06-11 Thread jody
Hi, After updating all my nodes to Open-MPI 1.3.2 (with --enable-mpi-threads), some of them fail to execute a simple MPI test program: they seem to hang. With --debug-daemons the application seems to execute (two lines of output) but hangs before returning: [jody@aplankton neander]$ mpirun -np 2 --h

Re: [OMPI users] After upgrading to 1.3.2 some nodes hang on MPI-Applications

2009-06-11 Thread jody
More info: I checked and found that not all nodes are equal: the ones that don't work have mpi-threads *and* progress-threads enabled, whereas the ones that work have only mpi-threads enabled. Is there a problem when both thread types are enabled? Jody On Thu, Jun 11, 2009 at 12:19 PM, jody wrote

Re: [OMPI users] Problems with Open MPI/BLCR checkpoint/restart routine.

2009-06-11 Thread pat . o'bryant
Gleb, I am trying to use BLCR as well. What levels of OpenMPI, OFED, and BLCR are you using? I can get a serial checkpoint/restart to work but not the parallel case. I built my system using OFED 1.3.1, OpenMPI 1.3.1, and BLCR 0.8.1-1. I also used your same BLCR configuration options for OpenM

Re: [OMPI users] After upgrading to 1.3.2 some nodes hang on MPI-Applications

2009-06-11 Thread Ralph Castain
It's the --enable-progress-threads flag that causes the problem - we don't really support that yet. Maybe someday. Take that out and you should be okay, with the caveats expressed on the OMPI web site (i.e., not everything works with threads yet). On Jun 11, 2009, at 4:56 AM, jody wrote:

Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads

2009-06-11 Thread Scott Atchley
Francois, For threads, the FAQ has: http://www.open-mpi.org/faq/?category=supported-systems#thread-support It mentions that thread support is designed in, but lightly tested. It is also possible that the FAQ is out of date and MPI_THREAD_MULTIPLE is fully supported. The stack trace below

Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads

2009-06-11 Thread George Bosilca
The comment on the FAQ (and on the other thread) is only true for some BTLs (TCP, SM and MX). I don't have the resources to test the other BTLs; it is their developers' responsibility to make the modifications required to make them thread safe. In addition, I have to confess that I never teste

Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads

2009-06-11 Thread Brian Barrett
Neither the CM PML nor the MX MTL has been looked at for thread safety. There's not much code to cause problems in the CM PML. The MX MTL would likely need some work to ensure the restrictions Scott mentioned are met (currently, there's no such guarantee in the MX MTL). Brian On Jun 11, 2

[OMPI users] Using rsh instead of ssh during ompi-restart

2009-06-11 Thread Gleb "Crazy Sage" Igumnov
Hello. I've got the following problem: I'm trying to restart a parallel job on our cluster using the following command line: /common/openmpi-1.3.2/ompi-restart -mca plm-rsh-agent rsh -verbose -hostfile hfile ompi_global_snapshot_25229.ckpt Despite using this mca option, I got the following error message:

Re: [OMPI users] Using rsh instead of ssh during ompi-restart

2009-06-11 Thread Ralph Castain
The problem is that you misspelled the mca param - it should be: -mca plm_rsh_agent rsh On Jun 11, 2009, at 10:34 AM, Gleb Crazy Sage Igumnov wrote: Hello. I've got following problem: I'm trying to restart parallel job over our cluster using following command line: /common/openmpi-1.3.2/ompi-
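For reference, the correction above amounts to a one-token change in the command line from the original post. The snippet below is only an illustrative sketch: it rewrites the command as a string and prints it, without running ompi-restart. The variable names cmd and fixed are made up for this example; the command itself and the parameter spelling plm_rsh_agent are taken from the two messages in this thread.

```shell
# Command line as quoted in the original post (hyphenated parameter name).
cmd='/common/openmpi-1.3.2/ompi-restart -mca plm-rsh-agent rsh -verbose -hostfile hfile ompi_global_snapshot_25229.ckpt'

# Swap in the underscore spelling given in the reply.
# "cmd" and "fixed" are hypothetical names used only for this sketch.
fixed=$(printf '%s\n' "$cmd" | sed 's/plm-rsh-agent/plm_rsh_agent/')

# Print the corrected command line for inspection; nothing is executed.
printf '%s\n' "$fixed"
```

Only the parameter name changes; the remaining options are left exactly as posted.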

Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads

2009-06-11 Thread Scott Atchley
Brian and George, I do not know if the stack trace is complete, but I do not see any mx_* functions called, which would indicate a crash inside MX due to multiple threads trying to complete the same request. It does show a failed assert. Francois, is the stack trace from the MX MTL or BTL

Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads

2009-06-11 Thread François Trahay
The stack trace is from the MX MTL (I attach the backtraces I get with both the MX MTL and the MX BTL). Here is the program that I use. It is quite simple. It runs ping-pongs concurrently (with one thread per node, then with two threads per node, etc.). The error occurs when two threads run concurrently.

Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads

2009-06-11 Thread Brian Barrett
Almost assuredly, the MTL is not thread safe, and such support is unlikely to happen in the short term. You might be better off concentrating on the BTL, as George has done significant work on that front. Brian On Jun 11, 2009, at 12:20 PM, François Trahay wrote: The stack trace is from

Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads

2009-06-11 Thread George Bosilca
Based on the stack trace, at one point (depth 4) we are in the MX MTL and then we call free. It might happen that two threads call free simultaneously ... It is a guess, as there is not enough information to corroborate this. george. On Jun 11, 2009, at 13:17 , Scott Atchley wrote: Br

Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads

2009-06-11 Thread Scott Atchley
On Jun 11, 2009, at 2:20 PM, François Trahay wrote: The stack trace is from the MX MTL (I attach the backtraces I get with both MX MTL and MX BTL) Here is the program that I use. It is quite simple. It runs ping pongs concurrently (with one thread per node, then with two threads per node, e

Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads

2009-06-11 Thread François Trahay
Oops. Here's the trace using the BTL. Francois Scott Atchley wrote: By specifying --mca pml cm, both traces are using the MTL. To use the BTL, try: $ mpiexec --mca btl mx,sm,self -machinefile ./joe -np 2 ./concurrent_ping or simply: $ mpiexec -machinefile ./joe -np 2 ./concurrent_ping Scot

[OMPI users] Intermittent corruption

2009-06-11 Thread Nick Collier
Hi, I'm developing under OSX 10.5.7 with Open-MPI 1.3.2 and am running into intermittent corruption when sending and receiving a user-defined data type. When running with fewer than four processes (i.e. mpirun -np [2,3]), the data is fine; when running with 4 or more the received data is intermittent

Re: [OMPI users] Problem with OpenMPI (MX btl and mtl) and threads

2009-06-11 Thread George Bosilca
I will take a look at the BTL problem. Can you provide a copy of the benchmarks, please? Thanks, george. On Jun 11, 2009, at 16:05 , François Trahay wrote: concurrent_ping

[OMPI users] MPI-IO: reading an unformatted binary fortran file

2009-06-11 Thread Greg Fischer
Hello, I'm attempting to wrap my brain around the MPI I/O mechanisms, and I was hoping to find some guidance. I'm trying to read a file that contains a 117-character string, followed by a series of records that contain integers and reals. The following code would read it in serial: --- character(l

Re: [OMPI users] Intermittent corruption

2009-06-11 Thread George Bosilca
Did you try to follow the advice on the LAPACK mailing list, i.e. upgrade your compiler from the Mac OS X default (4.0.1) to 4.3.0? Btw, what is the test you're running? Can you create a small test case so I can try to reproduce it? Thanks, george. On Jun 11, 2009, at 17:02 , Nick Colli