Thanks Ashley, I'll try your tool.

I would also suspect an error in the programs I am trying to use, but this 
is a problem with two different programs, written by two different groups. 
One of them might be buggy, but both? That seems unlikely.

Interestingly, the connectivity_c test that is included with OMPI works fine 
with -np <8. For -np >8 it works some of the time; other times it HANGS. I 
have got to believe that this is a big clue! Also, when it hangs, I sometimes 
get the message "mpirun was unable to cleanly terminate the daemons on the 
nodes shown below", yet NO nodes are shown below. Once, I got -np 250 to pass 
the connectivity test, but I was not able to replicate this reliably, so I'm 
not sure if it was a fluke or what. Here is a link to a screenshot of top 
when connectivity_c is hung with -np 14; I see that 2 of the processes are at 
only 50% CPU usage. Hmmmm.

http://picasaweb.google.com/lh/photo/87zVEucBNFaQ0TieNVZtdw?authkey=Gv1sRgCLKokNOVqo7BYw&feat=directlink
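
For the next hang I'll try to grab stack traces from those two stuck 
processes by attaching gdb to them; a minimal sketch (the PID is just a 
placeholder for whatever top reports):

gdb -p <pid>
(gdb) thread apply all bt
(gdb) detach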

The other tests, ring_c and hello_c, as well as the C++ versions of these, 
work with all values of -np.
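
For reference, I'm invoking the examples along these lines (the binaries are 
the stock ones from the OMPI examples directory; the -np value varies):

mpirun -np 16 ./hello_c
mpirun -np 16 ./ring_c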

Unfortunately, I could not get valgrind to work.

Thanks, Matt



On Dec 9, 2009, at 2:37 AM, Ashley Pittman wrote:

> On Tue, 2009-12-08 at 08:30 -0800, Matthew MacManes wrote:
>> There are 8 physical cores, or 16 with hyperthreading enabled. 
> 
> That should be meaty enough.
> 
>> First of all, let me say that when I specify -np less than 4 (1, 2, or
>> 3 processes), both programs seem to work as expected. Also, the non-MPI
>> version of each of them works fine.
> 
> Presumably the non-MPI version is serial, however? That doesn't mean
> the program is bug-free or that the parallel version isn't broken.
> There are any number of apps that don't work above N processes; in fact
> probably all programs break for some value of N, it's normally a little
> higher than 3, however.
> 
>> Thus, I am pretty sure that this is a problem with MPI rather than
>> with the program code or something else.
>> 
>> What happens is simply that the program hangs..
> 
> I presume you mean here the output stops?  The program continues to use
> CPU cycles but no longer appears to make any progress?
> 
> I'm of the opinion that this is most likely an error in your program; I
> would start by using either valgrind or padb.
> 
> You can run the app under valgrind using the following mpirun options,
> this will give you four files named v.log.0 to v.log.3 which you can
> check for errors in the normal way.  The "--mca btl tcp,self" option
> will disable shared memory which can create false positives.
> 
> mpirun -n 4 --mca btl tcp,self valgrind \
>     --log-file=v.log.%q{OMPI_COMM_WORLD_RANK} <app>
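> 
> Once the job has finished (or you've killed it) something along these
> lines will pull out the per-rank error counts, since valgrind prints an
> "ERROR SUMMARY" line in each log:
> 
> grep "ERROR SUMMARY" v.log.*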
> 
> Alternatively you can run the application, wait for it to hang and then
> in another window run my tool, padb, which will show you the MPI message
> queues and stack traces that should show you where it's hung;
> instructions and sample output are on this page:
> 
> http://padb.pittman.org.uk/full-report.html
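> 
> For example, once you know the mpirun job id, something like the
> following should produce the full report (the exact option syntax is on
> the page above; the job id here is a placeholder):
> 
> padb --full-report=<jobid>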
> 
>> There are no error messages, and there is no clue from anything else
>> (the system is working fine otherwise; no RAM issues, etc.). It does not
>> hang at the same place every time: sometimes in the very beginning,
>> sometimes near the middle.
>> 
>> Could this an issue with hyperthreading? A conflict with something?
> 
> Unlikely; if there were a problem in OMPI running more than 3 processes
> it would have been found by now. I regularly run 8-process applications
> on my dual-core netbook alongside all my desktop processes without
> issue; it runs fine, a little slowly, but fine.
> 
> All this talk about binding and affinity won't help either; process
> binding is about squeezing the last 15% of performance out of a system
> and making performance reproducible. It has no bearing on correctness or
> scalability. If you're not running on a dedicated machine, which with
> Firefox running I guess you aren't, then there would be a good case for
> leaving it off anyway.
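> 
> (For what it's worth, binding in Open MPI of this vintage is toggled via
> MCA parameters; a minimal sketch, with <app> as a placeholder for your
> binary, would be something like:
> 
> mpirun -np 8 --mca mpi_paffinity_alone 1 <app>
> 
> but as I say, the default of leaving it off is fine for correctness.)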
> 
> Ashley,
> 
> -- 
> 
> Ashley Pittman, Bath, UK.
> 
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
> 

_________________________________
Matthew MacManes
PhD Candidate
University of California- Berkeley
Museum of Vertebrate Zoology
Phone: 510-495-5833
Lab Website: http://ib.berkeley.edu/labs/lacey
Personal Website: http://macmanes.com/




