Thanks Ashley, I'll try your tool. I would also suspect an error in the programs I am trying to use, except that this is a problem with 2 different programs, written by 2 different groups. One of them might be bad, but both... seems unlikely.
Interestingly, the connectivity_c test that is included with OMPI works fine with -np <8. For -np >8 it works some of the time; other times it HANGS. I have got to believe that this is a big clue!! Also, when it hangs, sometimes I get the message "mpirun was unable to cleanly terminate the daemons on the nodes shown below" (note that NO nodes are shown below). Once, I got -np 250 to pass the connectivity test, but I was not able to replicate this reliably, so I'm not sure if it was a fluke or what.

Here is a link to a screenshot of top while connectivity_c is hung with -np 14; I see that 2 processes are only at 50% CPU usage... Hmmmm:

http://picasaweb.google.com/lh/photo/87zVEucBNFaQ0TieNVZtdw?authkey=Gv1sRgCLKokNOVqo7BYw&feat=directlink

The other tests, ring_c and hello_c, as well as the cxx versions of these guys, work with all values of -np.

Unfortunately, I could not get valgrind to work... (I have inlined a bare-bones test sketch below that I can use to keep poking at this.)

Thanks, Matt

On Dec 9, 2009, at 2:37 AM, Ashley Pittman wrote:

> On Tue, 2009-12-08 at 08:30 -0800, Matthew MacManes wrote:
>> There are 8 physical cores, or 16 with hyperthreading enabled.
>
> That should be meaty enough.
>
>> 1st of all, let me say that when I specify that -np is less than 4 processors (1, 2, or 3), both programs seem to work as expected. Also, the non-mpi version of each of them works fine.
>
> Presumably the non-mpi version is serial however? This doesn't mean the program is bug-free or that the parallel version isn't broken. There are any number of apps that don't work above N processes; in fact probably all programs break for some value of N, it's normally a little higher than 3 however.
>
>> Thus, I am pretty sure that this is a problem with MPI rather than with the program code or something else.
>>
>> What happens is simply that the program hangs..
>
> I presume you mean here the output stops? The program continues to use CPU cycles but no longer appears to make any progress?
>
> I'm of the opinion that this is most likely an error in your program; I would start by using either valgrind or padb.
>
> You can run the app under valgrind using the following mpirun options; this will give you four files named v.log.0 to v.log.3 which you can check for errors in the normal way. The "--mca btl tcp,self" option will disable shared memory, which can create false positives.
>
> mpirun -n 4 --mca btl tcp,self valgrind --log-file=v.log.%q{OMPI_COMM_WORLD_RANK} <app>
>
> Alternatively you can run the application, wait for it to hang, and then in another window run my tool, padb, which will show you the MPI message queues and stack traces, which should show you where it's hung; instructions and sample output are on this page.
>
> http://padb.pittman.org.uk/full-report.html
>
>> There are no error messages, and there is no clue from anything else (system working fine otherwise: no RAM issues, etc). It does not hang at the same place every time, sometimes in the very beginning, sometimes near the middle..
>>
>> Could this be an issue with hyperthreading? A conflict with something?
>
> Unlikely, if there was a problem in OMPI running more than 3 processes it would have been found by now. I regularly run 8 process applications on my dual-core netbook alongside all my desktop processes without issue; it runs fine, a little slowly but fine.
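For reference, this is roughly the pairwise send/recv pattern that connectivity_c exercises. It is a from-memory sketch, not the actual source shipped with OMPI, and the file name is just my own; I figure it makes a handy standalone thing to run under valgrind/padb, since it doesn't involve either of our applications.

/* connectivity_sketch.c - minimal pairwise connectivity check.
 * NOT the connectivity_c that ships with Open MPI; just a bare
 * send/recv pattern to see whether plain point-to-point messaging
 * misbehaves at larger -np without either application involved. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, i, j, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Every pair of ranks (i, j) exchanges one message: i sends then
     * receives, j receives then sends, so the pair cannot deadlock. */
    for (i = 0; i < size; i++) {
        for (j = i + 1; j < size; j++) {
            if (rank == i) {
                MPI_Send(&rank, 1, MPI_INT, j, 0, MPI_COMM_WORLD);
                MPI_Recv(&token, 1, MPI_INT, j, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == j) {
                MPI_Recv(&token, 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&rank, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
            }
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("all %d ranks connected\n", size);
    MPI_Finalize();
    return 0;
}

Compiled and run with something like:

mpicc connectivity_sketch.c -o connectivity_sketch
mpirun -np 14 ./connectivity_sketch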
> All this talk about binding and affinity won't help either; process binding is about squeezing the last 15% of performance out of a system and making performance reproducible. It has no bearing on correctness or scalability. If you're not running on a dedicated machine, which with firefox running I guess you aren't, then there would be a good case for leaving it off anyway.
>
> Ashley,
>
> --
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk

_________________________________
Matthew MacManes
PhD Candidate
University of California- Berkeley
Museum of Vertebrate Zoology
Phone: 510-495-5833
Lab Website: http://ib.berkeley.edu/labs/lacey
Personal Website: http://macmanes.com/