I tried a couple of things, including your suggestion. I also found out this has been reported before (http://www.open-mpi.org/community/lists/users/2007/03/2904.php), but there seems to be no clear solution so far.
Here is what I observe. I keep the problem size fixed at 24 processes, and I use two nodes, with either 8 cores each or 2 cores each.

1. When oversubscribed (12 processes per processor), sys time vs. user time is much higher than when less subscribed (1.5 processes per processor). The wall clock time does not improve much :-(

2. I tried the following options, individually and together, with no difference:

   mpirun --mca mpi_yield_when_idle 1 --mca btl tcp,sm,self --mca coll_hierarch_priority 100 ...

3. The older Open MPI version (1.3) seems to be better than the newer one (1.3.2), but not significantly.

By the way, I am working on Amazon EC2 (VM hosts). Will that make any difference?

Please advise. Thanks.

On Fri, Jun 26, 2009 at 11:28 PM, Ralph Castain <r...@open-mpi.org> wrote:
> If you are running fewer processes on your nodes than they have processors,
> then you can improve performance by adding
>
>   -mca mpi_paffinity_alone 1
>
> to your cmd line. This will bind your processes to individual cores, which
> helps with latency. If your program involves collectives, then you can try
> setting
>
>   -mca coll_hierarch_priority 100
>
> This will activate the hierarchical collectives, which utilize shared
> memory for messages between procs on the same node.
>
> Ralph
>
> On Jun 26, 2009, at 9:09 PM, Qiming He wrote:
>
>> Hi all,
>>
>> I am new to Open MPI and have an urgent run-time question. I have
>> openmpi-1.3.2 compiled with the Intel Fortran compiler v11, simply by
>>
>>   ./configure --prefix=<my-dir> F77=ifort FC=ifort
>>
>> I then set my LD_LIBRARY_PATH to include <openmpi-lib> and <intel-lib>
>> and compile my Fortran program properly. No compilation errors.
>>
>> My program runs fine on a single node. However, when I run
>> it on multiple nodes with
>>   mpirun -np <num> --hostfile <my-hosts> <my-program>
>>
>> the performance is much worse than on a single node with the same problem
>> size (MPICH2 shows a 50% improvement).
>>
>> Using top and saidar, I find that user time (CPU user) is much lower than
>> system time (CPU system), i.e., only a small portion of CPU time is used
>> by the user application, while the rest is busy with system work.
>> No wonder I get bad performance. I am assuming "CPU system" is spent on
>> MPI communication, yet the total traffic (on eth0) is not that big
>> (~5 Mb/sec). What is CPU system busy with?
>>
>> Can anyone help? Is there anything I need to tune?
>>
>> Thanks in advance
>>
>> -Qiming
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
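[Editor's note: for readers following along, the command lines discussed in this thread can be sketched as below. This is a non-authoritative summary under the thread's own placeholders (`<num>`, `<my-hosts>`, `<my-program>` are stand-ins, not real values); the MCA parameters are the ones named by the posters, applicable to Open MPI 1.3.x.]

```shell
# Baseline multi-node run from the original post:
mpirun -np <num> --hostfile <my-hosts> <my-program>

# Options Qiming tried (individually and together) against the
# high-sys-time problem: yield the CPU when idle, restrict byte-transfer
# layers to TCP + shared memory + self, enable hierarchical collectives.
mpirun -np <num> --hostfile <my-hosts> \
    --mca mpi_yield_when_idle 1 \
    --mca btl tcp,sm,self \
    --mca coll_hierarch_priority 100 \
    <my-program>

# Ralph's suggestion for the under-subscribed case (fewer processes than
# cores): bind each process to a core to reduce latency.
mpirun -np <num> --hostfile <my-hosts> \
    --mca mpi_paffinity_alone 1 \
    <my-program>
```

Note that `mpi_yield_when_idle` mainly helps when nodes are oversubscribed; `mpi_paffinity_alone` is only appropriate when they are not.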