Hello all, I have finally solved the issue, or as it should be said, discovered my oversight. It's a mistake that will have me mad at myself for a while. I'm new to MPI, though, and not versed in the MPP communications of LS-DYNA at all, so it was an oversight easily made.
The key to fixing the entire situation was the test input file I was using. LS-DYNA accepts an input file that contains all of the data telling LS-DYNA what to do with the simulation, so I invoke MPI like this: mpirun -np 16 mppLSDYNA input=myfile.k. That's not related to the issue itself, but it is important to distinguish between the mpirun command line and the LS-DYNA application input.

For a test file I made up a simple collision simulation in LS-DYNA (~15 kB), because our typical jobs have very large input files (50-150 MB) with very long run times (often 7+ days). I wanted a simple analysis that would execute fast so I could see data from all parts of it and watch how OpenMPI behaved during the entire simulation... and that's exactly where the problem was. I have read in various places that MPI_Allreduce is LS-DYNA's heavy hitter in its MPI communications, which is why I hypothesize the following: the MPP communications of LS-DYNA do an MPI_Allreduce to coordinate on every, or very nearly every, iteration of the program. My test job ran so fast that it was completing 5000 iterations within a single second on a single core (I found this out very recently, minutes ago in fact, while testing mpirun with only two cores locally). That is where my network tie-ups were happening.

I started measuring the network throughput of the 16 core, 8 core, 4 core, and 2 core jobs and was shocked to see that 16 cores were capping my network out at 120 Mbits/sec. 8 cores also used 120 Mbits/sec, 4 cores used 75 Mbits/sec, and 2 cores used around 30-40 Mbits/sec. Needless to say, it finally clicked in my brain a few minutes ago: the communications were simply happening too often, not taking a long time. I had the right idea initially, because the usual worry is subroutines that take a long time, but very repetitive, iterative programs need to coordinate on a continuous and rapid basis.

Once I realized that, I started a 16 core job with one of our standard issue files. That job typically takes 100-120 hours and typically runs on 8 cores for that amount of time (SMP). When I started this OpenMPI job, LS-DYNA gave me an estimate of 43 hours! This earns OpenMPI some great respect; it's quite a powerful program once set up correctly. As a side note, the network throughput of this job was around 17 Mbits/sec.

All in all, easily fixed, just a few days of frustration. Thank you all again for all of your help. It was paramount in enabling me to discover the issue. Thanks again.

Best Regards,
Robert Walters
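P.S. For anyone who finds this thread later, here is a rough sketch in C of the effect I am describing. It is not LS-DYNA's actual code (I don't have the source), and the function and the numbers in it are made up, but it shows how an explicit solver that has to agree on a global time step every cycle turns a tiny, fast-running model into thousands of collectives per second:

#include <mpi.h>
#include <stdio.h>

/* Made-up stand-in for the per-rank element loop that finds the smallest
 * stable time step among the elements this rank owns. */
static double local_stable_dt(int rank)
{
    return 2.0e-7 * (1.0 + 0.01 * rank);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t = 0.0, t_end = 5.0e-3;   /* hypothetical termination time */
    long cycles = 0;

    while (t < t_end) {
        double dt_local = local_stable_dt(rank);
        double dt;
        /* Every rank blocks here, once per cycle.  When cycles finish in
         * microseconds, this collective fires thousands of times per second
         * of wall clock, no matter how little data it carries. */
        MPI_Allreduce(&dt_local, &dt, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);
        t += dt;
        cycles++;
    }

    if (rank == 0)
        printf("finished %ld cycles (one MPI_Allreduce per cycle)\n", cycles);

    MPI_Finalize();
    return 0;
}

On our large production models each cycle does far more computation for every one of those collectives, which I believe is why the real job only needed around 17 Mbits/sec.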
--- On Wed, 7/14/10, Eugene Loh <eugene....@oracle.com> wrote:

From: Eugene Loh <eugene....@oracle.com>
Subject: [OMPI users] LS-DYNA profiling [was: OpenMPI Hangs, No Error]
To: "Open MPI Users" <us...@open-mpi.org>
Date: Wednesday, July 14, 2010, 2:26 PM

I started today reading e-mail quickly and out of order, so I'm going back to an earlier message now, but still with the new Subject heading, which better reflects where you are in your progress. I'm extracting some questions from this thread, from bottom/old to top/new:

1) What tools to use? As others have commented, the issue may have less to do with the kernel or inefficient data movement and more to do with what's going on in your application -- that is, some processes are reaching synchronization points earlier than others. For application-level tools, there is an OMPI FAQ category at http://www.open-mpi.org/faq/?category=perftools and specifically a list of some tools you could "download for free". Anyhow, it sounds like you're getting traction with Sun Studio -- and personally I think that's a good choice. :^)

2) Spread processes over multiple nodes or colocate them on one node? It depends, but I agree with Ralph that it's not surprising if running on a single node is faster, presumably due to faster communication performance. (But I'd temper his statement that it would *always* be so.)

3) I agree with David and Ralph that you're probably spending more time waiting on messages than actually moving data. If you use Sun Studio with Sun ClusterTools (now Oracle Message Passing Toolkit), it will also break "MPI wait" time apart from "MPI work" time (time spent moving data), and you could see this rather clearly. But any of the MPI tracing tools will be able to show you other indications of this problem. Most graphically, you might look at an MPI timeline. E.g., you might see one or more processes stuck in a long MPI_Barrier call, with other processes lingering in computation before the last process enters the barrier and all processes are released from it. Or one process stuck in a long MPI_Recv, released when the message arrives, as indicated by a message line joining the sending and receiving processes.

4) I could imagine the distributed-memory version of LS-DYNA running faster than the shared-memory version since it has better data locality.

5) The I/O portion might be very, very significant. Maybe processes are waiting on other processes that are not computing but doing a lot of I/O. You say runs are faster when all processes are on the same node than when they're on different nodes. One experiment would be to compare all processes running on node A, versus all processes on node B, versus processes distributed over both A and B. This would be a way of discriminating between "all on one node is faster than distributed" and "one node is faster than another". Alternatively, if you look at the timeline and see one process stuck in a long MPI call, look at the process it's waiting on. Does the call stack of the laggard suggest what it is doing? Computation or I/O? You might need to know a little about LS-DYNA to tell.

6) Functions with "_" at the end are probably Fortran names, which should make sense for LS-DYNA (which I think is basically Fortran).

7) Regarding two sets of functions for "everything": I think this is no surprise. There are "inclusive" and "exclusive" times. If you look at "inclusive" times (time spent in a function and all its callees), the time for a function will be almost exactly equal to the time spent in a wrapper that calls that function. I think this is what you're seeing. If you were to look instead at "exclusive" times (time spent in a function, excluding time spent in its callees), the difference between a "real" function and a wrapper that calls it would be very clear. If the numbers are percentages, then it's clear they are "inclusive", since they add up to well over 100%.

8) Regarding your earlier problems using Studio: no, you shouldn't need to build OMPI specially for Studio use. (I'm interested in hearing more about the problems you encountered and how you resolved them. You can send them to me off line, since I suspect most of the mailing list would not be interested.)

I hope at least some of those comments are helpful. Good luck.
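To make point 3 above concrete, here is a minimal sketch in plain MPI C (no tracing tool required) of separating "wait" time from "work" time. The deliberately imbalanced fake_work function is invented purely for illustration: higher ranks compute longer, so lower ranks show a large barrier time relative to their compute time, which is exactly the pattern a timeline view would show as a process parked in MPI_Barrier.

#include <mpi.h>
#include <stdio.h>

/* Made-up, deliberately imbalanced "work": higher ranks compute longer,
 * so lower ranks end up waiting at the barrier. */
static double fake_work(int rank)
{
    double s = 0.0;
    long n = 20000000L * (rank + 1);
    for (long i = 0; i < n; i++)
        s += 1.0 / (double)(i + 1);
    return s;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    double result = fake_work(rank);
    double t_work = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);          /* "wait" time accumulates here */
    double t_wait = MPI_Wtime() - t0;

    printf("rank %d: work %.3f s, wait %.3f s (result %.1f)\n",
           rank, t_work, t_wait, result);

    MPI_Finalize();
    return 0;
}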
Robert Walters wrote:

I think I forgot to mention earlier that the application I am using is pre-compiled. It is a finite element software called LS-DYNA. It is not open source, and I likely cannot obtain the code it uses for MPP. The version I am using was specifically compiled, by the parent company, for OpenMPI 1.4.2 MPP operations.

I recently installed Sun Studio 12.1 to attempt to analyze the situation. It seems to work partially. It will record the various processes individually, which is cryptic. The function it fails on, though, is the MPI tracing. It errors with "no MPI tracing data file in experiment, MPI Timeline and MPI Charts will not be available". Sometime during the analysis (about 10,000 iterations in), VT_MAX_FLUSHES complains that there are too many I/O flushes and it's not happy. I've increased this number in the environment variable and killed the analysis before it had a chance to error, but still no MPI trace data is recorded. Not sure if you guys have heard of that happening or know any way to fix it... Did OpenMPI need to be configured/built for Sun Studio use?

I also noticed that, from the data I do get back, there are two sets of functions for everything. There is mpi_recv and then my_recv_, both with the same % utilization time. The mpi one comes from your program's library and the my_recv_ one comes from my program. Is that typical, or should the program I'm using be calling mpi_recv only? This data may be enough to help me see what's wrong, so I will pass it along. Keep in mind this is percent of total run time and not percent of MPI communication. I attached the information in a picture rather than attempting to format a nice table in this nasty e-mail application. I blacked out items that are related to LS-DYNA, but afterward I realized that every function with an _ at the end probably represents a command issuing from LS-DYNA. These are my big spenders. The processes I did not include are in the bottom 4%. The processes above these were the LS-DYNA applications at 100%. Like I mentioned earlier, there are two instances of every MPI command, and they carry the same percent usage. It's curious that this version, built for OpenMPI, uses different functions.

Just for a little more background info: OpenMPI is being launched from a local hard drive on each machine, but the LS-DYNA job files, and the related data output files, are on a mounted drive on that machine, where the mounted drive is located on a different machine also in the cluster. We were thinking that might be an issue, but it isn't writing enough data for me to think it would significantly decrease MPP performance.

I would like to make one last mention: OpenMPI running 8 cores on a single node, with all the communication, works flawlessly. It works much faster than the Shared Memory Parallel (SMP) version of LS-DYNA that we currently use scaled to 8 cores. LS-DYNA seems to be approximately 25% faster (don't quote me on that) with the OpenMPI installation than with the standard SMP, which is awesome. My point being that OpenMPI seems to be working fine, even with the screwy mounted drive. This leads me to continue to point at the network. Anyhow, let me know if anything seems weird in the OpenMPI communication subroutines. I don't have any numbers to lean on from experience.
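The mpi_recv / my_recv_ pairing described above can be reproduced with a few lines. In the sketch below (the wrapper name is invented; LS-DYNA's real wrapper is presumably Fortran, per point 6), my_recv does nothing except call MPI_Recv, so a profiler reporting inclusive time shows both entries at the same percentage, while exclusive time would show the wrapper at nearly zero. Run with at least two ranks, e.g. mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical application-level wrapper, in the spirit of my_recv_:
 * it does almost nothing itself and just forwards to MPI_Recv, so its
 * inclusive time is essentially identical to MPI_Recv's. */
static void my_recv(double *buf, int count, int src)
{
    MPI_Recv(buf, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {3.14, 0.0, 0.0, 0.0};
    if (rank == 0) {
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        my_recv(buf, 4, 0);   /* profiler: my_recv and MPI_Recv share the
                                 same inclusive time */
        printf("rank 1 received %g\n", buf[0]);
    }

    MPI_Finalize();
    return 0;
}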
From: David Zhang <solarbik...@gmail.com>
Subject: Re: [OMPI users] OpenMPI Hangs, No Error
To: "Open MPI Users" <us...@open-mpi.org>
Date: Tuesday, July 13, 2010, 9:42 PM

Like Ralph says, the slowdown may not be coming from the kernel, but rather from waiting for messages. What MPI send/recv commands are you using?

On Tue, Jul 13, 2010 at 11:53 AM, Ralph Castain <r...@open-mpi.org> wrote:

I'm afraid that having 2 cores on a single machine will always outperform having 1 core on each machine if any communication is involved. The most likely thing that is happening is that OMPI is polling while waiting for messages to arrive. You might look closer at your code to try to optimize it better so that number-crunching can get more attention.

On Jul 13, 2010, at 12:22 PM, Robert Walters wrote:

Following up. The sysadmin opened ports for machine-to-machine communication, and OpenMPI is running successfully with no errors in connectivity_c, hello_c, or ring_c. Since then, I have started to implement the MPP software (finite element analysis) that we have, and upon running a simple job with 1 core on machine1 and 1 core on machine2, I noticed it is considerably slower than a 2 core job on a single machine. A quick look at top shows me kernel usage is almost twice what CPU usage is! On a 16 core test job (8 cores per node, so 2 nodes total), OpenMPI was consuming ~65% of the CPU for kernel-related items rather than number-crunching related items... Granted, we are running on GigE, but this is a finite element code we are running with no heavy data transfer within it.

I'm looking into benchmarking tools, but my sysadmin is not very open to installing third-party software. Do you have any suggestions for "big name" or guaranteed-safe tools I can use to figure out what's causing the hold-up with all the kernel usage? I'm pretty sure it's network traffic, but I have no way of telling (as far as I know, because I'm not a Linux whiz) with the standard tools in RHEL.
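Ralph's point about polling can be seen with a deliberately delayed message, assuming Open MPI's default busy-wait progression. In the sketch below (the names and the delay are made up), rank 1 "blocks" in MPI_Recv while rank 0 sleeps before sending; top will show rank 1 near 100% CPU, and over TCP likely a noticeable share of it as system time, even though no data is moving yet. That is how time spent waiting for messages can masquerade as heavy CPU and kernel usage. Run with at least two ranks.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int token = 42;
    if (rank == 0) {
        sleep(10);   /* pretend rank 0 is busy with something else */
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* While this call "blocks", Open MPI's default progress engine keeps
         * polling for the message, so this rank shows up as busy in top even
         * though it is doing no useful work yet. */
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 finally received %d\n", token);
    }

    MPI_Finalize();
    return 0;
}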