You might want to run some profiling / timing to see what parts of the application start running slower over time.
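
As a rough illustration of the timing idea, here is a minimal sketch, not anything taken from your application. It times each task and averages the durations over consecutive blocks of tasks, so a late slowdown shows up as a rising per-block mean. The helper names `timed` and `window_means` are made up for this example, `sum(range(1000))` is only a stand-in for one real task, and Python is used just for brevity; in an MPI code, wrapping each task in a pair of MPI_Wtime() calls would play the same role.

```python
import time
from statistics import mean


def timed(task, *args):
    """Run task(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = task(*args)
    return result, time.perf_counter() - start


def window_means(durations, window):
    """Mean duration for each consecutive block of `window` tasks."""
    return [
        mean(durations[i:i + window])
        for i in range(0, len(durations), window)
    ]


if __name__ == "__main__":
    durations = []
    for _ in range(2000):
        # Stand-in workload; replace with one real task.
        _, elapsed = timed(sum, range(1000))
        durations.append(elapsed)

    # If every task really costs the same, these means should stay flat.
    for k, avg in enumerate(window_means(durations, 500)):
        first = k * 500
        print(f"tasks {first}-{first + 499}: {avg * 1e6:.1f} us/task")
```

If the per-block means stay flat while the overall completion rate drops, the time is probably going somewhere outside the task itself (waiting on messages, for example) rather than into the computation.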
Also check for memory leaks.

On Sep 22, 2011, at 5:44 PM, Tom Hilinski wrote:

> Hi, A job I am running slows down as it approaches the end. I'd
> appreciate any ideas you may have on possible cause or what else I can
> look at for diagnostic info.
>
> Environment:
> * Linux cluster, very recent version of Fedora.
> * openmpi 1.5
>
> Characteristics of job:
> * Tasks are all the same size and duration.
> * 56K tasks, but multiple tasks given to each process.
> * Typically run 120 processes.
> * Slowdown starts at ~52K completed, then rate of completion of each
>   task declines geometrically from ~1k/minute to 4/minute at 54K.
>
> Here are some queries done when the slowdown occurs:
>
> * "ps" on master node - most processes in suspend state:
> F S  UID   PID  PPID  C PRI NI ADDR     SZ WCHAN  TTY   TIME     CMD
> 0 S 3348 27933 15675  0  80  0 -     13608 poll_s pts/0 00:00:00 mpiexec
> 0 S 3348 28009 27933 14  80  0 -    227632 epoll_ pts/0 00:08:13 C5MPI
> 0 S 3348 28011 27933 14  80  0 -    227672 epoll_ pts/0 00:08:17 C5MPI
> 0 S 3348 28013 27933 13  80  0 -    227713 epoll_ pts/0 00:08:06 C5MPI
> 0 S 3348 28015 27933 13  80  0 -    227844 epoll_ pts/0 00:08:02 C5MPI
> 0 S 3348 28017 27933 14  80  0 -    227849 epoll_ pts/0 00:08:13 C5MPI
> 0 S 3348 28019 27933 13  80  0 -    227892 epoll_ pts/0 00:08:07 C5MPI
>
> * file handles (allocated handle count is ~constant):
> $ cat /proc/sys/fs/file-nr
> 3968 0 801014
>
> * Processes in a suspend or run state (varies):
> $ orte-top -pid 27933 | grep ' S |' | wc -l
> 124
> $ orte-top -pid 27933 | grep ' R |'
> Rank | Nodename  | Command | Pid   | State | Time | Pri | #threads | Vsize  | RSS   | Peak Vsize | Shr Size |
> 0    | rubel-001 | C5MPI   | 14700 | R     | 2.2H | 20  | 1        | 246208 | 12660 | 246208     | 17664    |
> 1    | rubel-001 | C5MPI   | 14702 | R     | 2.2H | 20  | 1        | 245360 | 44860 | 245360     | 17664    |
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/