On Wed, Aug 25, 2010 at 12:14 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
> Once you do that, try using just one of the networks by telling OMPI to use
> only one of the devices, something like this:
>
> mpirun --mca btl_tcp_if_include eth0 ...
Thanks Jeff! Just tried the exact test that you suggested:

[rpnabar@eu001 ~]$ NP=64; time mpirun -np $NP --host eu001,eu003,eu004,eu005,eu006,eu007,eu008,eu012 --mca btl_tcp_if_include eth0 -mca btl openib,sm,self /opt/src/mpitests/imb/src/IMB-MPI1 -npmin $NP gather

Still the same problem. The NP64 gather stalls at the 4096-byte size for about 7 minutes and then completes with a step-change increase in times. All the 10GigE interfaces are eth0 now and all are on the 192.168.x.x subnet. The 7-minute stall seems very reproducible each time around.

Once the test stalled I ran a padb stack trace from the master node. Posted here:

[rpnabar@eu001 root]$ /opt/sbin/bin/padb --all --stack-trace --tree --config-option rmgr=orte
http://dl.dropbox.com/u/118481/padb_Aug26_gather_NP64.txt

I also ran top to find the most CPU-intensive processes during the stall, and they all seem to be the IMB-MPI1 ones. Memory usage seems minimal. (Each node has 16 Gigs of RAM.)

http://dl.dropbox.com/u/118481/top_Aug26.txt

Interestingly, the NP56 test runs just great and finishes in less than a minute. It's only at NP64 that I hit this roadblock. On the other hand, even for the NP56 test there is almost a 10x degradation in transmit times going from a byte size of 4096 to 8192.

Any other debug options or suggestions are most welcome! (A few things I'm planning to try next are sketched after the benchmark output below.)

# /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 64 gather
# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#
# List of Benchmarks to run:
# Gather

#----------------------------------------------------------------
# Benchmarking Gather
# #processes = 64
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.03         0.02
            1         1000        84.25        84.55        84.40
            2         1000        84.16        84.45        84.31
            4         1000        84.48        84.78        84.64
            8         1000        84.58        84.92        84.77
           16         1000        86.51        86.79        86.66
           32         1000        88.60        88.93        88.78
           64         1000        90.88        91.22        91.06
          128         1000        92.44        92.76        92.60
          256         1000        95.79        96.14        95.98
          512         1000       104.90       105.25       105.07
         1024         1000       118.01       118.40       118.19
         2048         1000       154.42       154.94       154.67
         4096         1000       292.15       292.95       292.52
         8192           13      1436.77      1667.15      1581.73
        16384           13      1733.38      2004.77      1903.27
        32768           13      2082.55      2403.24      2282.68
        65536           13      3106.37      3546.15      3384.07
       131072           13      7812.54      9011.62      8572.76
       262144           13     10773.70     12358.30     11782.77
       524288           13     19377.23     22315.85     21238.98
      1048576           13     38661.61     44293.92     42280.09
      2097152           13    120665.00    140697.08    136576.54
      4194304           10    475155.12    567579.08    536037.92

# All processes entering MPI_Finalize

real    7m31.039s
user    58m58.321s
sys     0m21.633s

--------------------------------NP56 test------------------------------------------

#----------------------------------------------------------------
# Benchmarking Gather
# #processes = 56
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.02         0.09         0.03
            1         1000        74.23        74.53        74.35
            2         1000        73.87        74.15        74.02
            4         1000        73.59        73.86        73.72
            8         1000        74.15        74.40        74.27
           16         1000        76.18        76.45        76.30
           32         1000        77.82        78.10        77.95
           64         1000        79.85        80.16        80.00
          128         1000        81.67        82.01        81.84
          256         1000        86.07        86.41        86.27
          512         1000        94.91        95.23        95.07
         1024          843        33.45        35.13        34.38
         2048          843       218.82       241.49       230.18
         4096          843       130.76       131.62       131.17
         8192          843      1344.88      1348.68      1347.62
        16384          843      1915.72      1919.64      1918.58
        32768          843      2463.28      2469.58      2468.08
        65536          640      3395.59      3401.03      3398.49
       131072          320      6952.66      6981.24      6968.44
       262144          160     10137.25     10209.22     10174.13
       524288           80     16631.20     16921.68     16788.20
      1048576           40     35974.07     36980.07     36517.35
      2097152           20    167574.75    183295.25    177734.75
      4194304           10    321249.79    410697.10    367498.59
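
A few follow-up ideas I'm planning to try (these are only sketches; the MCA parameter names are what I believe Open MPI accepts, so corrections welcome).

First, bump the BTL verbosity so the startup output confirms which transport and interface actually carry the gather traffic:

  # btl_base_verbose should make OMPI log which BTLs it selects for each peer
  mpirun -np 64 --host eu001,eu003,eu004,eu005,eu006,eu007,eu008,eu012 \
      --mca btl openib,sm,self --mca btl_base_verbose 100 \
      /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 64 gather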
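Second, since my run above keeps openib in the BTL list, I suspect the btl_tcp_if_include setting isn't actually being exercised; to test eth0 the way Jeff suggested I'd force the TCP BTL only:

  # Restrict to the tcp/sm/self BTLs and pin the tcp BTL to eth0 (the 10GigE interfaces)
  mpirun -np 64 --host eu001,eu003,eu004,eu005,eu006,eu007,eu008,eu012 \
      --mca btl tcp,sm,self --mca btl_tcp_if_include eth0 \
      /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 64 gather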
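Third, on the 4096 --> 8192 step change: if I've done the arithmetic right, at 4096 bytes the root collects 64 x 4096 = 256 KB per gather, and at 8192 bytes that doubles to 512 KB, which is roughly where I'd expect an eager-to-rendezvous or collective-algorithm switch to kick in. Assuming the tuned collective component exposes the knobs I remember (again, just a guess on my part), I'd try pinning the gather algorithm to see whether the step change tracks an algorithm switch:

  # Assumed coll_tuned knobs: force one fixed gather algorithm instead of the
  # default size-based selection (values 1-3 should pick different algorithms)
  mpirun -np 64 --host eu001,eu003,eu004,eu005,eu006,eu007,eu008,eu012 \
      --mca btl openib,sm,self \
      --mca coll_tuned_use_dynamic_rules 1 \
      --mca coll_tuned_gather_algorithm 1 \
      /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 64 gather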