I'm seeing some sporadic strange behavior in one of our MPI codes.  Here are
selected portions of the output:

-----------------------------------------------------------------------
|   |   |im |jm |km |  phi0   |         | iter | sync |mcalc |        |
|grp|itn|loc|loc|loc|Max Error|   NSR   |t(sec)|t(sec)|t(sec)| sysbal |
-----------------------------------------------------------------------
   1   2   1   1   9 1.000E+00 1.000E+00 16.789 15.923  0.079 1.00E+00
   1   3   1   1   5 1.000E+00 1.000E+00 16.800 15.935  0.078 1.00E+00
   1   4   1   1   1 1.000E+00 1.000E+00 17.500 15.906  0.079 1.00E+00
...
  11   7  18 118  84 1.485E-01 1.117E+00 16.600 15.929  0.077 1.00E+00
  11   8  20 124  84 1.516E-01 1.021E+00 16.600 15.929  0.077 1.00E+00
  11   9  21 127  86 1.596E-01 1.053E+00  1.253  0.450  0.083 1.00E+00
  11  10   7 131  88 1.290E-01 8.083E-01  0.808  0.014  0.272 1.00E+00
  11  11   7 131  85 8.267E-02 6.408E-01  1.000  0.002  0.262 1.00E+00
...
 101  10  25 111  77 5.690E-02 8.179E-01  0.480  0.023  0.087 1.00E+00
 101  11  32 113  77 4.782E-02 8.404E-01  0.479  0.023  0.087 1.00E+00
 101  12  37 116  79 4.330E-02 9.055E-01  0.479  0.023  0.087 1.00E+00

This is an iterative calculation.  The critical quantity of interest is
"iter t(sec)", which is the time per iteration.  (The other "t(sec)"
quantities are subsets of "iter t(sec)".)  Between "grp" 1 and 101, the
calculation is not becoming appreciably more or less difficult, yet there is
a factor of ~30 difference in performance between the beginning and the
end.  This problem does not appear all of the time.  In many cases,
performance is good throughout the entire calculation.  ("Good", here, is
being defined as what is seen in grp 101 above, which is roughly what I
expect to be seeing.)  However, when the problem does appear, it seems to
mysteriously go away after grinding through the calculation for a while.

Has anyone ever seen behavior like this?  Any thoughts as to what could be
causing it?

I tried recompiling the code with mpif90-vt and mpicc-vt, in hopes that the
VampirTrace output might shed some light on the true nature of the
problem.  After recompiling, the code complains:

[lx102:15254] *** An error occurred in MPI_Cart_create
[lx102:15254] *** on communicator MPI_COMM_WORLD
[lx102:15254] *** MPI_ERR_ARG: invalid argument of some other kind
[lx102:15254] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

...and then crashes before doing anything useful.  My understanding is
that I only need to use the -vt compiler wrappers, and they will automatically
"instrument" my code.  Is there something else I should be doing?

Thanks
Greg
