I'm seeing some sporadic strange behavior in one of our MPI codes. Here are selected portions of the output:
-----------------------------------------------------------------------
|   |   |im |jm |km |  phi0   |         | iter | sync |mcalc |        |
|grp|itn|loc|loc|loc|Max Error|   NSR   |t(sec)|t(sec)|t(sec)| sysbal |
-----------------------------------------------------------------------
   1   2   1   1   9 1.000E+00 1.000E+00 16.789 15.923  0.079 1.00E+00
   1   3   1   1   5 1.000E+00 1.000E+00 16.800 15.935  0.078 1.00E+00
   1   4   1   1   1 1.000E+00 1.000E+00 17.500 15.906  0.079 1.00E+00
...
  11   7  18 118  84 1.485E-01 1.117E+00 16.600 15.929  0.077 1.00E+00
  11   8  20 124  84 1.516E-01 1.021E+00 16.600 15.929  0.077 1.00E+00
  11   9  21 127  86 1.596E-01 1.053E+00  1.253  0.450  0.083 1.00E+00
  11  10   7 131  88 1.290E-01 8.083E-01  0.808  0.014  0.272 1.00E+00
  11  11   7 131  85 8.267E-02 6.408E-01  1.000  0.002  0.262 1.00E+00
...
 101  10  25 111  77 5.690E-02 8.179E-01  0.480  0.023  0.087 1.00E+00
 101  11  32 113  77 4.782E-02 8.404E-01  0.479  0.023  0.087 1.00E+00
 101  12  37 116  79 4.330E-02 9.055E-01  0.479  0.023  0.087 1.00E+00

This is an iterative calculation. The critical quantity of interest is "iter t(sec)", which is the time per iteration. (The other "t(sec)" quantities are subsets of "iter t(sec)".) Between "grp" 1 and 111, the calculation is not becoming appreciably more or less difficult, yet there is a factor of ~30 difference in performance between the beginning and the end. In the slow iterations, nearly all of the time is in the "sync t(sec)" column.

This problem does not appear all of the time. In many cases, performance is good throughout the entire calculation. ("Good" here means what is seen in grp 101 above, which is roughly what I expect to be seeing.) However, when the problem does appear, it seems to mysteriously go away after grinding through the calculation for a while.

Has anyone ever seen behavior like this? Any thoughts as to what could be causing it?

I tried to recompile the code with mpif90-vt and mpicc-vt, in hopes that the VampirTrace output might shed some light on the true nature of the problem.
After recompiling, the code complains:

[lx102:15254] *** An error occurred in MPI_Cart_create
[lx102:15254] *** on communicator MPI_COMM_WORLD
[lx102:15254] *** MPI_ERR_ARG: invalid argument of some other kind
[lx102:15254] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

...and then crashes out before doing anything useful.

My understanding is that I only need to use the -vt compiler wrappers, and it will automatically "instrument" my code. Is there something else I should be doing?

Thanks,
Greg
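
P.S. In case it helps anyone check the arithmetic behind the "factor of ~30" remark, here is a minimal Python sketch. It is not part of our MPI code: the rows are copied by hand from the output quoted above, and the helper name sync_fraction is made up for illustration. It only shows how much of "iter t(sec)" is accounted for by "sync t(sec)", and how the first quoted iteration compares with the last.

```python
# Rows copied by hand from the quoted output: (grp, itn, iter_t, sync_t, mcalc_t).
# Times are in seconds; only a few representative rows are included.
rows = [
    (1, 2, 16.789, 15.923, 0.079),    # slow iteration at the start
    (11, 8, 16.600, 15.929, 0.077),   # still slow
    (11, 9, 1.253, 0.450, 0.083),     # first iteration after the speedup
    (101, 12, 0.479, 0.023, 0.087),   # "good" performance
]


def sync_fraction(iter_t, sync_t):
    """Fraction of the per-iteration time spent in the sync phase."""
    return sync_t / iter_t


for grp, itn, iter_t, sync_t, mcalc_t in rows:
    pct = 100.0 * sync_fraction(iter_t, sync_t)
    print(f"grp {grp:3d} itn {itn:2d}: iter {iter_t:7.3f} s, sync {pct:5.1f}% of iter")

# Ratio of the first quoted iteration time to the last quoted one.
slowdown = rows[0][2] / rows[-1][2]
print(f"first vs. last quoted iteration: {slowdown:.1f}x slower")
```

For the rows shown, the sync phase accounts for roughly 95% of each slow iteration and only about 5% of the fast ones, while the mcalc time barely changes.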