vasilis wrote:
 The original issue, still reflected by the subject heading of this e-mail,
was that a message overran its receive buffer.  That was fixed by using
tags to distinguish different kinds of messages (res, jacob, row, and col).
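 Schematically, that fix looks something like the sketch below (the tag
values, sizes, and names are illustrative stand-ins, not your actual code):

    #include <mpi.h>

    #define N_RES   4       /* illustrative size of a residual block */
    #define TAG_RES 101     /* residual data */
    #define TAG_ROW 103     /* row metadata, a different size entirely */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank != 0) {
            double res_cpu[N_RES] = {0};
            int row = rank;
            /* each kind of message travels under its own tag */
            MPI_Send(&row, 1, MPI_INT, 0, TAG_ROW, MPI_COMM_WORLD);
            MPI_Send(res_cpu, N_RES, MPI_DOUBLE, 0, TAG_RES, MPI_COMM_WORLD);
        } else {
            for (int src = 1; src < size; ++src) {
                int row;
                double res_buf[N_RES];
                /* a receive posted with TAG_ROW can only match a TAG_ROW
                   message, so a large residual message can no longer land
                   in (and overrun) the small row buffer */
                MPI_Recv(&row, 1, MPI_INT, src, TAG_ROW, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Recv(res_buf, N_RES, MPI_DOUBLE, src, TAG_RES,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
        }
        MPI_Finalize();
        return 0;
    }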

 I thought the next problem was the small (10^-10) variations in results
when np>2.  In my mind, a plausible explanation for this is that you're
adding the "res_cpu" contributions from all the various processes to the
"res" array in some arbitrary order.  The contribution from rank 0 is added
in first, but all the others come in in some nondeterministic order.  Since
you're using finite-precision arithmetic, this can lead to tiny round-off
variations.

 If you want to get rid of those minor variations, you have to perform
floating-point arithmetic in a particular order.
    
Unfortunately it did not work. I replaced "MPI_ANY_SOURCE" with "JW", but
I did not see any difference.
  
First, I may not understand what the problem is.  I guess that at np>2 you see small (10^-10) variations in results.  Is that from run to run, or just np=4 compared to np=2?

10^-10 variations in floating-point results sound like floating-point roundoff.  To get bitwise fp reproducibility, you need to execute the arithmetic operations in a fixed order (among other conditions), because fp addition is not associative: (a+b)+c generally differs from a+(b+c) in the last bits.  That's why I suggested fixing the order of the res += sum(res_cpu) computation.  But fixing that order alone doesn't guarantee you've succeeded.  Decomposing the problem among multiple processes can itself change the results: e.g., if each process sums its own share before the per-process results are combined, then changing the number of processes regroups the partial sums and will affect bitwise reproducibility.
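
Concretely, a fixed-order accumulation might look like the sketch below.  This is only an illustration: N, TAG_RES, and the fake per-rank work are stand-ins for whatever your code actually does.

    #include <mpi.h>

    #define N       8       /* illustrative array length */
    #define TAG_RES 101

    int main(int argc, char **argv) {
        int rank, size;
        double res[N] = {0}, res_cpu[N];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        for (int i = 0; i < N; ++i)         /* stand-in for the real work */
            res_cpu[i] = (rank + 1) * 1e-10 * (i + 1);
        if (rank == 0) {
            for (int i = 0; i < N; ++i)
                res[i] = res_cpu[i];        /* rank 0's contribution first */
            for (int src = 1; src < size; ++src) {
                double tmp[N];
                /* explicit source instead of MPI_ANY_SOURCE, so arrival
                   order no longer dictates summation order */
                MPI_Recv(tmp, N, MPI_DOUBLE, src, TAG_RES,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                for (int i = 0; i < N; ++i)
                    res[i] += tmp[i];       /* always in rank order 1,2,... */
            }
            /* note: this fixes run-to-run order for a given np; it does
               not make np=2 and np=4 agree if each rank pre-sums its own
               slice of the problem */
        } else {
            MPI_Send(res_cpu, N, MPI_DOUBLE, 0, TAG_RES, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }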

Good luck.  You might have to come up with criteria for judging an answer "correct" or "incorrect" rather than simply comparing bit-by-bit to a previous fp result.
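
For instance, a tolerance test along these lines is a common choice (a sketch only; close_enough and the tolerance values here are made up for illustration):

    #include <math.h>
    #include <stdio.h>

    /* accept if every entry agrees to within a mixed absolute/relative
       tolerance, instead of demanding bit-for-bit equality */
    static int close_enough(const double *a, const double *b, int n,
                            double atol, double rtol) {
        for (int i = 0; i < n; ++i) {
            double diff  = fabs(a[i] - b[i]);
            double scale = fmax(fabs(a[i]), fabs(b[i]));
            if (diff > atol + rtol * scale)
                return 0;
        }
        return 1;
    }

    int main(void) {
        double ref[3] = {1.0, 2.0, 3.0};                  /* trusted run */
        double got[3] = {1.0 + 5e-11, 2.0, 3.0 - 8e-11};  /* new run */
        printf("%s\n", close_enough(ref, got, 3, 1e-9, 1e-12)
                           ? "acceptable" : "differs");
        return 0;
    }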
