vasilis wrote:
First, I may not understand what the problem is. I guess at np>2, you see small (10^-10) variations in results. Is that from run-to-run or just from np=4 compared to np=2?

The original issue, still reflected by the subject heading of this e-mail, was that a message overran its receive buffer. That was fixed by using tags to distinguish different kinds of messages (res, jacob, row, and col).

I thought the next problem was the small (10^-10) variations in results when np>2. In my mind, a plausible explanation for this is that you're adding the "res_cpu" contributions from all the various processes to the "res" array in some arbitrary order. The contribution from rank 0 is added in first, but all the others come in in some nondeterministic order. Since you're using finite-precision arithmetic, this can lead to tiny round-off variations. If you want to get rid of those minor variations, you have to perform floating-point arithmetic in a particular order.

Unfortunately it did not work. I replaced the "MPI_ANY_SOURCE" with "JW" but I did not see any difference.

10^-10 variations in floating-point results sounds like floating-point roundoff. To get bitwise fp reproducibility, you need to execute arithmetic operations in a fixed order (among other conditions). That's why I suggested fixing the order of the res += sum(res_cpu) computation. But that doesn't guarantee you've succeeded. If you decompose the problem among multiple processes, that could change results. E.g., if you do sums within each process before summing results from different processes, then the number of processes will impact your fp bitwise reproducibility.

Good luck. You might have to come up with criteria for judging an answer "correct" or "incorrect" rather than simply comparing bit-by-bit to a previous fp result.
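To make the round-off argument concrete, here is a minimal sketch in plain Python (no MPI involved). It is only an illustration under stated assumptions: the contribution values, the dictionary standing in for "res_cpu", and the helper names accumulate and accumulate_by_rank are all hypothetical and not taken from the code discussed in this thread. It shows that adding the same contributions in two different arrival orders can give results that differ in the last bits, that adding them in a fixed rank order removes that variation, and that a tolerance-based comparison accepts both answers.

```python
import math

# Hypothetical per-rank contributions to a single entry of "res".
# The values are invented so that the rounding effect is visible.
res_cpu = {0: 1e-16, 1: 1.0, 2: 1e-16}

def accumulate(arrival_order):
    """Add contributions to a running total in the order they arrive."""
    res = 0.0
    for rank in arrival_order:
        res += res_cpu[rank]
    return res

def accumulate_by_rank(arrival_order):
    """Collect all contributions first, then add them in ascending rank order."""
    res = 0.0
    for rank in sorted(arrival_order):
        res += res_cpu[rank]
    return res

# Rank 0 is added first in both runs; ranks 1 and 2 arrive in different orders.
run_a = accumulate([0, 1, 2])
run_b = accumulate([0, 2, 1])

print(run_a == run_b)  # False: the two sums differ in the last bit
print(accumulate_by_rank([0, 1, 2]) == accumulate_by_rank([0, 2, 1]))  # True
print(math.isclose(run_a, run_b, rel_tol=1e-9))  # True: equal within tolerance
```

The fixed-order version only helps for a given decomposition. If each process first forms a partial sum of its own data, changing the number of processes changes the grouping of the additions, so a tolerance check like the last line is usually the more practical criterion.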
- Re: [OMPI users] "An error occurred in MPI_Recv" ... vasilis