I've seen this behaviour with MUMPS on shared-memory machines as well as
with MPI. I use the iterative refinement capability to sharpen the
last few digits of the solution (2 or 3 iterations is usually enough).
If you're not using it, give it a try; it will probably reduce the
noise in your results. The quality of the answer from a
direct solve is highly dependent on the matrix scaling and pivot order,
and it's easy to get differences in the last few digits. MUMPS itself
is also asynchronous, and it might not be completely deterministic in how
it solves if the MPI processes can run in a different order.
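For reference, here is a rough sketch of switching that on through the MUMPS C
interface. The 2x2 placeholder system and the exact struct fields are from memory
rather than from anyone's actual code, so check them against your MUMPS version;
the relevant control is ICNTL(10), which bounds the number of refinement steps.

#include <stdio.h>
#include <mpi.h>
#include "dmumps_c.h"

#define JOB_INIT  -1
#define JOB_END   -2
#define USE_COMM_WORLD -987654          /* MUMPS alias for MPI_COMM_WORLD    */
#define ICNTL(i) icntl[(i)-1]           /* MUMPS control indices are 1-based */

int main(int argc, char **argv)
{
  DMUMPS_STRUC_C id;
  /* placeholder 2x2 system:  diag(1,2) x = (1,4)  ->  x = (1,2) */
  int    irn[] = {1, 2};
  int    jcn[] = {1, 2};
  double a[]   = {1.0, 2.0};
  double rhs[] = {1.0, 4.0};
  int    myid;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

  id.comm_fortran = USE_COMM_WORLD;
  id.par = 1;                           /* host takes part in the factorisation */
  id.sym = 0;                           /* unsymmetric matrix                   */
  id.job = JOB_INIT;
  dmumps_c(&id);

  if (myid == 0) {                      /* matrix and RHS are supplied on the host */
    id.n = 2;  id.nz = 2;
    id.irn = irn;  id.jcn = jcn;  id.a = a;
    id.rhs = rhs;
  }

  id.ICNTL(10) = 3;                     /* allow up to 3 steps of iterative refinement */

  id.job = 6;                           /* analysis + factorisation + solve            */
  dmumps_c(&id);

  if (myid == 0)
    printf("solution: %g %g\n", rhs[0], rhs[1]);

  id.job = JOB_END;                     /* release MUMPS internal data */
  dmumps_c(&id);
  MPI_Finalize();
  return 0;
}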
Damien
George Bosilca wrote:
This is a problem of numerical stability, and there is no solution for
such a problem in MPI. Usually, preconditioning the input matrix
improves the numerical stability.
If you read the MPI standard, there is a __short__ section about what
guarantees the MPI collective communications provide. There is only
one: if you run the same collective twice, on the same set of nodes
with the same input data, you will get the same output. In fact, the
main problem is that MPI considers all default operations (MPI_OP) as
being commutative and associative, which is usually the case in the real
world but not when floating-point rounding is involved. When you
increase the number of nodes, the data will be spread into smaller
pieces, which means more operations will have to be done in order to
achieve the reduction, i.e. more rounding errors might occur, and so on.
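For example, a tiny MPI-free program with made-up values already shows what a
different summation order can do to the result:

#include <stdio.h>

int main(void)
{
  double x = 1.0e16, y = -1.0e16, z = 1.0;

  double left  = (x + y) + z;   /* = 1.0 : the cancellation happens first         */
  double right = x + (y + z);   /* = 0.0 : the 1.0 is absorbed by the large value */

  printf("(x + y) + z = %.17g\n", left);
  printf("x + (y + z) = %.17g\n", right);
  return 0;
}

A reduction spread over more nodes effectively changes which of these groupings
gets used, which is why the last digits can move around.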
Thanks,
george.
On May 27, 2009, at 11:16, vasilis wrote:
Rank 0 accumulates all the res_cpu values into a single array, res. It
starts with its own res_cpu and then adds the contributions from all the
other processes. When np=2, that means the order is prescribed. When np>2,
the order is no longer prescribed and some floating-point rounding
variations can start to occur.
Yes, you are right. Now, the question is why these floating-point rounding
variations would occur for np>2. It cannot just be due to the order not
being prescribed!!
If you want results to be more deterministic, you need to fix the order
in which res is aggregated. E.g., instead of using MPI_ANY_SOURCE, loop
over the peer processes in a specific order.
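In code, that fixed-order aggregation might look roughly like this (the
function and variable names are only illustrative, not taken from your program):

#include <mpi.h>

/* On rank 0: receive every peer's contribution in a fixed order and
   accumulate it, so the sequence of additions is identical on every run. */
void gather_res_in_order(double *res, double *tmp, int n, MPI_Comm comm)
{
  int np, src, i;
  MPI_Comm_size(comm, &np);
  for (src = 1; src < np; src++) {              /* fixed order: 1, 2, ..., np-1        */
    MPI_Recv(tmp, n, MPI_DOUBLE, src, 0, comm,  /* explicit source, not MPI_ANY_SOURCE */
             MPI_STATUS_IGNORE);
    for (i = 0; i < n; i++)
      res[i] += tmp[i];
  }
}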
P.S. It seems to me that you could use MPI collective operations to
implement what you're doing. E.g., something like:
I could use these operations for the res variable (will it make the
summation any faster?). But I cannot use them for the other 3 variables.
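For the res part, the collective version would be something along these lines
(again with illustrative names):

#include <mpi.h>

/* Every rank passes its local contribution; rank 0 receives the
   element-wise sum of all res_cpu arrays in res. */
void reduce_res(double *res_cpu, double *res, int n, MPI_Comm comm)
{
  MPI_Reduce(res_cpu, res, n, MPI_DOUBLE, MPI_SUM, 0, comm);
}

MPI_Reduce is usually at least as fast as hand-rolled sends and receives, but
note that, as George says above, it does not promise the same rounding when
the number of processes changes.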
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users