On Wednesday 27 of May 2009 7:47:06 pm Damien Hocking wrote:
> I've seen this behaviour with MUMPS on shared-memory machines as well
> using MPI. I use the iterative refinement capability to sharpen the
> last few digits of the solution (2 or 3 iterations is usually enough).
> If you're not using that, give it a try, it will probably reduce the
> noise you're getting in your results. The quality of the answer from a
> direct solve is highly dependent on the matrix scaling and pivot order
> and it's easy to get differences in the last few digits. MUMPS itself
> is also asynchronous, and might not be completely deterministic in how
> it solves if MPI processes can run in a different order.
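For reference, the number of iterative refinement steps in MUMPS is
controlled through the parameter ICNTL(10). Below is a rough, untested C
sketch of how that is typically set, modeled on the small c_example.c that
ships with MUMPS; the 2x2 system and all of the values in it are
placeholders, not the problem discussed in this thread.

#include <stdio.h>
#include "mpi.h"
#include "dmumps_c.h"

#define JOB_INIT  -1
#define JOB_END   -2
#define USE_COMM_WORLD -987654
#define ICNTL(I) icntl[(I)-1]   /* so indices match the MUMPS documentation */

int main(int argc, char **argv)
{
  DMUMPS_STRUC_C id;
  int n = 2, nz = 2;
  int irn[] = {1, 2}, jcn[] = {1, 2};   /* 1-based coordinate format */
  double a[]   = {1.0, 2.0};            /* toy 2x2 diagonal matrix */
  double rhs[] = {1.0, 4.0};            /* right-hand side, overwritten by the solution */
  int myid;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

  /* Initialise a MUMPS instance on MPI_COMM_WORLD. */
  id.job = JOB_INIT; id.par = 1; id.sym = 0;
  id.comm_fortran = USE_COMM_WORLD;
  dmumps_c(&id);

  /* Centralised input: the host (rank 0) provides the matrix and rhs. */
  if (myid == 0) {
    id.n = n; id.nz = nz;
    id.irn = irn; id.jcn = jcn;
    id.a = a; id.rhs = rhs;
  }

  id.ICNTL(10) = 5;   /* allow up to 5 steps of iterative refinement */

  id.job = 6;         /* analysis + factorisation + solve in one call */
  dmumps_c(&id);

  if (myid == 0)
    printf("solution: %8.2f %8.2f\n", rhs[0], rhs[1]);

  id.job = JOB_END;   /* release the instance */
  dmumps_c(&id);
  MPI_Finalize();
  return 0;
}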
I set the maximum number of refinement steps to 5. It did change the
solution, but it is still not the same as the one I get when I run with
2 CPUs.

> Damien
>
> George Bosilca wrote:
> > This is a problem of numerical stability, and there is no solution for
> > such a problem in MPI. Usually, preconditioning the input matrix
> > improves the numerical stability.
> >
> > If you read the MPI standard, there is a __short__ section about what
> > guarantees the MPI collective communications provide. There is only
> > one: if you run the same collective twice, on the same set of nodes
> > with the same input data, you will get the same output. In fact the
> > main problem is that MPI considers all default operations (MPI_OP) as
> > being commutative and associative, which is usually the case in the
> > real world but not when floating-point rounding is around. When you
> > increase the number of nodes, the data will be split into smaller
> > pieces, which means more operations will have to be done in order to
> > achieve the reduction, i.e. more rounding errors might occur, and so on.
> >
> > Thanks,
> > george.
> >
> > On May 27, 2009, at 11:16, vasilis wrote:
> >>> Rank 0 accumulates all the res_cpu values into a single array, res. It
> >>> starts with its own res_cpu and then adds all other processes. When
> >>> np=2, that means the order is prescribed. When np>2, the order is no
> >>> longer prescribed and some floating-point rounding variations can
> >>> start to occur.
> >>
> >> Yes, you are right. Now, the question is why these floating-point
> >> rounding variations would occur for np>2. It cannot be due to an
> >> unprescribed order!!
> >>
> >>> If you want results to be more deterministic, you need to fix the
> >>> order in which res is aggregated. E.g., instead of using
> >>> MPI_ANY_SOURCE, loop over the peer processes in a specific order.
> >>>
> >>> P.S. It seems to me that you could use MPI collective operations to
> >>> implement what you're doing. E.g., something like:
> >>
> >> I could use these operations for the res variable (will it make the
> >> summation any faster?). But I cannot use them for the other 3 variables.
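As a follow-up to the aggregation discussion quoted above, here is a rough
sketch of receiving from the peers in a fixed rank order instead of
MPI_ANY_SOURCE, so that rank 0 always adds the partial results in the same
sequence. The names res, res_cpu and n_res are stand-ins for the variables
mentioned in the thread, not the actual code.

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int rank, np;
  const int n_res = 1000;                  /* assumed array length */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  double *res_cpu = calloc(n_res, sizeof(double));  /* this rank's partial sums */
  double *res     = calloc(n_res, sizeof(double));  /* global sums on rank 0 */

  /* ... each rank fills res_cpu here ... */

  if (rank == 0) {
    for (int j = 0; j < n_res; j++)
      res[j] = res_cpu[j];                 /* start with rank 0's own contribution */
    double *buf = malloc(n_res * sizeof(double));
    for (int src = 1; src < np; src++) {   /* fixed order: 1, 2, ..., np-1 */
      MPI_Recv(buf, n_res, MPI_DOUBLE, src, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      for (int j = 0; j < n_res; j++)
        res[j] += buf[j];
    }
    free(buf);
  } else {
    MPI_Send(res_cpu, n_res, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
  }

  free(res_cpu);
  free(res);
  MPI_Finalize();
  return 0;
}

MPI_Reduce(res_cpu, res, n_res, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD) would
be shorter and probably faster, but the standard only promises the same
result when the same collective is repeated on the same processes with the
same data; it does not fix the combining order. The explicit loop above pins
that order down for a given np, which removes the run-to-run noise, while
results for different np will still differ in the last digits because the
partial sums themselves change.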