> This is a problem of numerical stability, and there is no solution for
> such a problem in MPI. Usually, preconditioning the input matrix
> improves the numerical stability.
It could be a numerical stability issue, but that would imply that I have an
ill-conditioned matrix. This is not my case.

> If you read the MPI standard, there is a __short__ section about what
> guarantees the MPI collective communications provide. There is only
> one: if you run the same collective twice, on the same set of nodes
> with the same input data, you will get the same output. In fact the
> main problem is that MPI considers all default operations (MPI_OP) as
> being commutative and associative, which is usually the case in the
> real world but not when floating point rounding is around. When you
> increase the number of nodes, the data will be spread in smaller
> pieces, which means more operations will have to be done in order to
> achieve the reduction, i.e. more rounding errors might occur and so on.

You would have a point if I saw these small differences in both matrices.
I am solving the system Ax=b with the MUMPS library. The construction of
the matrix A and the column vector b is distributed among np CPUs. The
matrix A is the same whether I use 2 CPUs or np CPUs, but the vector b
changes slightly when I use more than 2 CPUs. My data are not spread in
smaller pieces!!

I am using the FEM to solve the system of equations, and I use MPI to
partition the domain. Therefore, the data (i.e., the vector of unknowns)
is the same on all the CPUs, and each CPU constructs a portion of the
matrices A and b. Then, on the host CPU I add all these pieces into A
and b.

Thank you,
Vasilis

>
> Thanks,
>    george.
>
> On May 27, 2009, at 11:16 , vasilis wrote:
>
> >> Rank 0 accumulates all the res_cpu values into a single array, res.
> >> It starts with its own res_cpu and then adds all other processes.
> >> When np=2, that means the order is prescribed. When np>2, the order
> >> is no longer prescribed and some floating-point rounding variations
> >> can start to occur.
> >
> > Yes, you are right. Now the question is: why would these
> > floating-point rounding variations occur for np>2? It cannot be due
> > to the order not being prescribed!!
> >
> >> If you want results to be more deterministic, you need to fix the
> >> order in which res is aggregated. E.g., instead of using
> >> MPI_ANY_SOURCE, loop over the peer processes in a specific order.
> >>
> >> P.S. It seems to me that you could use MPI collective operations to
> >> implement what you're doing. E.g., something like:
> >
> > I could use these operations for the res variable (will it make the
> > summation any faster?). But I cannot use them for the other 3
> > variables.
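For what it's worth, the rounding effect described above is easy to reproduce
outside MPI. A minimal C illustration (a constructed example, not code from
this thread) showing that floating-point addition is not associative, so a
different summation order can legitimately change the last digits of a
reduction:

    #include <stdio.h>

    int main(void)
    {
        /* Constructed values: the two large terms cancel, so the order in
         * which the small term is added decides whether it survives the
         * rounding. */
        double a = 1.0e16, b = -1.0e16, c = 1.0;

        printf("(a + b) + c = %.17g\n", (a + b) + c);   /* prints 1 */
        printf("a + (b + c) = %.17g\n", a + (b + c));   /* prints 0 */
        return 0;
    }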
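And to make the suggestion quoted above (loop over the peer processes in a
specific order instead of MPI_ANY_SOURCE) concrete, here is a rough C sketch
of a fixed-order accumulation. Only a sketch under my own assumptions: the
names res and res_cpu come from the thread, but the length n, the message
tag, and the function name are invented, and the real code may well look
different.

    #include <stdlib.h>
    #include <mpi.h>

    /* Sketch only: rank 0 receives the partial results in a fixed rank
     * order instead of MPI_ANY_SOURCE, so the sum is always formed as
     * res_cpu(0) + res_cpu(1) + ... and no longer depends on message
     * arrival order. */
    void accumulate_res(double *res, double *res_cpu, int n,
                        int rank, int nprocs)
    {
        if (rank == 0) {
            for (int i = 0; i < n; i++)
                res[i] = res_cpu[i];                 /* rank 0's own part */

            double *buf = malloc((size_t)n * sizeof *buf);
            for (int src = 1; src < nprocs; src++) { /* fixed source order */
                MPI_Recv(buf, n, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                for (int i = 0; i < n; i++)
                    res[i] += buf[i];
            }
            free(buf);
        } else {
            MPI_Send(res_cpu, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }

The collective alternative would be a single
MPI_Reduce(res_cpu, res, n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD); but, as
the section of the standard quoted above says, that is only guaranteed to be
reproducible for a fixed set of processes, not across different values of np.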