1.4.3 is fairly ancient.

Can you upgrade to 1.6.5?

On Jul 26, 2013, at 3:15 AM, Dusan Zoric <dusan.zo...@gmail.com> wrote:

> 
> I am running an application that performs some transformations of large matrices 
> on a 7-node cluster. The nodes are connected via QDR 40 Gbit InfiniBand. Open MPI 
> 1.4.3 is installed on the system.
> 
> The matrix transformation requires a large data exchange between nodes: at each 
> step of the algorithm, one node sends data and all the others receive it. The 
> number of processes is equal to the number of nodes used. I have to say that I am 
> relatively new to MPI, but it seemed that the ideal way to do this is with 
> MPI_Bcast.
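> 
> For reference, the pattern looks roughly like the simplified sketch below; the 
> buffer size, the root rank and the variable names are placeholders, not my 
> actual code:
> 
> /* Simplified sketch of the broadcast step described above.  The buffer
>  * size, the root rank and the name 'block' are placeholders; in the real
>  * code the buffer is a piece of the matrix and the root changes per step. */
> #include <mpi.h>
> #include <stdlib.h>
> 
> int main(int argc, char **argv)
> {
>     MPI_Init(&argc, &argv);
> 
>     int rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>     int block_elems = 1 << 20;              /* doubles per step (placeholder) */
>     double *block = malloc(block_elems * sizeof(double));
> 
>     int root = 0;                           /* the sending node for this step */
>     if (rank == root) {
>         /* ... fill 'block' with the data to distribute ... */
>     }
> 
>     /* Every process, including the root, makes the same MPI_Bcast call
>      * with identical root, count and datatype. */
>     MPI_Bcast(block, block_elems, MPI_DOUBLE, root, MPI_COMM_WORLD);
> 
>     free(block);
>     MPI_Finalize();
>     return 0;
> }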
> 
> Everything worked fine for smaller matrices. However, as the matrix size 
> increases, at some point the application hangs and stays there forever.
> 
> I am not completely sure, but it seems there are no errors in my code. I traced 
> it in detail to check whether there were any incomplete collective operations 
> before that specific call to MPI_Bcast, but everything looks fine. Also, for that 
> specific call, the root is set correctly in all processes, as are the message 
> type and size, and, of course, MPI_Bcast is called in all processes.
> 
> I also ran a lot of scenarios (running the application on matrices of different 
> sizes and with different numbers of processes) to figure out when this happens. 
> What can be observed is the following:
> 
>       • for a matrix of the same size, the application finishes successfully if 
> I decrease the number of processes
>       • however, for a given number of processes, the application hangs for a 
> slightly larger matrix
>       • for a matrix size and number of processes where the program hangs, if I 
> halve the size of the message in each MPI_Bcast call (of course the result is 
> then no longer correct), the hang does not occur (a sketch of what I mean by 
> splitting each broadcast into smaller pieces follows this list)
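> 
> To illustrate the smaller-message point, a chunked variant of the broadcast 
> would look roughly like the sketch below. The chunk size is an arbitrary example 
> value, and I have not verified that this avoids the hang; it only mirrors the 
> observation above while keeping the result correct.
> 
> /* Hypothetical workaround sketch (not verified): broadcast 'count'
>  * doubles from 'root' in pieces of at most 'chunk' elements instead of
>  * one large MPI_Bcast.  The chunk size is an arbitrary example value. */
> #include <mpi.h>
> 
> static void bcast_in_chunks(double *buf, int count, int root, MPI_Comm comm)
> {
>     const int chunk = 1 << 18;              /* 256K doubles = 2 MB per piece */
>     for (int offset = 0; offset < count; offset += chunk) {
>         int n = (count - offset < chunk) ? (count - offset) : chunk;
>         MPI_Bcast(buf + offset, n, MPI_DOUBLE, root, comm);
>     }
> }
> 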
> So, it seems to me that the problem could be in some buffers that MPI uses, and 
> maybe some default MCA parameter should be changed, but, as I said, I do not 
> have a lot of experience with MPI programming, and I have not found a solution 
> to this problem. So, the question is: has anyone had a similar problem, and does 
> anyone know whether this could be solved by setting an appropriate MCA 
> parameter, or have any other solution or explanation?
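> 
> For reference, this is the general syntax I mean for overriding an MCA parameter 
> at run time; the specific parameter, value and program name below are only an 
> illustration, since I do not know which parameter, if any, is relevant here:
> 
> # Illustration of the syntax only, not a known fix; coll_tuned_bcast_algorithm
> # selects the broadcast algorithm of the "tuned" collective component.
> mpirun --mca coll_tuned_use_dynamic_rules 1 \
>        --mca coll_tuned_bcast_algorithm 1 \
>        -np 7 ./my_app          # './my_app' is a placeholder name
> 
> # The available collective MCA parameters can be listed with:
> ompi_info --param coll all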
> 
> Thanks,
> 
> Dusan Zoric


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

