1.4.3 is fairly ancient. Can you upgrade to 1.6.5?
On Jul 26, 2013, at 3:15 AM, Dusan Zoric <dusan.zo...@gmail.com> wrote:
> 
> I am running an application that performs transformations of large matrices
> on a 7-node cluster. The nodes are connected via QDR 40 Gbit/s InfiniBand,
> and Open MPI 1.4.3 is installed on the system.
> 
> The matrix transformation requires a large data exchange between nodes: at
> each step of the algorithm, one node sends data and all the others receive.
> The number of processes equals the number of nodes used. I have to say that
> I am relatively new to MPI, but it seemed that the ideal way of doing this
> is with MPI_Bcast.
> 
> Everything worked fine for matrices that were not too large. However, once
> the matrix size increases past some point, the application hangs and stays
> there forever.
> 
> I am not completely sure, but it seems there are no errors in my code. I
> traced it in detail to check whether any collective operations were left
> uncompleted before that specific call to MPI_Bcast, but everything looks
> fine. Also, for that specific call, the root is set correctly in all
> processes, as are the message type and size, and, of course, MPI_Bcast is
> called in all processes.
> 
> I also ran a lot of scenarios (running the application on matrices of
> different sizes and changing the number of processes) to figure out when
> this happens. I observed the following:
> 
> • for a matrix of the same size, the application finishes successfully if
>   I decrease the number of processes
> • however, for a given number of processes, the application will hang for
>   some slightly larger matrix
> • for a matrix size and process count where the program hangs, if I halve
>   the size of the message in each MPI_Bcast call (of course the result will
>   not be correct), there is no hang
> 
> So it seems to me that the problem could be in some buffers that MPI uses,
> and that maybe some default MCA parameter should be changed, but, as I said,
> I do not have a lot of experience in MPI programming and I have not found a
> solution to this problem. So, the question is: has anyone had a similar
> problem, and can it be solved by setting an appropriate MCA parameter, or
> does anyone know of another solution or explanation?
> 
> Thanks,
> 
> Dusan Zoric
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
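
[Editor's note: for readers following the thread, the observation above that halving the per-call message size avoids the hang suggests a simple workaround while the underlying issue is investigated: split each large broadcast into several smaller MPI_Bcast calls. The sketch below illustrates that pattern. It is not the original poster's code; the chunk size, matrix layout, and element type are assumptions made purely for illustration.]

/* Minimal sketch of a chunked broadcast (assumptions: the matrix is a
 * contiguous array of doubles; CHUNK_ELEMS is an arbitrary illustrative
 * chunk size). Splitting one large MPI_Bcast into smaller ones is only a
 * workaround for the hang described above, not a fix of the root cause. */
#include <mpi.h>
#include <stdlib.h>

#define CHUNK_ELEMS (1L << 20)   /* ~1 million doubles per broadcast, assumed */

static void bcast_in_chunks(double *buf, long nelems, int root, MPI_Comm comm)
{
    long offset = 0;
    while (offset < nelems) {
        long n = nelems - offset;
        if (n > CHUNK_ELEMS)
            n = CHUNK_ELEMS;
        /* every rank must make the same sequence of calls with the same counts */
        MPI_Bcast(buf + offset, (int)n, MPI_DOUBLE, root, comm);
        offset += n;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long nelems = 4L * 1024 * 1024;          /* assumed total element count */
    double *matrix = malloc(nelems * sizeof(double));

    if (rank == 0) {                         /* root fills the data to send */
        for (long i = 0; i < nelems; i++)
            matrix[i] = (double)i;
    }

    bcast_in_chunks(matrix, nelems, 0, MPI_COMM_WORLD);

    free(matrix);
    MPI_Finalize();
    return 0;
}

As for the MCA-parameter question in the quoted message, the available parameters and their defaults for an installed Open MPI can be listed with ompi_info; which parameter (if any) is relevant to this particular hang is not established in this thread.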