Network saturation could produce arbitrarily long delays, but the total data load we are talking about is really small. It is the responsibility of an MPI library to do one of the following:
1) Use a reliable message protocol for each message (e.g. InfiniBand RC or TCP/IP), or
2) Detect lost packets and retransmit them if the protocol is unreliable (e.g. InfiniBand UD or UDP/IP).

It is the responsibility of an MPI library to manage any MPI or system buffers to prevent overrun. That is why I mention that 1/2 MB messages would use a rendezvous protocol. The send side would push a descriptor (called an envelope) to the receive side. The receive side would push back an OK_to_send once a matching receive was posted. The 1/2 MB of message data would not begin to flow across the network until the receive buffer was known (see the sketch appended below the quoted note).

It is also the responsibility of an MPI library to detect when MPI-level messages have become undeliverable and fail the job. Bugs are always a possibility, but unless there is something very unusual about the cluster and interconnect, or this is an unstable version of MPI, it seems very unlikely that this use of MPI_Bcast with so few tasks and only a 1/2 MB message would trip on one. 80 tasks is a very small number in modern parallel computing. Thousands of tasks involved in an MPI collective have become pretty standard.

Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846    Fax (845) 433-8363

users-boun...@open-mpi.org wrote on 08/23/2010 09:39:29 PM:

> I have had a similar load related problem with Bcast. I don't know
> what caused it though. With this one, what about the possibility of
> a buffer overrun or network saturation?
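P.S. If it helps to picture the handshake I described, here is a small user-level analogy (my own sketch, not Open MPI's internal protocol code): a synchronous-mode send is only allowed to complete once the matching receive has been posted, so the payload never moves toward an unknown buffer. The two ranks, the 512 KB size, and tag 0 are assumptions chosen just for illustration.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    enum { MSG_BYTES = 512 * 1024 };   /* roughly the 1/2 MB case discussed */
    int rank;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(MSG_BYTES);

    if (rank == 0) {
        memset(buf, 1, MSG_BYTES);
        /* Synchronous-mode send: it cannot complete until rank 1 has
           posted the matching receive, so the bulk data never heads
           toward an unknown receive buffer -- the same idea as the
           envelope / OK_to_send exchange inside the library. */
        MPI_Ssend(buf, MSG_BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, MSG_BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

The library does the analogous thing below the MPI API for large messages; the point is simply that a rendezvous-style transfer cannot flood a receiver that has no buffer ready.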
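Along the same lines, a stripped-down broadcast of roughly the size in question is easy to try on the same cluster. The 1/2 MB payload, the MPI_BYTE datatype, and root rank 0 are my guesses, not details taken from the original program:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    enum { MSG_BYTES = 512 * 1024 };   /* assumed ~1/2 MB payload */
    int rank, ntasks;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    buf = malloc(MSG_BYTES);
    if (rank == 0)
        memset(buf, 1, MSG_BYTES);     /* root fills the payload */

    /* Every task calls the collective; the library handles the
       point-to-point transfers, flow control, and any retransmission
       underneath it. */
    MPI_Bcast(buf, MSG_BYTES, MPI_BYTE, 0, MPI_COMM_WORLD);

    if (rank == ntasks - 1)
        printf("task %d of %d received first byte %d\n", rank, ntasks, buf[0]);

    free(buf);
    MPI_Finalize();
    return 0;
}

Running something like this with mpirun -np 80 across the same nodes would show whether a plain broadcast of that size has any trouble on the cluster. If this trivial case also misbehaves, that points at the cluster, interconnect, or MPI build rather than the application.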