Hello, I've been using Open MPI (version 1.3.2) for some time, but only recently have had more than 1000 cores available. My code runs fine on 1000 cores but fails when I attempt to use 1200 cores.
The only information at the time of the crash is: <program exited with code 021>. From the debugger I know the crash occurs in an MPI_Send call. After inserting printf diagnostics I know the following.

I have a master/slave application with a 'synchronization' step that occurs during initialization, in which the master uses MPI_Send to send a single integer to each of the slaves. I see most of the slaves print a diagnostic and then wait in MPI_Recv. Then I see the master finally reach this 'home-grown broadcast' and start issuing an MPI_Send to each slave. After (in this case) 1019 sends, the crash occurs.

I'm looking for information on the cause (I'm guessing some kind of message-passing buffer is being overrun) and for hints on how to avoid these types of bottlenecks when running on clusters with multiple thousands of cores.

Thanks!!

Tim Thompson