Hello,

I've been using Open MPI (version 1.3.2) for some time, but have only 
recently had access to more than 1000 cores.
My code runs fine on 1000 cores but fails when I attempt to use 1200 
cores.

The only information at the time of the crash is:  <program exited with 
code 021>.

From the debugger I know that the crash occurs in an MPI_Send call.
After inserting printf diagnostics I know the following:

I have a master/slave application with a 'synchronization' step occurring 
during initialization.
The master is using MPI_Send to send a single integer to all of the 
slaves.
I see most of the slaves print a diagnostic and then wait in 
MPI_Recv.

Then I see the master finally reach the 'home-grown broadcast' and 
start issuing an MPI_Send to each slave.
After (in this case) 1019 sends, the crash occurs.

I'm looking for information on the cause (my guess is that some kind of 
message-passing buffer is being overrun), 
and for hints on how to avoid this type of bottleneck when running on 
clusters with several thousand 
cores.

thanks !!
Tim Thompson
