Rob Latham wrote:
The standard in no way requires any overlap for either the nonblocking communication or I/O routines. There have been long and heated discussions about the "strict" versus "weak" interpretation of the progress rule and which one is "better".
Unfortunate. But with your "official" statement, I can now put that issue behind me. Thanks :)
If you want asynchronous nonblocking I/O, you might have to roll all the way back to LAM or MPICH-1.2.7, when ROMIO used its own request objects and test/wait routines on top of the aio routines.
What if you moved your MPI_File_write call into a thread? There are several ways to do this: you could, for example, use standard generalized requests and make progress with a thread -- the application writer knows a lot more about the system and how best to allocate threads than the MPI library does.
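Roughly, the generalized-request approach looks like the sketch below -- untested, the helper names (async_write_at, writer_thread, write_args) are just for illustration, and it assumes MPI was initialized with MPI_THREAD_MULTIPLE:

/* Sketch only: offload a blocking MPI_File_write_at to a pthread and expose
 * it to the application as a generalized request.  All names here are
 * illustrative, not from any real library. */
#include <mpi.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    MPI_File     fh;
    MPI_Offset   offset;
    const void  *buf;
    int          count;
    MPI_Datatype type;
    MPI_Request  greq;   /* the generalized request handed back to the app */
} write_args;

/* Callbacks required by MPI_Grequest_start */
static int query_fn(void *extra, MPI_Status *status)
{
    write_args *w = (write_args *)extra;
    MPI_Status_set_elements(status, w->type, w->count);
    MPI_Status_set_cancelled(status, 0);
    status->MPI_SOURCE = MPI_UNDEFINED;
    status->MPI_TAG    = MPI_UNDEFINED;
    return MPI_SUCCESS;
}
static int free_fn(void *extra)            { free(extra); return MPI_SUCCESS; }
static int cancel_fn(void *extra, int cpl) { (void)extra; (void)cpl; return MPI_SUCCESS; }

/* The worker thread does the blocking write, then completes the request. */
static void *writer_thread(void *arg)
{
    write_args *w = (write_args *)arg;
    MPI_File_write_at(w->fh, w->offset, w->buf, w->count, w->type,
                      MPI_STATUS_IGNORE);
    MPI_Grequest_complete(w->greq);
    return NULL;
}

/* Start a "background" write; the caller later does MPI_Wait(req, ...). */
int async_write_at(MPI_File fh, MPI_Offset offset, const void *buf,
                   int count, MPI_Datatype type, MPI_Request *req)
{
    pthread_t tid;
    write_args *w = malloc(sizeof *w);
    *w = (write_args){ fh, offset, buf, count, type, MPI_REQUEST_NULL };
    MPI_Grequest_start(query_fn, free_fn, cancel_fn, w, req);
    w->greq = *req;
    pthread_create(&tid, NULL, writer_thread, w);
    pthread_detach(tid);
    return MPI_SUCCESS;
}

The main thread can then MPI_Wait (or MPI_Test) on that request like on any other, and MPI_Wait will not return until the thread calls MPI_Grequest_complete.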
The funny thing is that my code was supposed to be an instructive demo of MPI's asynchronous I/O APIs. It's basically number crunching on a matrix that is distributed in stripes across the ranks. For the parallel I/O, all ranks write their stripe into one shared file, using a subarray data type.
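Roughly, each rank's write looks like the following sketch (simplified: NROWS, NCOLS, header_bytes, the file name and the even row split are placeholders):

/* Sketch: write this rank's row stripe of one NROWS x NCOLS matrix of
 * doubles into a shared file with a subarray filetype and a nonblocking
 * write.  Sizes and decomposition are simplified for illustration. */
#include <mpi.h>

void write_stripe(const double *local_stripe, int NROWS, int NCOLS,
                  MPI_Offset header_bytes, MPI_File *fh, MPI_Request *req)
{
    int nranks, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int sizes[2]    = { NROWS, NCOLS };           /* global matrix       */
    int subsizes[2] = { NROWS / nranks, NCOLS };  /* this rank's stripe  */
    int starts[2]   = { rank * (NROWS / nranks), 0 };

    MPI_Datatype stripe;
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &stripe);
    MPI_Type_commit(&stripe);

    MPI_File_open(MPI_COMM_WORLD, "matrix.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, fh);

    /* The displacement skips the header struct written separately by rank 0. */
    MPI_File_set_view(*fh, header_bytes, MPI_DOUBLE, stripe,
                      "native", MPI_INFO_NULL);

    /* Nonblocking write of the local stripe; the caller waits on *req. */
    MPI_File_iwrite(*fh, local_stripe, subsizes[0] * subsizes[1],
                    MPI_DOUBLE, req);

    MPI_Type_free(&stripe);
}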
Adding any kind of threading would be practical and would probably perform better, but it wouldn't show off MPI's I/O APIs. I'd rather keep the code as simple as it is, so people see the "other" benefits of MPI's APIs: they're higher-level and more convenient than rolling everything by hand.
If I may ask a slightly different question: you've got periods of I/O and periods of computation. Have you evaluated collective I/O?
I thought about it, and I know a way to make it happen too, but I put it on the "to do" pile of possible improvements for later, after I'd gotten the asynchronous I/O working. My file format is a header struct followed by two matrices (same dimensions). Right now, I write the header from rank 0 and then each rank writes one stripe per matrix, leaving two requests pending. I gather that I'd need to construct one or two more data types for split-collective I/O to be applicable, i.e., so the whole write happens in one call.
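Something like the sketch below is what I have in mind for the split-collective version -- untested; it assumes both of my local stripes are packed back-to-back in one buffer and treats the two matrices in the file as a single 2 x NROWS x NCOLS array behind the header:

/* Sketch: one 3D subarray ({matrix, row, col}) describes this rank's stripe
 * in both matrices at once, so a single begin/end pair replaces the two
 * pending requests.  Names and sizes are illustrative. */
#include <mpi.h>

void write_both_stripes(MPI_File fh, const double *local_stripes,
                        int NROWS, int NCOLS, MPI_Offset header_bytes)
{
    int nranks, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int nr = NROWS / nranks;                      /* rows per rank         */
    int sizes[3]    = { 2, NROWS, NCOLS };        /* two matrices in file  */
    int subsizes[3] = { 2, nr, NCOLS };           /* my stripe in each     */
    int starts[3]   = { 0, rank * nr, 0 };

    MPI_Datatype filetype;
    MPI_Type_create_subarray(3, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);
    MPI_File_set_view(fh, header_bytes, MPI_DOUBLE, filetype,
                      "native", MPI_INFO_NULL);
    MPI_Type_free(&filetype);

    /* One split-collective write covers both stripes. */
    MPI_File_write_all_begin(fh, local_stripes, 2 * nr * NCOLS, MPI_DOUBLE);

    /* ... computation that does not touch local_stripes ... */

    MPI_Status status;
    MPI_File_write_all_end(fh, local_stripes, &status);
}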
I know you are eager to hide I/O in the background -- to get it for free -- but there's no such thing as a free lunch. Background I/O might still perturb your computation phase, unless you make zero MPI calls in your computational phase. Collective I/O can bring some fairly powerful optimizations to the table and reduce your overall I/O costs, perhaps even reducing them enough that you no longer miss true asynchronous I/O?
I'll give that a try then. Thanks, Chris