I'm trying to overlap MatAssembly with other computations as much as possible, and I'm finding a confusing result. If I follow my AssemblyBegin with an immediate AssemblyEnd, assembling the matrix takes 31 seconds. If I instead interleave a 10-minute computation between AssemblyBegin and AssemblyEnd, executing AssemblyEnd alone still takes 27 seconds. So the whole transaction takes 31 seconds when run back-to-back, yet even after 10 minutes of intervening compute I'm still stuck with 27 seconds of wait time.
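For reference, here is a minimal sketch of the overlap pattern and timing measurement I'm describing. It assumes PETSc is available; the matrix `A`, the routine name, and the placeholder compute step are just illustrative, and the exact error-checking style varies by PETSc version:

```c
/* Sketch: overlap MatAssembly communication with local computation,
 * then time how long MatAssemblyEnd still blocks. Assumes PETSc;
 * assemble_with_overlap and the compute placeholder are hypothetical. */
#include <petscmat.h>
#include <petsctime.h>

PetscErrorCode assemble_with_overlap(Mat A)
{
  PetscErrorCode ierr;
  PetscLogDouble t0, t1;

  /* Kick off communication of off-process values. */
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* ... ~10 minutes of unrelated local computation here ... */

  /* Measure how long completing the assembly still takes. */
  ierr = PetscTime(&t0);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = PetscTime(&t1);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "AssemblyEnd took %g s\n",
                     (double)(t1 - t0));CHKERRQ(ierr);
  return 0;
}
```

One caveat worth noting: whether any communication actually progresses between the two calls depends on the MPI implementation, since many MPI libraries only make progress on outstanding messages while the process is inside an MPI call.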
Now clearly, from an optimization standpoint, 27 seconds in the presence of 10-minute computations is not something to waste brain cycles on. Nevertheless, I'm always curious about discrepancies between the world in my head and the actual world ;-) Here are some points that may be of interest:

* I'm using a debug build of PETSc. I wouldn't guess this makes much difference as long as BLAS and LAPACK are optimized.
* One node does not participate in the computation; instead it acts as a work queue, doling out work whenever a "worker" processor becomes available. As such, the "server" node makes a lot of calls to MPI_Iprobe. Could this be interfering with PETSc's background use of MPI?

Thanks,
-Andrew
