I'm running OpenMPI 2.1.0 on RHEL 7 using TCP communication. For the specific run that's crashing on me, I'm running with 17 ranks (on 17 different physical machines). I've got a stage in my application where ranks need to transfer chunks of data, where each chunk is small (on the order of 100 MB) compared to the overall imagery. However, the chunks are spread out across many buffers in a way that makes the indexing complicated (the memory is not all within a single buffer), so the simplest way to express the data movement in code is a large number of MPI_Isend() and MPI_Irecv() calls followed, of course, by an eventual MPI_Waitall(). This works fine for many cases, but I've now run into a case where the chunks are imbalanced such that a few ranks have a total of ~450 MPI_Request objects (I do a single MPI_Waitall() with all requests at once) while the remaining ranks have < 10. In this scenario, I get a seg fault inside PMPI_Waitall().
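In case it helps, here's roughly the pattern in question (a simplified sketch, not my actual code; the function name and the buffer/count/peer arrays are made up, and the real code indexes into many separate image buffers):

    #include <mpi.h>
    #include <stdlib.h>

    /* Simplified sketch of the transfer stage: post one nonblocking recv/send
     * per chunk, then complete everything with a single MPI_Waitall(). */
    static void exchange_chunks(void **recv_bufs, const int *recv_counts,
                                const int *recv_peers, int n_recvs,
                                void **send_bufs, const int *send_counts,
                                const int *send_peers, int n_sends)
    {
        int n_reqs = n_recvs + n_sends;
        MPI_Request *reqs = malloc(n_reqs * sizeof(MPI_Request));
        int r = 0;

        /* Post all receives (one per incoming chunk). */
        for (int i = 0; i < n_recvs; ++i)
            MPI_Irecv(recv_bufs[i], recv_counts[i], MPI_BYTE, recv_peers[i],
                      0, MPI_COMM_WORLD, &reqs[r++]);

        /* Post all sends (one per outgoing chunk). */
        for (int i = 0; i < n_sends; ++i)
            MPI_Isend(send_bufs[i], send_counts[i], MPI_BYTE, send_peers[i],
                      0, MPI_COMM_WORLD, &reqs[r++]);

        /* On the heavily loaded ranks this is ~450 requests; the seg fault
         * shows up inside PMPI_Waitall() here. */
        MPI_Waitall(n_reqs, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    }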
Is there an implementation limit on how many asynchronous requests are allowed? Is there a way to query it, either via a #define value or a runtime call? I probably won't go this route, but is there a configure option to increase it when initially compiling OpenMPI? I've done a fair amount of debugging and am pretty confident this is where the error is occurring, as opposed to indexing out of bounds somewhere, but if there is no such limit in OpenMPI, that would be useful to know too.

Thanks.
-Adam
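P.S. One debugging variation I've been considering (haven't tried it yet) is completing the already-posted requests in smaller batches rather than one big MPI_Waitall(), roughly like the sketch below. The helper name and batch size are arbitrary, and this only changes how many requests a single MPI_Waitall() call sees, not how many are posted at once, but it might help narrow down where things go wrong:

    #include <mpi.h>

    /* Complete an array of already-posted requests in fixed-size batches
     * instead of one MPI_Waitall() over the whole array. */
    static void waitall_in_batches(MPI_Request *reqs, int n_reqs, int batch)
    {
        for (int start = 0; start < n_reqs; start += batch) {
            int count = (n_reqs - start < batch) ? (n_reqs - start) : batch;
            MPI_Waitall(count, &reqs[start], MPI_STATUSES_IGNORE);
        }
    }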