Hi,

In the attached program, the MPI_Allgather() call fails to communicate all of the data: the amount actually communicated appears to wrap around at 4G... I'm running on an Omni-Path cluster (2018 hardware), with openmpi 3.1.3 and 4.0.1 (both tested).
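For concreteness, here is the size arithmetic. The 32-bit truncation below is only my guess at what might be going on, but the numbers do line up with the failed_offset values further down:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* per-peer contribution in the failing run: 0x10001 chunks of 0x10000 bytes */
    uint64_t per_peer = (uint64_t)0x10001 * 0x10000;  /* 0x100010000 = 4295032832 bytes */
    /* what a hypothetical 32-bit length field would keep of it */
    uint32_t low32 = (uint32_t)per_peer;              /* 0x10000 = 65536 bytes */
    printf("per peer 0x%llx bytes, low 32 bits 0x%x bytes\n",
           (unsigned long long)per_peer, (unsigned)low32);
    return 0;
}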
With the OFI mtl the failure is silent: no error message is reported at all, which is very annoying. With the PSM2 mtl we at least get a message telling us that 4G is a limit.

I have tested various combinations of mca parameters. The one configuration bit that seems to make the test pass is the explicit selection of the ob1 pml (the exact invocation is in the P.S. below). I have to select it explicitly, because otherwise cm is selected instead (priority 40 vs 20, it seems) and the program fails. I don't know to what extent the cm pml is the root cause, or whether I'm witnessing a side effect of something else.

openmpi-3.1.3 (debian10 package openmpi-bin-3.1.3-11):

node0 ~ $ mpiexec -machinefile /tmp/hosts --map-by node -n 2 ./a.out
MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 * 0x100010000 bytes: ...
Message size 4295032832 bigger than supported by PSM2 API. Max = 4294967296
MPI error returned:
MPI_ERR_OTHER: known error not in list
MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 * 0x100010000 bytes: NOK
[node0.localdomain:14592] 1 more process has sent help message help-mtl-psm2.txt / message too big
[node0.localdomain:14592] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

node0 ~ $ mpiexec -machinefile /tmp/hosts --map-by node -n 2 --mca mtl ofi ./a.out
MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 * 0x100010000 bytes: ...
MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 * 0x100010000 bytes: NOK
node 0 failed_offset = 0x100020000
node 1 failed_offset = 0x10000

I attached the corresponding outputs with some mca verbose parameters turned on, plus ompi_info, as well as variations of the pml layer (ob1 works). openmpi-4.0.1 gives essentially the same results (similar files attached), but with various doubts on my part as to whether I ran this check correctly. Here are my doubts:

- whether or not I should have a ucx build for an Omni-Path cluster (IIUC https://github.com/openucx/ucx/issues/750 is now fixed?),
- which btl I should use (I understand that openib is headed for deprecation, and it complains unless I pass --mca btl openib --mca btl_openib_allow_ib true; fine. But then, which non-openib, non-tcp btl should I use instead?),
- which layers matter and which matter less... I tinkered with btl, pml, and mtl.

It's fine if there are multiple choices, but if some combinations lead to silent data corruption, that's not really cool. Could the error reporting in this case be improved somehow? I'd be glad to provide more feedback if needed.

E.
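P.S. For reference, the invocation that passes on my setup is along these lines (same machinefile and mapping as above, with the pml forced to ob1):

node0 ~ $ mpiexec -machinefile /tmp/hosts --map-by node -n 2 --mca pml ob1 ./a.out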
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

long failed_offset = 0;

size_t chunk_size = 1 << 16;
size_t nchunks = (1 << 16) + 1;

int main(int argc, char * argv[])
{
    if (argc >= 2) chunk_size = atol(argv[1]);
    if (argc >= 3) nchunks = atol(argv[2]);   /* was argv[1]: the second argument sets nchunks */

    MPI_Init(&argc, &argv);

    /*
     * The check below sets err to:
     *  0 on success.
     *  a non-zero MPI Error code if MPI_Allgather returned one.
     *  -1 if no MPI Error code was returned, but the result of Allgather
     *     was wrong.
     *  -2 if memory allocation failed.
     *
     * (note that the MPI document guarantees that MPI error codes are
     * positive integers)
     */

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int err;

    char * check_text;
    int rc = asprintf(&check_text,
            "MPI_Allgather, %d nodes, 0x%zx chunks of 0x%zx bytes, total %d * 0x%zx bytes",
            size, nchunks, chunk_size, size, chunk_size * nchunks);
    if (rc < 0) abort();

    if (!rank) printf("%s: ...\n", check_text);

    MPI_Datatype mpi_ft;
    MPI_Type_contiguous(chunk_size, MPI_BYTE, &mpi_ft);
    MPI_Type_commit(&mpi_ft);

    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    void * data = malloc(nchunks * size * chunk_size);
    int alloc_ok = data != NULL;
    /* make sure all ranks managed to allocate before touching the buffer */
    MPI_Allreduce(MPI_IN_PLACE, &alloc_ok, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);

    if (alloc_ok) {
        memset(data, 0, nchunks * size * chunk_size);
        /* each rank marks its own contribution with 0x42 */
        memset(((char*)data) + nchunks * chunk_size * rank, 0x42, nchunks * chunk_size);
        err = MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                data, nchunks, mpi_ft, MPI_COMM_WORLD);
        if (err == 0) {
            /* any remaining zero byte means some data was not communicated */
            void * p = memchr(data, 0, nchunks * size * chunk_size);
            if (p != NULL) {
                /* We found a zero, we shouldn't ! */
                err = -1;
                failed_offset = ((char*)p) - (char*)data;
            }
        }
    } else {
        err = -2;
    }

    if (data) free(data);

    MPI_Type_free(&mpi_ft);

    if (!rank) {
        printf("%s: %s\n", check_text, err == 0 ? "ok" : "NOK");
    }

    if (err == -2) {
        puts("Could not allocate memory buffer");
    } else if (err != 0) {
        int someone_has_minusone = (err == -1);
        MPI_Allreduce(MPI_IN_PLACE, &someone_has_minusone, 1, MPI_INT, MPI_MAX,
                MPI_COMM_WORLD);
        if (someone_has_minusone) {
            long * offsets = malloc(size * sizeof(long));
            offsets[rank] = failed_offset;
            MPI_Gather(&failed_offset, 1, MPI_LONG, offsets, 1, MPI_LONG, 0,
                    MPI_COMM_WORLD);
            if (!rank) {
                for(int i = 0 ; i < size ; i++) {
                    printf("node %d failed_offset = 0x%lx\n", i, offsets[i]);
                }
            }
            free(offsets);
        }

        if (!rank) {
            if (err > 0) {
                /* return an MPI Error if we've got one. */
                /* we often get MPI_ERR_OTHER... mostly useless */
                char error[1024];
                int errorlen = sizeof(error);
                MPI_Error_string(err, error, &errorlen);
                printf("MPI error returned:\n%s\n", error);
            }
        }
    }
    free(check_text);
    MPI_Finalize();
}
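In case it matters, the reproducer is built with plain mpicc, nothing special (the file name below is just whatever the source is saved as; the default a.out output is what the mpiexec invocations above run):

node0 ~ $ mpicc -O2 allgather_bug.c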
Attachment: ompi_bug_20190806.tar.gz