On Tue, Aug 6, 2019 at 9:54 AM Emmanuel Thomé via users <
users@lists.open-mpi.org> wrote:
> Hi,
>
> In the attached program, the MPI_Allgather() call fails to communicate
> all data (the amount it communicates wraps around at 4G...). I'm running
> on an omnipath cluster (2018 hardware), openmpi 3.1.3 or 4.0.1 (tested
> both).
>
> With the OFI mtl, the failure is silent, with no error message reported.
> This is very annoying.
>
> With the PSM2 mtl, we have at least some info printed that 4G is a limit.
>
> I have tested it with various combinations of mca parameters. It seems
> that the one config bit that makes the test pass is the selection of the
> ob1 pml. However, I have to select it explicitly, because otherwise cm is
> selected instead (priority 40 vs 20, it seems), and the program fails. I
> don't know to what extent the cm pml is the root cause, or whether I'm
> witnessing a side effect of something else.
>
> openmpi-3.1.3 (debian10 package openmpi-bin-3.1.3-11):
>
> node0 ~ $ mpiexec -machinefile /tmp/hosts --map-by node -n 2 ./a.out
> MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 *
> 0x100010000 bytes: ...
> Message size 4295032832 bigger than supported by PSM2 API. Max =
> 4294967296
> MPI error returned:
> MPI_ERR_OTHER: known error not in list
> MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 *
> 0x100010000 bytes: NOK
> [node0.localdomain:14592] 1 more process has sent help message
> help-mtl-psm2.txt / message too big
> [node0.localdomain:14592] Set MCA parameter "orte_base_help_aggregate"
> to 0 to see all help / error messages
>
> node0 ~ $ mpiexec -machinefile /tmp/hosts --map-by node -n 2 --mca
> mtl ofi ./a.out
> MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 *
> 0x100010000 bytes: ...
> MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 *
> 0x100010000 bytes: NOK
> node 0 failed_offset = 0x100020000
> node 1 failed_offset = 0x10000
>
> I attached the corresponding outputs with some mca verbose
> parameters on, plus ompi_info, as well as variations of the pml layer
> (ob1 works).
>
> openmpi-4.0.1 gives essentially the same results (similar files
> attached), but with various doubts on my part as to whether I've run this
> check correctly. Here are my doubts:
> - whether or not I should have a UCX build for an Omni-Path cluster
> (IIUC https://github.com/openucx/ucx/issues/750 is now fixed?),
>
UCX is not optimized for Omni Path. Don't use it.
> - which btl I should use (I understand that openib is headed for
> deprecation and complains unless I do --mca btl openib --mca
> btl_openib_allow_ib true; fine. But then, which non-openib, non-tcp
> btl should I use instead?)
>
OFI->PSM2 and PSM2 are the right conduits for Omni Path.
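Something along these lines should pin the selection explicitly (the exact
command lines are mine and untested here; the pml/mtl MCA parameters are the
standard ones you already used above):

  mpiexec --mca pml cm --mca mtl psm2 ./a.out
  mpiexec --mca pml cm --mca mtl ofi ./a.out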
> - which layers matter, which ones matter less... I tinkered with btl,
> pml, and mtl. It's fine if there are multiple choices, but if some
> combinations lead to silent data corruption, that's not really
> cool.
>
It sounds like Open MPI doesn't properly handle messages larger than PSM2's
maximum transfer size. One way to work around this is to wrap your MPI
collective calls and do the <4 GiB chunking yourself.
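A minimal sketch of that idea, assuming contiguous MPI_BYTE data; the
function name and the 1 GiB per-call cap are mine, not anything from Open
MPI, and the cap is just chosen to stay well under PSM2's 4 GiB limit:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-call cap, well below PSM2's 4 GiB limit. */
#define CHUNK_BYTES ((size_t)1 << 30)

/* Allgather `count` bytes per rank, split so that no single MPI_Allgather
 * moves more than CHUNK_BYTES per rank. recvbuf must hold nprocs * count
 * bytes, laid out exactly as a plain MPI_Allgather would leave them. */
static int allgather_chunked(const char *sendbuf, char *recvbuf,
                             size_t count, MPI_Comm comm)
{
    int nprocs, rc;
    MPI_Comm_size(comm, &nprocs);
    if (count == 0)
        return MPI_SUCCESS;

    size_t chunk = count < CHUNK_BYTES ? count : CHUNK_BYTES;
    char *scratch = malloc((size_t)nprocs * chunk);
    if (scratch == NULL)
        return MPI_ERR_NO_MEM;

    for (size_t off = 0; off < count; off += chunk) {
        size_t len = count - off < chunk ? count - off : chunk;

        /* Each call stays below the PSM2 per-message limit. */
        rc = MPI_Allgather(sendbuf + off, (int)len, MPI_BYTE,
                           scratch, (int)len, MPI_BYTE, comm);
        if (rc != MPI_SUCCESS) {
            free(scratch);
            return rc;
        }

        /* Re-interleave: rank r's slice of this chunk belongs at
         * r * count + off in the final receive buffer. */
        for (int r = 0; r < nprocs; r++)
            memcpy(recvbuf + (size_t)r * count + off,
                   scratch + (size_t)r * len, len);
    }
    free(scratch);
    return MPI_SUCCESS;
}

A derived datatype with a resized extent would avoid the extra copy, but the
plain version above is easier to audit.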
Jeff
> Could the error reporting in this case be somehow improved?
>
> I'd be glad to provide more feedback if needed.
>
> E.
--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users