Hello All,

I have a RoCE interoperability event starting next week, and I am trying to help a new vendor get their device ready. I was wondering if anyone had ideas that could help.
I am using:

  * Open MPI 2.1
  * Intel MPI Benchmarks 2018
  * OFED 3.18 (vendor requirement)
  * SLES 11 SP3 (vendor requirement)

The problem seems to be that the device does not handle larger message sizes well. I am sure the vendor will be working on that, but in the meantime I am hoping there is a way to complete an IMB run with some Open MPI parameter tweaking (I have sketched what I was planning to try in the P.S. below).

Sample of IMB output from the Sendrecv benchmark:

        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  Mbytes/sec
        262144          160       131.07       132.24       131.80     3964.56
        524288           80       277.42       284.57       281.57     3684.71
       1048576           40       461.16       474.83       470.02     4416.59
       2097152            3      1112.15   4294965.49   2147851.04        0.98
       4194304            2      2815.25   8589929.73   3222731.54        0.98

The last two rows (2 MiB and 4 MiB messages) are what look like the problematic results. This happens on many of the benchmarks at larger message sizes and causes either a major slowdown, or the job aborts with the error:

  The InfiniBand retry count between two MPI processes has been exceeded.

If anyone has thoughts on how I can complete the benchmarks without the job aborting, I would appreciate it. If anyone has ideas as to why a RoCE device might show this issue, I would take any information on offer. If more data is required, please let me know what is relevant.

Thank you,
Brendan T. W. Myers
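P.S. In case it helps frame an answer, this is roughly what I was planning to try, based on my reading of the IMB and Open MPI documentation rather than anything I have verified on this fabric. The process count and hostfile below are placeholders for my setup, and I am assuming the openib BTL is the one carrying the RoCE traffic.

  # Keep IMB below the message sizes where the device struggles
  # (-msglog 0:20 caps the run at 2^20 = 1 MiB messages):
  mpirun -np 2 --hostfile hosts ./IMB-MPI1 Sendrecv -msglog 0:20

  # Or try to ride out the large-message runs by raising the openib
  # BTL's ACK timeout (valid values 0-31, default 20) and keeping the
  # retry count at its maximum of 7:
  mpirun -np 2 --hostfile hosts \
      --mca btl_openib_ib_timeout 30 \
      --mca btl_openib_ib_retry_count 7 \
      ./IMB-MPI1 Sendrecv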