You probably want to check that lossless Ethernet is enabled everywhere (that's a common problem I've seen); otherwise you end up with timeouts and retransmissions.
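As a quick sanity check on the lossless-Ethernet side, you can at least look at the pause-frame settings on each port. This is a sketch assuming a Linux host; "eth2" is a placeholder for your actual RoCE interface, and per-priority flow control (PFC) tooling is vendor-specific (mlnx_qos is the Mellanox/NVIDIA utility -- check with your vendor for the equivalent):

```shell
# Global (802.3x) pause-frame settings on the RoCE port
# ("eth2" is a placeholder; substitute your interface name):
ethtool -a eth2

# Per-priority flow control (PFC) is configured per traffic class and
# the tooling is vendor-specific; on Mellanox/NVIDIA NICs, for example:
mlnx_qos -i eth2
```

Note that for RoCE you generally need PFC enabled on the priority carrying RoCE traffic on every switch port in the path, not just on the end hosts.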
Check with your vendor on how to do layer-0 diagnostics, etc. Also, if this is a new vendor, they should probably try running this themselves -- IMB is fairly abusive to the network stack and turns up many bugs in lower layers (drivers, firmware, etc.).

> On Oct 10, 2017, at 3:29 PM, Brendan Myers <brendan.my...@soft-forge.com> wrote:
>
> Hello All,
> I have a RoCE interoperability event starting next week and I was wondering if anyone had any ideas to help me with a new vendor I am trying to help get ready.
> I am using:
> · Open MPI 2.1
> · Intel MPI Benchmarks 2018
> · OFED 3.18 (requirement from vendor)
> · SLES 11 SP3 (requirement from vendor)
>
> The problem seems to be that the device does not handle larger message sizes well, and I am sure they will be working on this, but I am hoping there may be a way to complete an IMB run with some Open MPI parameter tweaking.
> Sample of IMB output from a Sendrecv benchmark:
>
>        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  Mbytes/sec
>        262144          160       131.07       132.24       131.80     3964.56
>        524288           80       277.42       284.57       281.57     3684.71
>       1048576           40       461.16       474.83       470.02     4416.59
>       2097152            3      1112.15   4294965.49   2147851.04        0.98
>       4194304            2      2815.25   8589929.73   3222731.54        0.98
>
> The problematic results (shown in red text in the original message) are the last two rows. This happens on many of the benchmarks at larger message sizes and causes either a major slowdown or the job aborting with the error:
>
>     The InfiniBand retry count between two MPI processes has been exceeded.
>
> If anyone has any thoughts on how I can complete the benchmarks without the job aborting, I would appreciate it. If anyone has ideas as to why a RoCE device might show this issue, I would take any information on offer. If more data is required, please let me know what is relevant.
>
> Thank you,
> Brendan T. W. Myers

--
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
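On the "parameter tweaking" question: to get a complete IMB run past the retry-count abort, one approach is to raise the openib BTL's retry count and ACK timeout, and/or cap IMB's message sizes below the failing range. A sketch, assuming Open MPI 2.1's openib BTL and a hypothetical hostfile named "hosts"; the specific values are illustrative, not tuned:

```shell
# Raise the RC QP retry count (max 7) and the ACK timeout exponent
# (timeout = 4.096 us * 2^N) so large transfers get more chances to
# complete before the "retry count exceeded" abort:
mpirun -np 2 --hostfile hosts \
    --mca btl openib,self,vader \
    --mca btl_openib_ib_retry_count 7 \
    --mca btl_openib_ib_timeout 22 \
    IMB-MPI1 Sendrecv

# Alternatively (or additionally), cap IMB's message sizes so the run
# completes without ever hitting the failing 2 MB / 4 MB lengths;
# -msglog 2:20 runs message sizes from 2^2 up to 2^20 bytes (1 MB):
mpirun -np 2 --hostfile hosts IMB-MPI1 -msglog 2:20 Sendrecv
```

This only works around the symptom, of course -- the giant t_max values at 2 MB+ still point at something in the device or driver stalling on large transfers, which the vendor will need to debug.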