You probably want to check that lossless Ethernet is enabled everywhere 
(that's a common problem I've seen); otherwise, you end up with timeouts 
and retransmissions.
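
A quick host-side sanity check (just a sketch; the interface name is a 
placeholder, and per-priority settings are usually configured on the switch 
or with vendor tools):

    # show global pause-frame settings on the RoCE port
    ethtool -a eth2
    # if you're using per-priority flow control (PFC) instead of global
    # pause, confirm that the priority your RoCE traffic is mapped to is
    # lossless end-to-end (NIC and every switch in the path)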

Check with your vendor on how to do layer-0 diagnostics, etc.
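
From the host side, a basic link check looks something like this (not a 
substitute for the vendor's own diagnostics; the device name is a 
placeholder):

    # for RoCE, the port should report state PORT_ACTIVE and
    # link_layer Ethernet
    ibv_devinfo -d mlx4_0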

Also, since this is a new vendor, they should probably try running IMB 
themselves -- it is fairly abusive to the network stack and tends to turn up 
bugs in the lower layers (drivers, firmware, etc.).
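
In the meantime, one thing that sometimes lets a run finish past the 
retry-count abort is raising the QP timeout and retry count -- a sketch, 
assuming the openib BTL is handling the RoCE traffic (the process count and 
benchmark path are placeholders; maxima come from the IB spec):

    # timeout is 4.096 usec * 2^value (31 is the max); retry count caps at 7
    mpirun -np 2 \
      --mca btl openib,vader,self \
      --mca btl_openib_ib_timeout 31 \
      --mca btl_openib_ib_retry_count 7 \
      ./IMB-MPI1 Sendrecv

That only papers over whatever is dropping packets at large message sizes, 
though; the real fix needs to happen in the lower layers.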


> On Oct 10, 2017, at 3:29 PM, Brendan Myers <brendan.my...@soft-forge.com> 
> wrote:
> 
> Hello All,
> I have a RoCE interoperability event starting next week, and I was wondering 
> if anyone had ideas to help me get a new vendor ready. 
> I am using:
> - Open MPI 2.1
> - Intel MPI Benchmarks 2018
> - OFED 3.18 (requirement from vendor)
> - SLES 11 SP3 (requirement from vendor)
>  
> The problem seems to be that the device does not handle larger message sizes 
> well, and I am sure they will be working on this, but I am hoping there may be 
> a way to complete an IMB run with some Open MPI parameter tweaking.
> Sample of IMB output from a Sendrecv benchmark:
>  
>        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
>        262144          160       131.07       132.24       131.80      3964.56
>        524288           80       277.42       284.57       281.57      3684.71
>       1048576           40       461.16       474.83       470.02      4416.59
>       2097152            3      1112.15   4294965.49   2147851.04         0.98
>       4194304            2      2815.25   8589929.73   3222731.54         0.98
>  
> The last two rows (2097152 and 4194304 bytes) show the problematic results. 
> This happens on many of the benchmarks at larger message sizes and causes 
> either a major slowdown or the job aborts with the error:
>  
> The InfiniBand retry count between two MPI processes has been exceeded.
>  
> If anyone has any thoughts on how I can complete the benchmarks without the 
> job aborting, I would appreciate it. If anyone has ideas as to why a RoCE 
> device might show this issue, I would welcome any information. If more 
> data is required, please let me know what is relevant.
>  
>  
> Thank you,
> Brendan T. W. Myers
>  
>  


-- 
Jeff Squyres
jsquy...@cisco.com

