We have a user whose code reliably dies at scale with the errors (new hosts
each time):
We have been using this option for this code:
-mca btl_openib_receive_queues X,4096,128:X,12288,128:X,65536,12
Without that option, it reliably dies with an out-of-memory message.
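
For anyone reading the queue value above, here is a rough sketch that unpacks it and adds up the receive buffers it asks for. It assumes each colon-separated entry has the form type,buffer size in bytes,number of buffers (with optional trailing fields ignored); the helper name parse_receive_queues is made up for illustration. It only does the arithmetic for one set of queues and does not model how many such sets Open MPI allocates per process or per peer.

```python
# Sketch: unpack a btl_openib_receive_queues value and total its receive buffers.
# Assumption: each entry is "type,buffer_size_bytes,num_buffers[,optional fields...]".

def parse_receive_queues(spec):
    """Split a receive-queue spec into one dict per queue (hypothetical helper)."""
    queues = []
    for entry in spec.split(":"):
        fields = entry.split(",")
        size = int(fields[1])   # buffer size in bytes (assumed meaning)
        count = int(fields[2])  # number of buffers (assumed meaning)
        queues.append({
            "type": fields[0],
            "size": size,
            "count": count,
            "bytes": size * count,
        })
    return queues

SPEC = "X,4096,128:X,12288,128:X,65536,12"  # value from the post
queues = parse_receive_queues(SPEC)

for q in queues:
    print(f"{q['type']}: {q['count']} x {q['size']} B = {q['bytes']} B")

total = sum(q["bytes"] for q in queues)
print(f"total: {total} bytes ({total / 2**20:.2f} MiB)")
```

Under those assumptions the three queues come to 2883584 bytes (2.75 MiB) of receive buffers per set.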
Note this code runs fine at the same scale