Hello.

Sorry for the delay in confirming the minimum load that would trigger
the RnR error; the holidays here were a significant interruption.

On Mon, Dec 19, 2011, at 03:30 PM, Yevgeny Kliteynik wrote:

> What's the smallest number of nodes that are needed to reproduce this
> problem? Does it happen with just two HCAs, one process per node?

Our nodes with these HCAs are dual-socket, with 4 Intel cores per
socket (8 cores per node).

Working with the users, we were unable to reproduce the issue with
fewer than 3 nodes and 17 processes total, with no nodes
oversubscribed.  That is, two nodes were running 8 processes each and
the third was running 1 process.

It could be some sort of race condition or timing issue that could
theoretically be triggered with fewer nodes or processes than this,
but we weren't able to provoke it.

> Let's get you to the latest firmware GA of this card.

Just as a reminder, I responded to the firmware part of this earlier:
http://www.open-mpi.org/community/lists/users/2011/12/18014.php

Thank you,

V. Ram

-- 
http://www.fastmail.fm - Access your email from home and the web
