Hi Leonardo,

Leonardo Fialho wrote:
NETDEV WATCHDOG: eth0: transmit timed out
tg3: eth0: transmit timed out, resetting
tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: eth0: Link is down.
tg3: eth0: Link is up at 1000 Mbps, full duplex.

The tg3 driver times out because the transmit path is stuck. The cause can be an interrupt problem or bad hardware flow control on the switch. Since it works again after the driver resets the link, it looks like either the switch's flow control is broken (try turning it off, or test between two nodes connected back-to-back) or another node has stopped consuming data.
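
To compare how often the timeout fires before and after a change (flow control off, back-to-back test), something like the following could count the watchdog events per interface in saved kernel log text. This is only a rough sketch: the function name count_tx_timeouts and the idea of feeding it log text captured to a string are assumptions for illustration, and the regular expression simply matches the "NETDEV WATCHDOG" line quoted above.

```python
import re
from collections import Counter

# Matches the watchdog line from the quoted log, e.g.
# "NETDEV WATCHDOG: eth0: transmit timed out"
WATCHDOG_RE = re.compile(r"NETDEV WATCHDOG: (\w+): transmit timed out")


def count_tx_timeouts(log_text):
    """Return a Counter mapping interface name to its number of watchdog timeouts.

    Hypothetical helper: log_text is assumed to be kernel log output
    that was saved to a string beforehand.
    """
    return Counter(m.group(1) for m in WATCHDOG_RE.finditer(log_text))


if __name__ == "__main__":
    # Sample taken from the log lines quoted in this thread.
    sample = (
        "NETDEV WATCHDOG: eth0: transmit timed out\n"
        "tg3: eth0: transmit timed out, resetting\n"
        "tg3: eth0: Link is down.\n"
        "tg3: eth0: Link is up at 1000 Mbps, full duplex.\n"
    )
    for iface, count in count_tx_timeouts(sample).items():
        print(iface, count)
```

If the count drops to zero with flow control disabled or in the back-to-back setup, that points at the switch rather than at the NIC or the driver.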

Open-MPI may generate enough contention to trigger the problem, but I don't think the problem is directly related to Open-MPI.

Patrick
