Hi,

>>>>> By any chance is it a particular node (or pair of nodes) this seems to
>>>>> happen with?
>>>>
>>>> No.  I've got 40 nodes total with this hardware configuration, and the
>>>> problem has been seen on most/all nodes at one time or another.  It
>>>> doesn't seem, based on the limited number of observable parameters I'm
>>>> aware of, to be dependent on the number of nodes involved.
What's the smallest number of nodes needed to reproduce this problem?
Does it happen with just two HCAs, one process per node?

>>>>>>> We are running a cluster that has a good number of older nodes with
>>>>>>> Mellanox IB HCAs that have the "mthca" device name ("ib_mthca" kernel
>>>>>>> module).
>>>>>>>
>>>>>>> These adapters are all at firmware level 4.8.917.
>>>>>>>
>>>>>>> The Open MPI in use is 1.5.3, kernel 2.6.39, x86-64.  Jobs are
>>>>>>> launched/managed using Slurm 2.2.7.  The IB software and drivers
>>>>>>> correspond to OFED 1.5.3.2, and I've verified that the kernel modules
>>>>>>> in use are all from this OFED version.

Let's get you to the latest firmware GA for this card.

Run "ibv_devinfo | grep board_id", and find the latest FW GA for your
device here:

http://www.mellanox.com/content/pages.php?pg=firmware_download

It has all the instructions on how to update the FW.

Also, please post here some more information about your HCA
("ibv_devinfo" output should do).

-- YK
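P.S. In case it helps, here is roughly what "ibv_devinfo" output looks
like on an mthca-based card.  The GUIDs, part id, and board_id below are
only placeholders, not values from your system, so please post your real
output:

    $ ibv_devinfo
    hca_id: mthca0
            transport:                      InfiniBand (0)
            fw_ver:                         4.8.917
            node_guid:                      0002:c902:xxxx:xxxx
            sys_image_guid:                 0002:c902:xxxx:xxxx
            vendor_id:                      0x02c9
            vendor_part_id:                 xxxxx
            hw_ver:                         0xA0
            board_id:                       MT_xxxxxxxxxx
            phys_port_cnt:                  1

The "board_id" string is what you match against the firmware download
page to find the correct FW image for your card.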