Meanwhile, much later -- you'll sympathize: Did you have any joy with this?
You wrote:
> These messages appeared when running IMB compiled with openmpi 1.6.1
> across 256 cores (16 nodes, 16 cores per node). The job ran from
> 09:56:54 to 10:08:46 and failed with no obvious error messages.

I don't know about the messages, but there are successful 256-core runs in my ~/imb, one with default parameters, and also a 512-core one. The 256-core ones were accounted ~330GB, and the default h_vmem is still 1G. That's not the cause of the failure, is it?

For the kernel issue, do you actually have the same adaptors under RH5 to compare? lspci says our current QDR ones are the same as yours (surprisingly), and they're OK with the openib params from the default Mellanox OFED setup. They're on an older OFED (not vanilla RH) because of our horrible hardware mixture. With the new kit I can't do more than access a head node. That also has the same adaptors, but I don't know whether MPI runs there show your symptom. The modprobe config is different from yours, but the driver is older.

If it might help, I could poke around the node images and send config files (something like the checks sketched below), but I don't know what the various images are. If I were a beginner, I might have hoped you wouldn't have to sort this out yourself...
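
For concreteness, the sort of thing I'd pull off a node image to compare is roughly the following (the config paths are a guess, since I don't know how your images are laid out):

    lspci -nn | grep -i mellanox              # confirm the HCA model/revision really matches
    ofed_info -s                              # which OFED release the image carries
    ibv_devinfo                               # firmware level and port state as the driver sees them
    cat /etc/modprobe.conf /etc/modprobe.d/*  # wherever the mlx4/openib module options live on that image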