On Nov 21, 2006, at 1:27 PM, Brock Palen wrote:
I had sent a message two weeks ago about this problem and talked with
Jeff at SC06 about how it might not be an OMPI problem. But it now
appears, working with Myricom, that it is a problem in both
lam-7.1.2 and openmpi-1.1.2/1.1.1. Basically, the results from an HPL
run are wrong, and the problem also causes a large number of packets
to be dropped by the fabric.
This problem does not happen when using MPICH-GM; the number of
dropped packets does not go up. There is a ticket open with Myricom
on this. They are a member of the group working on OMPI, but I sent
this out just to bring the list up to date.
If you have any questions, feel free to ask me. The details are in
the archive.
Brock Palen
Hi all,
I am working on this ticket at Myricom.
I am using Linux nodes since we do not have two OS X 10.3 machines
available. Each node has 1 GB of RAM and two Myrinet PCI-X cards: a
single-port D card and a dual-port E card. I have disabled the E
card. I am using GM-2.0.26 and Open MPI 1.2b1.
I am running HPCC, which includes HPL as well as other benchmarks.
Using Brock's HPL.dat values in my hpccinf.txt, I do not see any
failed HPL runs. I do see some runs hang and require a reboot (the
machine becomes unresponsive), but the hang may happen in the HPL
portion of the run or in another benchmark.
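For anyone trying to reproduce the setup: hpccinf.txt follows the
standard HPL.dat input layout, so the relevant problem-size and
process-grid lines look roughly like the excerpt below. These values
are only illustrative placeholders, not Brock's actual inputs (those
are in the archive).

    1            # of problems sizes (N)
    10000        Ns
    1            # of NBs
    128          NBs
    0            PMAP process mapping (0=Row-,1=Column-major)
    1            # of process grids (P x Q)
    1            Ps
    2            Qs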
My last few runs all completed successfully without hanging. The job
I am currently running just hung one node (it responds to ping, but I
cannot ssh into it or use any terminals connected to it).
There are no messages in dmesg, and vmstat showed that the node was
not swapping before it hung.
Any ideas where I should look next?
Scott