On Mon, Aug 23, 2010 at 6:39 PM, Richard Treumann <treum...@us.ibm.com> wrote:
> It is hard to imagine how a total data load of 41,943,040 bytes could be a
> problem. That is really not much data. By the time the BCAST is done, each
> task (except root) will have received a single half meg message from one
> sender. That is not much.
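
As a quick sanity check on the figure quoted above, here is a minimal sketch of the arithmetic. It assumes "half meg" means 512 KiB (524,288 bytes); that size and the helper name message_count are my own assumptions, not something stated in the thread.

```python
# Sketch only: checks how many "half meg" messages make up the quoted total.
# ASSUMPTION: "half meg" = 512 KiB = 524,288 bytes (not stated in the thread).

TOTAL_BYTES = 41_943_040        # total data load quoted above
MESSAGE_BYTES = 512 * 1024      # assumed size of one broadcast message


def message_count(total_bytes: int, message_bytes: int) -> int:
    """Return how many whole messages of message_bytes fit total_bytes exactly."""
    count, remainder = divmod(total_bytes, message_bytes)
    if remainder:
        raise ValueError("total is not a whole number of messages")
    return count


if __name__ == "__main__":
    print(message_count(TOTAL_BYTES, MESSAGE_BYTES))  # number of messages
    print(TOTAL_BYTES / 2**20)                        # total in MiB
```

Under that assumption the total is exactly 40 MiB spread over 80 messages, which supports the point that the volume alone should not stress the network.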

Thanks very much for your comments, Dick! I'm somewhat new to MPI, so I
appreciate all the advice I can get. My main roadblock is that I'm not sure
how to attack this problem further. How can I obtain more diagnostic
output to help me trace the origin of this "broadcast stall"?
So far I've obtained a stack trace via padb (
http://dl.dropbox.com/u/118481/padb.log.new.new.txt ) but that is
about all.

Any suggestions as to what else I could try? Would a full dump by
something like tcpdump or wireshark on the packets passing the network
be of any relevance? Or is there something useful to be known from the
switch side? The technology is fairly new for HPC (Chelsio 10GigE
adapters + Cisco Nexus5000 switches). So I wouldn't rule out some
strange hardware or firmware bug that's tickled by this particular
suite of tests. I'm grasping at straws here.

 [ On the other hand I'm fairly new so I wouldn't rule out some silly
setting by me as well. ]

-- 
Rahul
