On Mon, Aug 23, 2010 at 6:39 PM, Richard Treumann <treum...@us.ibm.com> wrote:
> It is hard to imagine how a total data load of 41,943,040 bytes could be a
> problem. That is really not much data. By the time the BCAST is done, each
> task (except root) will have received a single half meg message from one
> sender. That is not much.
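As a quick sanity check on the figures quoted above, here is a small sketch of the arithmetic. It assumes "half meg" means 512 KiB (524,288 bytes), which the quoted message does not spell out; the function name is purely illustrative.

```python
# Assumption: "half meg" is taken to mean 512 KiB = 524,288 bytes.
HALF_MEG = 512 * 1024


def receivers_for_total(total_bytes, message_bytes=HALF_MEG):
    """Return how many receivers a total load implies if each gets one message."""
    count, remainder = divmod(total_bytes, message_bytes)
    if remainder:
        raise ValueError("total is not a whole number of messages")
    return count


if __name__ == "__main__":
    # Total data load quoted in the message above.
    print(receivers_for_total(41_943_040))
```

Under that assumption the quoted total divides evenly into 80 half-meg messages, which is consistent with the point that no single task receives much data.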
Thanks very much for your comments, Dick! I'm somewhat new to MPI, so I
appreciate all the advice I can get.

My main roadblock is that I'm not sure how to attack this problem further.
How can I obtain more diagnostic output to help me trace the origin of this
"broadcast stall"? So far I've obtained a stack trace via padb
( http://dl.dropbox.com/u/118481/padb.log.new.new.txt ), but that is about
all.

Any suggestions as to what else I could try? Would a full capture of the
packets crossing the network, with something like tcpdump or wireshark, be
of any relevance? Or is there something useful to be learned from the switch
side?

The technology is fairly new for HPC (Chelsio 10GigE adapters + Cisco
Nexus5000 switches), so I wouldn't rule out some strange hardware or firmware
bug that's tickled by this particular suite of tests. I'm grasping at straws
here.

[ On the other hand, I'm fairly new, so I wouldn't rule out some silly
setting on my part either. ]

--
Rahul