On Jan 8, 2007, at 9:34 PM, Reese Faucette wrote:

Right, that's the maximum number of open MX channels, i.e. processes
that can run on the node using MX. With MX (1.2.0c I think), I get
weird messages if I run a second mpirun quickly after the first one
failed. The Myrinet guys, I'm quite sure, can explain why and how.
Somehow, when an application segfaults while the MX port is open,
things are not cleaned up right away. It takes a few seconds (not more
than one minute) to have everything running correctly after that.

Supposedly I am a "myrinet guy" ;-) Yeah, the endpoint cleanup stuff could take a few seconds after an ungraceful exit. But if you're getting some behavior that looks like something you ought not to be getting, please let us know!

I think what I get makes sense. If I loop in a script starting mpiruns and one of the runs segfaults, the next one is usually unable to open the MX endpoints. That happens only if I run 4 processes per node, where 4 is the number of instances reported by mx_info. If I put a sleep of 30 seconds between my runs, then everything runs just fine.
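
For illustration only, here is a rough Python sketch of that kind of loop. It is not the actual script: the function name `run_with_cooldown` and its parameters are made up for this example. It also varies the workaround slightly by pausing only after a run that exits abnormally, on the assumption that endpoint cleanup is only delayed after an ungraceful exit.

```python
import subprocess
import time


def run_with_cooldown(commands, cooldown_s=30.0, run=subprocess.run, sleep=time.sleep):
    """Run each command in order and return the list of exit codes.

    After any run that exits with a non-zero code (a segfault shows up as a
    negative code from subprocess on POSIX), wait cooldown_s seconds before
    starting the next one, so that leftover endpoints have time to be released.
    The run and sleep arguments are hooks for testing (hypothetical, not part
    of any MX or Open MPI interface).
    """
    codes = []
    for index, cmd in enumerate(commands):
        result = run(cmd)
        codes.append(result.returncode)
        is_last = index == len(commands) - 1
        if result.returncode != 0 and not is_last:
            # The previous run died; give the cleanup some time.
            sleep(cooldown_s)
    return codes
```

It could be called as `run_with_cooldown([["mpirun", "-np", "4", "./my_app"]] * 10)`, where `./my_app` is a placeholder for the real binary. The 30-second default simply mirrors the sleep mentioned above; the right value probably depends on the MX version.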

  george.

-reese
Myricom, Inc.


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
