That would be great!

On Dec 17, 2009, at 3:52 PM, Ralph Castain wrote:

> If it would help, I have time and am willing to add notifier calls to this
> area of the code base. You'll still get the errors shown here, as I always
> bury the notifier call behind the error check that surrounds these error
> messages to avoid impacting the critical path, but you would be able to get
> syslog messages (or whatever channel you choose) as well.
>
> Ralph
>
> On Dec 10, 2009, at 4:19 PM, Jeff Squyres wrote:
>
> > On Dec 10, 2009, at 5:06 PM, Brock Palen wrote:
> >
> >> I would like to try out the notifier framework. The problem is that I am
> >> having trouble finding documentation for it; I am digging around the
> >> website and not finding much.
> >>
> >> Currently we have a problem where hosts are throwing up errors like:
> >> [nyx0891.engin.umich.edu][[25560,1],45][btl_tcp_endpoint.c:631:mca_btl_tcp_endpoint_complete_connect]
> >> connect() failed: Connection timed out (110)
> >
> > Yoinks. Any idea why this is happening?
> >
> >> When this happens, we would like to be notified, so we can put time
> >> stamps on events going on on the network. Is this even possible with
> >> the framework? See, we don't show any interfaces coming up and down,
> >> or any errors on interfaces, so we are looking to isolate the problem
> >> more. Only the MPI library knows when this happens.
> >
> > It's not well documented. So let's start here...
> >
> > The first issue is that we currently only have notifier calls down in the
> > openib BTL -- not any of the others. :-( We put them there because there
> > were specific requests to be notified when IB links went down. We then used
> > those as a request for comment from the community, asking "do you like
> > this? do you want more?" We kinda got nothing back, and I'll admit that we
> > kinda forgot about it -- and therefore never added notifier calls elsewhere
> > in the code. :-\
> >
> > We designed the notifier in OMPI to be trivially easy to use throughout the
> > code base -- it's just adding a single function call where the error
> > occurs. Would you, perchance, be interested in adding any of these in the
> > TCP BTL? I'd be happy to point you in the right direction... :-)
> >
> > After that, it's just a matter of enabling a notifier:
> >
> >     mpirun --mca notifier syslog ...
> >
> > Each notifier has some MCA params that are fairly obvious -- use:
> >
> >     ompi_info --param notifier all
> >
> > to see them. There are 3 notifier plugins:
> >
> > - command: execute any arbitrary command. It must run in finite (short)
> >   time. You use MCA params to set the command (we can pass some strings down
> >   to the command; see the ompi_info help string for more details), and set a
> >   timeout such that if the command runs for that many seconds without
> >   exiting, we'll kill it.
> >
> > - syslog: because it was simple to do -- we just output a string to the
> >   syslog.
> >
> > - twitter: because it was fun to do. ;-) Actually, the rationale was that
> >   you can tweet to a private feed and then slave an RSS reader to it to see
> >   if anything happens. It will need to be able to reach the general internet
> >   (i.e., twitter.com); proxies are not supported. Set your twitter
> >   username/password via MCA params.
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
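
For readers of the archive, here is a rough illustration of two ideas from this thread: keeping the notifier call behind the error check so the success path pays nothing, and the "command" notifier's kill-after-timeout behavior. This is only a sketch in Python, not Open MPI's C implementation; the function names `run_notifier_command` and `report_connect_failure` and the message format are hypothetical and chosen for illustration.

```python
import subprocess
import sys


def run_notifier_command(argv, timeout_s):
    """Run an external notifier command, killing it after timeout_s seconds.

    Returns the command's exit code, or None if it had to be killed.
    (Hypothetical helper; it mimics the behavior described for the
    "command" notifier, not its actual code.)
    """
    try:
        # subprocess.run() kills the child and raises TimeoutExpired
        # if the child has not exited within the timeout.
        done = subprocess.run(
            argv,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return done.returncode
    except subprocess.TimeoutExpired:
        return None


def report_connect_failure(errno_value, notify):
    """Call notify() only on the error path (hypothetical error handler).

    The success path returns immediately, so the notifier adds no cost
    to the critical path.
    """
    if errno_value == 0:
        return False
    notify(f"connect() failed: errno {errno_value}")
    return True


if __name__ == "__main__":
    # Example: print the message from a child process, with a 5 second limit.
    report_connect_failure(
        110,
        lambda msg: run_notifier_command(
            [sys.executable, "-c", "import sys; print(sys.argv[1])", msg], 5
        ),
    )
```

Any callable can be passed as `notify`, so the same error check could feed syslog, a command, or another channel, which is roughly the role the notifier plugins play.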