That would be great!

On Dec 17, 2009, at 3:52 PM, Ralph Castain wrote:

> If it would help, I have time and am willing to add notifier calls to this 
> area of the code base. You'll still get the errors shown here, since I always 
> bury the notifier call behind the error check that surrounds these error 
> messages to avoid impacting the critical path, but you would also be able to 
> get syslog messages (or whatever channel you choose).
> 
> Ralph
> 
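A minimal sketch of the pattern Ralph describes above, where the notifier call sits inside the error branch so the success path pays nothing extra. Everything in it is an assumption made for illustration: `complete_connect` and `notify_connect_failure` are hypothetical names, plain syslog(3) stands in for the notifier framework, and none of this is the actual TCP BTL code or the real notifier API.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <syslog.h>

/* Hypothetical helper (not an Open MPI function): formats the failure
 * into buf and forwards it to syslog. Returns snprintf's result. */
static int notify_connect_failure(const char *host, int err,
                                  char *buf, size_t len)
{
    assert(host != NULL && buf != NULL && len > 0);
    int n = snprintf(buf, len, "connect() to %s failed: %s (%d)",
                     host, strerror(err), err);
    syslog(LOG_ERR, "%s", buf);
    return n;
}

/* Hypothetical stand-in for a connect-completion check. so_error plays
 * the role of the value read back from the socket after connect().
 * Returns 0 on success, -1 on failure. */
static int complete_connect(const char *host, int so_error,
                            char *buf, size_t len)
{
    if (so_error != 0) {
        /* Error path only: keep the existing message, then notify. */
        fprintf(stderr, "connect() failed: %s (%d)\n",
                strerror(so_error), so_error);
        notify_connect_failure(host, so_error, buf, len);
        return -1;
    }
    return 0; /* Success path: no notifier work at all. */
}
```

With a nonzero `so_error` such as `ETIMEDOUT`, the sketch prints the usual error and also logs it at `LOG_ERR`; with `so_error` equal to 0 it returns immediately without touching syslog.
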
> On Dec 10, 2009, at 4:19 PM, Jeff Squyres wrote:
> 
> > On Dec 10, 2009, at 5:06 PM, Brock Palen wrote:
> >
> >> I would like to try out the notifier framework, but I am having  
> >> trouble finding documentation for it.  I am digging around the website  
> >> and not finding much.
> >>
> >> Currently we have a problem where hosts are throwing up errors like:
> >> [nyx0891.engin.umich.edu][[25560,1],45][btl_tcp_endpoint.c:631:mca_btl_tcp_endpoint_complete_connect] connect() failed: Connection timed out (110)
> >
> > Yoinks.  Any idea why this is happening?
> >
> >> We would like to be notified when this happens, so we can put time
> >> stamps on events happening on the network.  Is this even possible with
> >> the framework?  We don't see any interfaces coming up and down,
> >> or any errors on interfaces, so we are looking to isolate the problem
> >> further.  Only the MPI library knows when this happens.
> >
> > It's not well documented.  So let's start here...
> >
> > The first issue is that we currently only have notifier calls down in the 
> > > openib BTL -- not any of the others.  :-(  We put it there because there 
> > > were specific requests to be notified when IB links went down.  We then used 
> > those as a request for comment from the community, asking "do you like 
> > this? do you want more?"  We kinda got nothing back, and I'll admit that we 
> > kinda forgot about it -- and therefore never added notifier calls elsewhere 
> > in the code.  :-\
> >
> > We designed the notifier in OMPI to be trivially easy to use throughout the 
> > code base -- it's just adding a single function call where the error 
> > occurs.  Would you, perchance, be interested in adding any of these in the 
> > TCP BTL?  I'd be happy to point you in the right direction... :-)
> >
> > After that, it's just a matter of enabling a notifier:
> >
> >  mpirun --mca notifier syslog ...
> >
> > Each notifier has some MCA params that are fairly obvious -- use:
> >
> >  ompi_info --param notifier all
> >
> > > to see them.  There are three notifier plugins:
> >
> > - command: execute any arbitrary command.  It must run in finite (short) 
> > time.  You use MCA params to set the command (we can pass some strings down 
> > to the command; see the ompi_info help string for more details), and set a 
> > timeout such that if the command runs for that many seconds without 
> > exiting, we'll kill it.
> >
> > - syslog: because it was simple to do -- we just output a string to the 
> > syslog.
> >
> > - twitter: because it was fun to do.  ;-)  Actually, the rationale was that 
> > you can tweet to a private feed and then slave an RSS reader to it to see 
> > if anything happens.  It will need to be able to reach the general internet 
> > (i.e., twitter.com); proxies are not supported.  Set your twitter 
> > username/password via MCA params.
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
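The command notifier quoted above runs a user command and kills it if it has not exited within the timeout. A rough sketch of that behaviour using only POSIX calls is below. It is an assumption about how such a timeout could be written, not the plugin's actual code; `run_with_timeout` and its return codes are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: runs cmd through /bin/sh and waits up to
 * timeout_sec seconds. Returns the command's exit status, -1 if
 * fork/wait fails or the child died from a signal, and -2 if the
 * child had to be killed because it ran past the timeout. */
static int run_with_timeout(const char *cmd, int timeout_sec)
{
    assert(cmd != NULL && timeout_sec >= 0);

    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127); /* Only reached if exec itself failed. */
    }

    time_t deadline = time(NULL) + timeout_sec;
    struct timespec pause = { 0, 50 * 1000 * 1000 }; /* 50 ms */
    int status = 0;

    for (;;) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid) {
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        }
        if (r < 0) {
            return -1;
        }
        if (time(NULL) >= deadline) {
            kill(pid, SIGKILL);       /* Timeout reached: kill it. */
            waitpid(pid, &status, 0); /* Reap so no zombie is left. */
            return -2;
        }
        nanosleep(&pause, NULL);      /* Poll without spinning. */
    }
}
```

Polling with `WNOHANG` keeps the sketch free of signal handlers; a real implementation might prefer `SIGCHLD` or a timer. Because `time()` has one-second resolution, the effective timeout can be up to a second shorter than requested.
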


-- 
Jeff Squyres
jsquy...@cisco.com

