On Dec 10, 2009, at 5:06 PM, Brock Palen wrote:

> I would like to try out the notifier framework. The problem is that I am
> having trouble finding documentation for it; I am digging around the
> website and not finding much.
> 
> Currently we have a problem where hosts are throwing up errors like:
> [nyx0891.engin.umich.edu][[25560,1],45][btl_tcp_endpoint.c:631:mca_btl_tcp_endpoint_complete_connect]
> connect() failed: Connection timed out (110)

Yoinks.  Any idea why this is happening?

> We would like to be notified when this happens, so we can put
> timestamps on events happening on the network.  Is this even possible
> with the framework?  We don't see any interfaces coming up and down,
> or any errors on interfaces, so we are looking to isolate the problem
> more.  Only the MPI library knows when this happens.

It's not well documented.  So let's start here...

The first issue is that we currently only have notifier calls down in the 
openib BTL -- not any of the others.  :-(  We put it there because there were 
specific requests to be notified when IB links went down.  We then used those 
as a request for comment from the community, asking "do you like this? do you 
want more?"  We kinda got nothing back, and I'll admit that we kinda forgot 
about it -- and therefore never added notifier calls elsewhere in the code.  :-\

We designed the notifier in OMPI to be trivially easy to use throughout the 
code base -- it's just adding a single function call where the error occurs.  
Would you, perchance, be interested in adding any of these in the TCP BTL?  I'd 
be happy to point you in the right direction... :-)

After that, it's just a matter of enabling a notifier:

  mpirun --mca notifier syslog ...

Each notifier has some MCA params that are fairly obvious -- use:

  ompi_info --param notifier all

to see them.  There are 3 notifier plugins:

- command: execute any arbitrary command.  It must run in finite (short) time.  
You use MCA params to set the command (we can pass some strings down to the 
command; see the ompi_info help string for more details), and set a timeout 
such that if the command runs for that many seconds without exiting, we'll kill 
it.

- syslog: because it was simple to do -- we just output a string to the syslog.

- twitter: because it was fun to do.  ;-)  Actually, the rationale was that you 
can tweet to a private feed and then slave an RSS reader to it to see if 
anything happens.  It will need to be able to reach the general internet (i.e., 
twitter.com); proxies are not supported.  Set your twitter username/password 
via MCA params.
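
For the timestamping use case, here is a rough sketch of the kind of script the command notifier could run.  Everything named in it is an assumption for illustration: the log_event helper, the NOTIFY_LOG variable, the default log path, and the idea that the message text arrives as command-line arguments.  Check the ompi_info help strings for how the strings are actually passed to the command.

```shell
#!/bin/sh
# Hypothetical script for the "command" notifier: append one timestamped
# line per event to a log file.  It does a single append and exits, so it
# stays well within any short timeout.

# log_event LOGFILE MESSAGE...
# Writes "<UTC timestamp> <hostname> <message>" to LOGFILE.
log_event() {
    logfile=$1
    shift
    printf '%s %s %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "$(uname -n)" \
        "$*" >> "$logfile"
}

# Assumption: the notifier passes the message as arguments.  NOTIFY_LOG is
# a made-up environment variable for overriding the log location.
if [ "$#" -gt 0 ]; then
    log_event "${NOTIFY_LOG:-/tmp/ompi-notifier.log}" "$@"
fi
```

Because each line carries a UTC timestamp and the node name, the log from several hosts can be merged and compared against switch or interface logs from the same time window.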

-- 
Jeff Squyres
jsquy...@cisco.com

