Hi Ifat,

Sorry for the late reply. Please see my inline comments on your questions.




It won’t solve the general problem of two different monitors that raise the 
same alarm

   [yinliyin] Generally, we would deploy only one monitor for a given alarm.

It won’t solve possible conflicts of timestamp and severity between different 
monitors

  [yinliyin] Please see the proposal below.

It will make the decision of when to delete the alarm more complex (delete it 
when the deduced alarm is deleted? When Nagios alarm is deleted? both? And how 
to change the timestamp and severity in these cases?)

 [yinliyin] Please see the proposal below.




   The basic idea for solving the problem in this situation is as follows:

       1.  In templates, we define the alarm entity only for the datasource
that reports the alarm, such as Nagios.

       2.  When the evaluator deduces an alarm, it raises the alarm with the
type set to the datasource that would report the alarm, not to vitrage.

       3.  When entity_graph gets an event from the "evaluator_queue" (all the
alarms in the "evaluator_queue" are deduced alarms), it queries the graph to
find out whether the same alarm has already been reported by the datasource.
If so, it discards the deduced alarm.

       4.  When entity_graph gets an event from the "queue", it queries the
graph to find out whether the same alarm has already been deduced by the
evaluator. If so, it replaces the alarm in the graph with the newly arrived
alarm reported by the datasource.

       5.  When the evaluator deduces that an alarm should be deleted, it
deletes the alarm regardless of how the alarm was generated (reported by the
datasource or deduced by the evaluator).

       6.  When a datasource reports a recover event for an alarm, entity_graph
queries the graph to find out whether the alarm exists. If it does not,
entity_graph discards the event.
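
   To make steps 2 to 6 concrete, here is a rough Python sketch. It is not
Vitrage code: the Alarm and AlarmGraph classes, the "origin" field and the
method names are all made up for illustration, and the graph is reduced to a
dictionary keyed by (type, name, resource). It also assumes that a recover
event for an existing alarm removes that alarm, which the steps above do not
state explicitly.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical origin markers; these are not real Vitrage constants.
DATASOURCE = "datasource"  # reported by a monitor such as Nagios
DEDUCED = "deduced"        # raised by the evaluator

AlarmKey = Tuple[str, str, str]


@dataclass
class Alarm:
    alarm_type: str   # step 2: deduced alarms carry the datasource type, e.g. "nagios"
    name: str
    resource_id: str
    origin: str
    severity: str = "WARNING"
    timestamp: float = 0.0

    @property
    def key(self) -> AlarmKey:
        # Two alarms are "the same" when type, name and resource all match.
        return (self.alarm_type, self.name, self.resource_id)


class AlarmGraph:
    """Toy stand-in for the entity graph: at most one alarm per key."""

    def __init__(self) -> None:
        self.alarms: Dict[AlarmKey, Alarm] = {}

    def on_evaluator_raise(self, alarm: Alarm) -> bool:
        # Step 3: discard the deduced alarm if the datasource already reported it.
        existing = self.alarms.get(alarm.key)
        if existing is not None and existing.origin == DATASOURCE:
            return False
        self.alarms[alarm.key] = alarm
        return True

    def on_datasource_raise(self, alarm: Alarm) -> None:
        # Step 4: the datasource alarm replaces any deduced alarm with the same key.
        self.alarms[alarm.key] = alarm

    def on_evaluator_delete(self, key: AlarmKey) -> bool:
        # Step 5: delete the alarm regardless of its origin.
        return self.alarms.pop(key, None) is not None

    def on_datasource_recover(self, key: AlarmKey) -> bool:
        # Step 6: if the alarm does not exist, the event is discarded (False).
        # Assumption: if it does exist, the recover event removes it.
        return self.alarms.pop(key, None) is not None


graph = AlarmGraph()
deduced = Alarm("nagios", "CPU load", "host-1", DEDUCED, timestamp=1.0)
reported = Alarm("nagios", "CPU load", "host-1", DATASOURCE, "CRITICAL", 2.0)

graph.on_evaluator_raise(deduced)      # stored as a deduced alarm
graph.on_datasource_raise(reported)    # replaces the deduced alarm
graph.on_evaluator_raise(deduced)      # discarded, the datasource alarm wins
```

   With this arrangement the timestamp and severity always come from the
datasource alarm once one exists, so the graph never holds two vertices for
the same alarm.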

 
NIV Shanghai Dept. V/Wireless Product R&D Institute/Wireless Product Operation









D502, ZTE Corporation R&D Center, 889# Bibo Road, 
Zhangjiang Hi-tech Park, Shanghai, P.R.China, 201203 
T: +86 21 68896229
M: +86 13641895907 
E: yinli...@zte.com.cn
www.zte.com.cn


Original Mail

From: <ifat.a...@nokia.com>
To: <openstack-dev@lists.openstack.org>; 殷力殷10011231
Cc: 韩静00006838; 王维雅00042110; 章宇军10200531; 贾培源10101785; 龚亚辉6092001895
Date: 2017-01-12 17:08
Subject: Re: [openstack-dev] [Vitrage] About alarms reported by datasource and the 
alarms generated by vitrage evaluator






Hi Yujun,


 


See my comments inline.


 


Ifat.


 



From: Yujun Zhang <zhangyujun+...@gmail.com>
 Date: Wednesday, 11 January 2017 at 12:12
 
 



 


I have just realized abstract alarm is not a good term. What I was talking 
about is fault and alarm. 


 


Fault is what actually happens, and alarm is how it is detected (or deduced).


 


 


On Wed, Jan 11, 2017 at 5:13 PM Yujun Zhang <zhangyujun+...@gmail.com> wrote:


 


I think YinLiYin's idea is a reasonable requirement from end users. They care 
more about the real faults in the system than about how they are detected. 
Although it will pose considerable design and engineering challenges, it 
creates value for customers. I'm quite positive about this evolution.


 


[Ifat] Of course. I never argued about the need, just tried to figure out how 
we should implement it.


 


One possible solution would be to introduce a high-level (abstract) template 
from the user's point of view, then convert it to Vitrage scenario templates 
(or directly to the graph). The more sources (nagios, vitrage deduction) we get 
from the system for an abstract alarm, the more confident we can be that there 
is a real fault. The confidence of an alarm could then be included in the 
scenario condition.


 


[Ifat] I understand your idea, but I'm not sure yet whether it helps with the 
use case.


How would you imagine the ‘confidence’ property? As a Boolean or a counter? One 
option is ‘deduced’ vs. ‘monitored’. Another option is to count the number of 
monitors that reported it. Personally, I don’t think this is needed. I think 
that if Nagios reports an error, then it is confident enough without getting 
it from another monitor.
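
To compare the two options for the ‘confidence’ property, here is a small 
Python sketch. It is only an illustration: the AlarmSources class and its 
fields are hypothetical and not part of Vitrage, and "vitrage" is assumed to 
be the source name used for deduced alarms.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class AlarmSources:
    """Hypothetical record of which sources raised a given alarm."""

    sources: Set[str] = field(default_factory=set)

    def report(self, source: str) -> None:
        self.sources.add(source)

    @property
    def is_monitored(self) -> bool:
        # Boolean option: 'monitored' if any source other than the evaluator
        # reported the alarm, otherwise 'deduced'.
        return any(source != "vitrage" for source in self.sources)

    @property
    def source_count(self) -> int:
        # Counter option: number of distinct sources that reported the alarm.
        return len(self.sources)


alarm = AlarmSources()
alarm.report("vitrage")   # deduced by the evaluator
alarm.report("nagios")    # later confirmed by a monitor
```

Both values can be derived from the same set of source names, so a template 
condition could use either one without storing anything extra.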
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
