[openstack-dev] [AODH] event-alarm timeout discussion

Zhai, Edwin Tue, 20 Sep 2016 22:51:44 -0700

All,

I'd like to clarify the event-alarm timeout design, as many of you seem to have misunderstood it. Please correct me if I have made any mistakes.


I realized that there are two different things here, which we sometimes mix up:
1. event-timeout-alarm

This is a new type of alarm that brackets *.start and *.end events and fires when it receives *.start but no *.end within the timeout. One such alarm handles one type of event/action, e.g. create one alarm for instance creation, and all instances created in the future will be handled by this alarm. This is not for real time, so it's acceptable for the user to learn of an instance creation failure within 5 mins.

This new type of alarm can be implemented by one worker that checks the DB periodically to do the statistical work. That is, the new evaluator works in 'polling' mode, something like the threshold alarm evaluator.


One BP is at:
https://review.openstack.org/#/c/199005/

2. event-alarm timeout

This is a new feature for the _existing_ event-alarm evaluator. An alarm becomes 'UNALARM' when the desired event is not received within the timeout. This feature handles just one specific event, e.g. create one alarm for instance ABC's XYZ operation with a 5s timeout, and the user is notified immediately at 5s if no XYZ.done event comes. If we want to check another instance, we need to create another alarm.

This is used in the telco scenario, where the operator wants to know about an operation failure in real time.

My patch (https://review.openstack.org/#/c/272028/) is for this purpose only, but I feel many people (sometimes even me) confuse the two, as they look similar. So my question is: do you think this telco usage model of event-alarm timeout is valid? If not, we can avoid discussing its implementation and ignore the following.



=========== event-alarm timeout implementation =============

As it's for event-alarm, we need to keep it event-driven. Furthermore, for quick response, we need to use an event for timeout handling. A periodic worker can't meet the real-time requirement.

A separate queue for 'alarm.timeout.end' (which indicates that the timeout expired) leads to a tricky race condition, e.g. 'XYZ.done' comes in queue1 and 'alarm.timeout.end' comes in queue2, so they are handled in parallel:

1. In queue1, 'XYZ.done' is checked against the alarm (currently UNKNOWN), and the alarm will be set to ALARM in the next step.
2. In queue2, 'alarm.timeout.end' is checked against the same alarm (currently UNKNOWN), and the alarm will be set to OK (UNALARM) in the next step.
3. In queue1, the alarm transition happens: UNKNOWN => ALARM
4. In queue2, another alarm transition happens: ALARM => OK (UNALARM)

So this alarm has a bogus transition: UNKNOWN => ALARM => UNALARM, which tells the user that the required event came, and then that no required event came.
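The four steps above can be replayed deterministically with a small sketch. This is only an illustration, not the evaluator's code: the `Alarm` class, `may_transition` and `replay_race` are hypothetical names, and the two "queues" are simulated by interleaving their check and write steps by hand instead of using real threads.

```python
UNKNOWN, ALARM, OK = "UNKNOWN", "ALARM", "OK"


class Alarm:
    """Hypothetical alarm that records every state it has been in."""

    def __init__(self):
        self.state = UNKNOWN
        self.history = [UNKNOWN]

    def set_state(self, new_state):
        self.state = new_state
        self.history.append(new_state)


def may_transition(alarm):
    """The 'check' half of a handler: only an UNKNOWN alarm may change."""
    return alarm.state == UNKNOWN


def replay_race():
    """Interleave the two handlers exactly as in steps 1 to 4."""
    alarm = Alarm()
    q1_allowed = may_transition(alarm)   # step 1: queue1 checks 'XYZ.done'
    q2_allowed = may_transition(alarm)   # step 2: queue2 checks the timeout
    if q1_allowed:
        alarm.set_state(ALARM)           # step 3: UNKNOWN => ALARM
    if q2_allowed:
        alarm.set_state(OK)              # step 4: ALARM => OK (UNALARM)
    return alarm.history
```

Calling `replay_race()` returns `['UNKNOWN', 'ALARM', 'OK']`: both handlers passed their check before either one wrote, so the second write is applied on top of the first.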

If we put all events in one queue, the evaluator handles them one by one (the low-level oslo.messaging should be multi-threaded), so the second event would see that the alarm state is no longer UNKNOWN and give up its transition. As Gordc said, it's slow. But only a very small part of the event-alarms need timeout handling, as it's only for the telco usage model.
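A minimal sketch of the single-queue behaviour, assuming one consumer that finishes each event (check plus write) before taking the next. The function name `evaluate_serially` and the use of Python's standard `queue.Queue` in place of the real messaging layer are assumptions made for illustration.

```python
import queue


def evaluate_serially(event_types):
    """Process events one by one and return the alarm's state history."""
    q = queue.Queue()
    for event_type in event_types:
        q.put(event_type)

    state = "UNKNOWN"
    history = [state]
    while not q.empty():
        event_type = q.get()
        if state != "UNKNOWN":
            # A previous event already decided the alarm; give up.
            continue
        # Check and write happen together, with no other handler in between.
        state = "ALARM" if event_type == "XYZ.done" else "OK"
        history.append(state)
    return history
```

With this ordering guarantee, `evaluate_serially(["XYZ.done", "alarm.timeout.end"])` yields `['UNKNOWN', 'ALARM']` and the reverse order yields `['UNKNOWN', 'OK']`; whichever event is dequeued first wins and the other is dropped.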

One possible improvement, as JD pointed out, is to avoid spawning so many threads. We can create just one thread inside the evaluator and ask this thread to handle all timeout requests from the evaluator. Is this acceptable as an event-alarm timeout solution?
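One way the single-thread idea could look is sketched below, using only the Python standard library. It is not taken from the patch: `TimeoutWorker`, `schedule`, `cancel` and the `on_timeout` callback are hypothetical names. The one thread sleeps until the earliest deadline in a heap and then invokes the callback for that alarm.

```python
import heapq
import itertools
import threading
import time


class TimeoutWorker:
    """One background thread that serves all timeout requests."""

    def __init__(self, on_timeout):
        self._on_timeout = on_timeout      # called with the alarm id
        self._heap = []                    # (deadline, token, alarm_id)
        self._live = set()                 # tokens not yet fired or cancelled
        self._tokens = itertools.count()
        self._cond = threading.Condition()
        self._stopped = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def schedule(self, alarm_id, delay):
        """Request a timeout after `delay` seconds; returns a cancel token."""
        with self._cond:
            token = next(self._tokens)
            self._live.add(token)
            deadline = time.monotonic() + delay
            heapq.heappush(self._heap, (deadline, token, alarm_id))
            self._cond.notify()            # the earliest deadline may have changed
            return token

    def cancel(self, token):
        """Drop a pending timeout, e.g. because the desired event arrived."""
        with self._cond:
            self._live.discard(token)

    def stop(self):
        with self._cond:
            self._stopped = True
            self._cond.notify()
        self._thread.join()

    def _run(self):
        while True:
            with self._cond:
                while not self._stopped:
                    if not self._heap:
                        self._cond.wait()
                        continue
                    remaining = self._heap[0][0] - time.monotonic()
                    if remaining <= 0:
                        break
                    self._cond.wait(remaining)
                if self._stopped:
                    return
                _, token, alarm_id = heapq.heappop(self._heap)
                if token not in self._live:
                    continue               # cancelled before it expired
                self._live.discard(token)
            # Run the callback outside the lock so it cannot block schedule().
            self._on_timeout(alarm_id)
```

In the evaluator, `on_timeout` would presumably just emit 'alarm.timeout.end' into the same queue as the other events, so that the state change itself still goes through the serialized path discussed above.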



Best Rgds,
Edwin

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
