All,

I'd like to clarify the event-alarm timeout design, since I think there is some misunderstanding about it. Please correct me if I've made any mistakes.

I realized that there are two different things here, but we sometimes mix them up:
1. event-timeout-alarm
This is a new type of alarm that brackets *.start and *.end events and fires when a *.start is received but no matching *.end arrives within the timeout. One such alarm handles a whole type of event/action: e.g. create one alarm for instance creation, and every instance created in the future is handled by that alarm. It is not meant to be real time, so it's acceptable for the user to learn of an instance creation failure within 5 minutes.

This new type of alarm can be implemented by a worker that checks the DB periodically and does the statistics work. That is, the new evaluator works in 'polling' mode, similar to the threshold alarm evaluator.
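To make the 'polling' idea concrete, here is a minimal sketch of what one polling pass might compute. It is only an illustration: the Event class and the find_timed_out() helper are hypothetical names, the events are passed in as a plain list rather than read from the real DB, and an *.end that arrives late simply clears the entry on the next pass.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_type: str    # e.g. "compute.instance.create.start"
    resource_id: str   # e.g. the instance id
    timestamp: float   # seconds


def find_timed_out(events, prefix, timeout, now):
    """Return ids that saw <prefix>.start but no <prefix>.end within timeout."""
    pending = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.event_type == prefix + ".start":
            pending[ev.resource_id] = ev.timestamp
        elif ev.event_type == prefix + ".end":
            # The bracket is closed, so this resource is no longer pending.
            pending.pop(ev.resource_id, None)
    return sorted(rid for rid, ts in pending.items() if now - ts >= timeout)


events = [
    Event("compute.instance.create.start", "vm-1", 0.0),
    Event("compute.instance.create.end", "vm-1", 20.0),
    Event("compute.instance.create.start", "vm-2", 10.0),
]
timed_out = find_timed_out(events, "compute.instance.create", 300, now=400.0)
print(timed_out)
```

A periodic worker would run something like this once per interval, which is why the notification delay is bounded by the polling period rather than being real time.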

One BP for this is at:
https://review.openstack.org/#/c/199005/

2. event-alarm timeout
This is a new feature for the _existing_ event-alarm evaluator. An alarm becomes 'UNALARM' when the desired event is not received within the timeout. This feature handles just one specific event: e.g. create an alarm for instance ABC's XYZ operation with a 5s timeout, and the user is notified immediately once 5s pass with no XYZ.done event. To check another instance, we need to create another alarm.

This is used in the telco scenario, where the operator wants to know about an operation failure in real time.

My patch (https://review.openstack.org/#/c/272028/) is for this purpose only, but I feel many people (sometimes even me) confuse the two, as they look similar. So my question is: do you think this telco usage model of event-alarm timeout is valid? If not, we can skip discussing its implementation and you can ignore the following.


=========== event-alarm timeout implementation =============
As it's for event-alarm, we need to keep it event-driven. Furthermore, for quick response, we need to use an event for timeout handling. A periodic worker can't meet the real-time requirement.

A separate queue for 'alarm.timeout.end' (which indicates that the timeout has expired) leads to a tricky race condition. E.g. 'XYZ.done' arrives in queue1 and 'alarm.timeout.end' arrives in queue2, so they are handled in parallel:

1. In queue1, 'XYZ.done' is checked against the alarm (currently UNKNOWN), which will be set to ALARM in the next step.
2. In queue2, 'alarm.timeout.end' is checked against the same alarm (currently UNKNOWN), which will be set to OK (UNALARM) in the next step.
3. In queue1, the alarm transition happens: UNKNOWN => ALARM
4. In queue2, another alarm transition happens: ALARM => OK (UNALARM)

So this alarm has a bogus transition, UNKNOWN => ALARM => UNALARM, which tells the user that the required event came and then that the required event did not come.
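The four steps above can be replayed deterministically with a small sketch. This is not the evaluator code; the Alarm class and handler() generator are hypothetical stand-ins that split each handler into a "check" step and a "transition" step so the interleaving can be forced by hand.

```python
class Alarm:
    def __init__(self):
        self.state = "UNKNOWN"
        self.history = ["UNKNOWN"]

    def set_state(self, new_state):
        self.state = new_state
        self.history.append(new_state)


def handler(alarm, new_state):
    """Two-step handler: first read the state, later commit the transition."""
    seen = alarm.state          # the "check" step
    yield
    if seen == "UNKNOWN":       # decision is based on the (stale) earlier read
        alarm.set_state(new_state)
    yield


alarm = Alarm()
queue1 = handler(alarm, "ALARM")   # handling 'XYZ.done'
queue2 = handler(alarm, "OK")      # handling 'alarm.timeout.end'

next(queue1)   # step 1: queue1 checks, sees UNKNOWN
next(queue2)   # step 2: queue2 checks, sees UNKNOWN
next(queue1)   # step 3: UNKNOWN => ALARM
next(queue2)   # step 4: ALARM => OK (UNALARM)

print(alarm.history)   # ['UNKNOWN', 'ALARM', 'OK']
```

Both handlers decide based on the UNKNOWN state they read earlier, so the second transition overwrites the first.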

If we put all events in one queue, the evaluator handles them one by one (the low-level oslo messaging layer should be multi-threaded), so the second event would see that the alarm state is no longer UNKNOWN and give up its transition. As Gordc said, it's slow. But only a very small part of the event-alarms need timeout handling, as it's only for the telco usage model.
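As a rough illustration of the single-queue behaviour, the sketch below drains one queue sequentially. The evaluate() and drain() functions are hypothetical and reduce the evaluator to a single rule: only an alarm that is still UNKNOWN may change state.

```python
import queue


def evaluate(state, event_type):
    """Return the new alarm state; only an UNKNOWN alarm may transition."""
    if state != "UNKNOWN":
        return state                 # the later event gives up its transition
    if event_type == "XYZ.done":
        return "ALARM"
    if event_type == "alarm.timeout.end":
        return "OK"
    return state


def drain(events, state="UNKNOWN"):
    """Handle queued events one by one and record every state change."""
    history = [state]
    while not events.empty():
        new_state = evaluate(state, events.get())
        if new_state != state:
            state = new_state
            history.append(state)
    return history


q = queue.Queue()
q.put("XYZ.done")
q.put("alarm.timeout.end")
print(drain(q))   # ['UNKNOWN', 'ALARM']
```

Whichever event is dequeued first wins, and the other one is ignored, so the bogus double transition cannot occur.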

One possible improvement, as JD pointed out, is to avoid spawning so many threads. We can create just one thread inside the evaluator and have it handle all timeout requests from the evaluator. Is that acceptable as the event-alarm timeout solution?
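A minimal sketch of the single-thread idea, under my own assumptions: TimeoutScheduler is a hypothetical class (it is not in the patch), it keeps pending deadlines in a heap, and when a deadline passes it puts an 'alarm.timeout.end' event into the same queue the evaluator already consumes. There is no cancel operation here, because with a single queue a late timeout event is simply ignored once the alarm has left UNKNOWN.

```python
import heapq
import queue
import threading
import time


class TimeoutScheduler:
    """One background thread that serves every timeout request."""

    def __init__(self, event_queue):
        self._events = event_queue
        self._heap = []                      # entries are (deadline, alarm_id)
        self._cond = threading.Condition()
        self._stopped = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def schedule(self, alarm_id, timeout):
        with self._cond:
            heapq.heappush(self._heap, (time.monotonic() + timeout, alarm_id))
            self._cond.notify()              # wake the thread to re-check the heap

    def stop(self):
        with self._cond:
            self._stopped = True
            self._cond.notify()
        self._thread.join()

    def _run(self):
        with self._cond:
            while not self._stopped:
                if not self._heap:
                    self._cond.wait()        # nothing pending, sleep until notified
                    continue
                deadline, alarm_id = self._heap[0]
                remaining = deadline - time.monotonic()
                if remaining > 0:
                    self._cond.wait(remaining)
                    continue
                heapq.heappop(self._heap)
                self._events.put(("alarm.timeout.end", alarm_id))


events = queue.Queue()
scheduler = TimeoutScheduler(events)
scheduler.schedule("alarm-b", 0.10)
scheduler.schedule("alarm-a", 0.02)
first = events.get(timeout=2)
second = events.get(timeout=2)
scheduler.stop()
print(first, second)
```

The number of threads stays at one regardless of how many alarms have a timeout, and the timeout still arrives as an event rather than being discovered by polling.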


Best Rgds,
Edwin

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
