Ah, missed some other stuff. The bug we've seen is one that's not always visible, particularly on busy systems, because the event loop never waits very long. However, in the particular use case here, the systems are sufficiently lightly loaded and the latency requirements sufficiently strict that ATS behavior is, from their point of view, broken. I.e. there are large (10s of ms) delays on systems with no load. Fei built a fix for this, which changed the behavior of TSContSchedule, but there was blow back on that because of the behavior change (see the PR for a long discussion). To fix the production issue and satisfy requirements from other members of the community, Fei is proposing splitting the API so there is a clear distinction between "immediate" and "yield". I had earlier suggested using a zero timeout for this, but that got bonged. Therefore I think this is what we have to do to resolve the issue.
On Wed, Nov 20, 2019 at 3:52 PM Alan Carroll < solidwallofc...@verizonmedia.com> wrote: > For event processing style programming, it's standard to have two event > scheduling "styles" - an immediate "do as soon as possible", and a > "scheduled" style which is guaranteed to unwind the event processing stack > at least once. In another way, it's similar to "call" and "yield" in > co-routines. I will say I'm not sure why TSContSchedule should need a flag > - I'd prefer it always yields, and the new API be the immediate one. Part > of this is to address scw00's concern that a naive plugin writer could > schedule for 0 time in the future, presuming a yield, but in fact creating > an infinite ("dead") loop. Separating the APIs both restores existing > behavior before the 6103 PR and enables providing an API needed to fix a > production problem. > > On Wed, Nov 20, 2019 at 10:37 AM Sudheer Vinukonda > <sudheervinuko...@yahoo.com.invalid> wrote: > >> Hmm..Why’d the API need to differentiate the implementation details to >> users? Alternately, why’d someone pick an API that may have a hole? >> >> I haven’t fully analyzed and understood the proposed changes, but having >> two different API that only differ in how they are implemented under the >> hood (and one of them with an apparent hole) doesn’t sound right to me. >> >> - Sudheer >> >> >> > On Nov 20, 2019, at 8:01 AM, Fei Deng <f...@verizonmedia.com.invalid> >> wrote: >> > >> > Let me rephrase that, the new API behaves the same as the >> TSContSchedule >> > with 0 timeout after PR#6103, which will handle events as soon as >> possible. >> > While this is good for delays, it also causes the situation scw00 >> brought >> > up (dead loop). And since there is no good way of differentiating this >> > behavior with the original one where the event is put into the >> > EventQueueExternal, and waiting for the next eventloop (with possible >> 60ms >> > delay if on the same thread). So I want to change the >> > current TSContSchedule code to make sure that we are not suffering from >> a >> > dead loop, and also create a new API such that the new execute ASAP is >> > retained. >> > >> >> On Tue, Nov 19, 2019 at 6:06 PM Leif Hedstrom <zw...@apache.org> >> wrote: >> >> >> >> >> >> >> >>>> On Nov 20, 2019, at 05:52, Fei Deng <duke8...@apache.org> wrote: >> >>> >> >>> Forgot to mention that this change would restore the old behavior of >> >>> TSContSchedule minus the delay and dead loop. >> >>> >> >>>> On Tue, Nov 19, 2019 at 2:39 PM Fei Deng <duke8...@apache.org> >> wrote: >> >>>> >> >>>> While PR#6103 (https://github.com/apache/trafficserver/pull/6103) >> >> solves >> >>>> the problem of having the 60ms delay (caused by waiting in sleep), it >> >> also >> >>>> creates an issue where the event loop ends up in a "dead loop" if the >> >>>> scheduled event schedules itself with 0 timeout (see test code by >> >> scw00). >> >>>> Here's what I have in mind that will fix the problem. >> >>>> >> >>>> >> >>>> 1. New API "TSContDispatch", which will be basically the current >> >>>> "TSContSchedule" calls with 0 timeouts. >> >> >> >> >> >> Dumb question, why do we need a new API, if it behaves the same as a >> call >> >> to TSContSchedule with a timeout of 0? >> >> >> >> I’m getting pretty concerned about these changes, as I think we all >> agree, >> >> this is critical code, and it’s (as we are seeing) easy to break >> things in >> >> bad ways. >> >> >> >> — Leif >> >> >> >>>> 2. When scheduling events/continuations using "TSContSchedule" API >> >> with >> >>>> 0 timeouts, set flag to identify the event to be processed on the >> next >> >>>> event loop. >> >>>> 3. New "LocalQueue" class to handle events that are supposed to be >> >>>> processed in the next event loop. >> >>>> >> >>>> Here's the branch I'm working on right now, it shows an easy concept >> of >> >> the >> >>>> "LocalQueue". >> >>>> https://github.com/duke8253/trafficserver/tree/master-event_dispatch >> >>>> >> >> >> >> >> >>