Hi,

Re 1. Can you be more specific? What system are you using, what’s happening, and how does it break?

While delaying window firing is probably the most cost-effective solution for this particular problem, it has some disadvantages:
a) it puts even more logic into an already complicated component
b) it does not solve potential similar problems.

I can easily imagine the same issue happening in scenarios other than “interval based operators”, such as:
- input sources faster than output sinks
- data skew
- data bursts
- users’ custom operators causing data bursts
- users’ custom operators being prone to bursts (maybe something like AsyncOperator or something else that works with an external system), so the problem might not necessarily be limited to the sinks

As far as I recall, some users have reported similar issues.

Regarding the potential drawbacks of rate limiting, I didn’t understand this part:

> However the problem is similar to delay triggers which can provide degraded
> performance for skew sensitive downstream service, such as feeding feature
> extraction results to deep learning model.

The way I imagine a RateLimitingOperator is that it would take two parameters: a rate limit and a buffer size limit.

With buffer size = 0, it would cause back pressure immediately if the rate is exceeded.
With buffer size > 0, it would first buffer events in state and cause back pressure only once the max buffer size is reached.

In the WindowOperator case, if windows are evicted and removed from the state, using buffer size > 0 wouldn’t increase state usage; it would only move the state from the WindowOperator to the RateLimitingOperator.

Piotrek

> On 27 Sep 2018, at 17:28, Rong Rong <walter...@gmail.com> wrote:
>
> HI Piotrek,
>
> Yes, to be more clear,
> 1) the network I/O issue I am referring to is in between Flink and external
> sink. We did not see issues in between operators.
> 2) yes we've considered rate limiting sink functions as well which is also
> mentioned in the doc. along with some of the pro-con we identified.
>
> This kind of problem seems to only occur in WindowOperator so far, but yes
> it can probably occur to any aligned interval based operator.
>
> --
> Rong
>
> On Wed, Sep 26, 2018 at 11:44 PM Piotr Nowojski <pi...@data-artisans.com>
> wrote:
>
>> Hi,
>>
>> Thanks for the proposal. Could you provide more
>> background/explanation/motivation why do you need such feature? What do you
>> mean by “network I/O” degradation?
>>
>> On it’s own burst writes shouldn’t cause problems within Flink. If they
>> do, we might want to fix the original underlying problem and if they are
>> causing problems in external systems, we also might think about other
>> approaches to fix/handle the problem (write rate limiting?), which might be
>> more general and not fixing only bursts originating from WindowOperator.
>> I’m not saying that your proposal is bad or anything, but I would just like
>> to have more context :)
>>
>> Piotrek.
>>
>>> On 26 Sep 2018, at 19:21, Rong Rong <walter...@gmail.com> wrote:
>>>
>>> Hi Dev,
>>>
>>> I was wondering if there's any previous discussion regarding how to handle
>>> burst network I/O when deploying Flink applications with window operators.
>>>
>>> We've recently see some significant network I/O degradation when trying to
>>> use sliding window to perform rolling aggregations. The pattern is very
>>> periodic: output connections get no traffic for a period of time until a
>>> burst at window boundaries (in our case every 5 minutes).
>>>
>>> We have drafted a doc
>>> <https://docs.google.com/document/d/1fEhbcRgxxX8zFYD_iMBG1DCbHmTcTRfRQFXelPhMFiY/edit?usp=sharing>
>>> on
>>> how we proposed to handle it to smooth the output traffic spikes. Please
>>> kindly take a look, any comments and suggestions are highly appreciated.
>>>
>>> --
>>> Rong
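
To make the RateLimitingOperator semantics from the top of this message concrete, here is a minimal Python sketch. It only models the two parameters (rate limit, buffer size limit) over discrete ticks. The names `RateLimitingBuffer`, `offer` and `tick` are made up for illustration and are not a Flink API; a real operator would presumably keep the buffer in Flink state and block its input instead of returning a status string.

```python
from collections import deque


class RateLimitingBuffer:
    """Hypothetical model of the rate limiting idea; not a Flink class."""

    def __init__(self, rate_per_tick, buffer_size):
        self.rate_per_tick = rate_per_tick  # max events emitted downstream per tick
        self.buffer_size = buffer_size      # max events held before back pressure
        self.buffer = deque()
        self.emitted_this_tick = 0

    def offer(self, event):
        """Try to accept one event; returns 'emitted', 'buffered' or 'backpressure'."""
        # Emit directly only if nothing is queued ahead (keeps ordering)
        # and the per-tick budget is not used up yet.
        if not self.buffer and self.emitted_this_tick < self.rate_per_tick:
            self.emitted_this_tick += 1
            return "emitted"
        # Rate exceeded: hold the event if the buffer still has room.
        if len(self.buffer) < self.buffer_size:
            self.buffer.append(event)
            return "buffered"
        # Buffer full (or buffer_size == 0): the caller has to wait.
        return "backpressure"

    def tick(self):
        """Start a new time unit and drain buffered events within the budget."""
        self.emitted_this_tick = 0
        drained = []
        while self.buffer and self.emitted_this_tick < self.rate_per_tick:
            drained.append(self.buffer.popleft())
            self.emitted_this_tick += 1
        return drained


# buffer size = 0: back pressure as soon as the rate is exceeded
strict = RateLimitingBuffer(rate_per_tick=2, buffer_size=0)
strict_results = [strict.offer(i) for i in range(4)]

# buffer size > 0: events are buffered first, back pressure only when full
buffered = RateLimitingBuffer(rate_per_tick=2, buffer_size=3)
buffered_results = [buffered.offer(i) for i in range(6)]
drained = buffered.tick()
```

With a burst of six events and a budget of two per tick, the buffered variant emits two, holds three, and pushes back only on the sixth; the next tick then releases the two oldest buffered events.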