I'm big +1 for this feature. 1. Limit the input qps. 2. Change log level for debug.
in my team, the two examples above are needed JING ZHANG <beyond1...@gmail.com> 于2021年6月8日周二 上午11:18写道: > Thanks Jiangang for bringing this up. > As mentioned in Jiangang's email, `dynamic configuration framework` > provides many useful functions in Kuaishou, because it could update job > behavior without relaunching the job. The functions are very popular in > Kuaishou, we also see similar demands in maillist [1]. > > I'm big +1 for this feature. > > Thanks Xintong and Yun for deep thoughts about the issue. I like the idea > about introducing control mode in Flink. > It takes the original issue a big step closer to essence which also > provides the possibility for more fantastic features as mentioned in > Xintong and Jark's response. > Based on the idea, there are at least two milestones to achieve the goals > which were proposed by Jiangang: > (1) Build a common control flow framework in Flink. > It focuses on control flow propagation. And, how to integrate the > common control flow framework with existing mechanisms. > (2) Builds a dynamic configuration framework which is exposed to users > directly. > We could see dynamic configuration framework is a top application on > the underlying control flow framework. > It focuses on the Public API which receives configuration updating > requests from users. Besides, it is necessary to introduce an API > protection mechanism to avoid job performance degradation caused by too > many control events. > > I suggest splitting the whole design into two after we reach a consensus > on whether to introduce this feature because these two sub-topic all need > careful design. > > > [ > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Dynamic-configuration-of-Flink-checkpoint-interval-td44059.html > ] > > Best regards, > JING ZHANG > > 刘建刚 <liujiangangp...@gmail.com> 于2021年6月8日周二 上午10:01写道: > >> Thanks Xintong Song for the detailed supplement. Since flink is >> long-running, it is similar to many services. So interacting with it or >> controlling it is a common desire. This was our initial thought when >> implementing the feature. In our inner flink, many configs used in yaml can >> be adjusted by dynamic to avoid restarting the job, for examples as follow: >> >> 1. Limit the input qps. >> 2. Degrade the job by sampling and so on. >> 3. Reset kafka offset in certain cases. >> 4. Stop checkpoint in certain cases. >> 5. Control the history consuming. >> 6. Change log level for debug. >> >> >> After deep discussion, we realize that a common control flow >> will benefit both users and developers. Dynamic config is just one of the >> use cases. For the concrete design and implementation, it relates with many >> components, like jobmaster, network channel, operators and so on, which >> needs deeper consideration and design. >> >> Xintong Song [via Apache Flink User Mailing List archive.] < >> ml+s2336050n44245...@n4.nabble.com> 于2021年6月7日周一 下午2:52写道: >> >>> Thanks Jiangang for bringing this up, and Steven & Peter for the >>> feedback. >>> >>> I was part of the preliminary offline discussions before this proposal >>> went public. So maybe I can help clarify things a bit. >>> >>> In short, despite the phrase "control mode" might be a bit misleading, >>> what we truly want to do from my side is to make the concept of "control >>> flow" explicit and expose it to users. >>> >>> ## Background >>> Jiangang & his colleagues at Kuaishou maintain an internal version of >>> Flink. One of their custom features is allowing dynamically changing >>> operator behaviors via the REST APIs. He's willing to contribute this >>> feature to the community, and came to Yun Gao and me for suggestions. After >>> discussion, we feel that the underlying question to be answered is how do >>> we model the control flow in Flink. Dynamically controlling jobs via REST >>> API can be one of the features built on top of the control flow, and there >>> could be others. >>> >>> ## Control flow >>> Control flow refers to the communication channels for sending >>> events/signals to/between tasks/operators, that changes Flink's behavior in >>> a way that may or may not affect the computation logic. Typical control >>> events/signals Flink currently has are watermarks and checkpoint barriers. >>> >>> In general, for modeling control flow, the following questions should be >>> considered. >>> 1. Who (which component) is responsible for generating the control >>> messages? >>> 2. Who (which component) is responsible for reacting to the messages. >>> 3. How do the messages propagate? >>> 4. When it comes to affecting the computation logics, how should the >>> control flow work together with the exact-once consistency. >>> >>> 1) & 2) may vary depending on the use cases, while 3) & 4) probably >>> share many things in common. A unified control flow model would help >>> deduplicate the common logics, allowing us to focus on the use case >>> specific parts. >>> >>> E.g., >>> - Watermarks: generated by source operators, handled by window operators. >>> - Checkpoint barrier: generated by the checkpoint coordinator, handled >>> by all tasks >>> - Dynamic controlling: generated by JobMaster (in reaction to the REST >>> command), handled by specific operators/UDFs >>> - Operator defined events: The following features are still in planning, >>> but may potentially benefit from the control flow model. (Please correct me >>> if I'm wrong, @Yun, @Jark) >>> * Iteration: When a certain condition is met, we might want to signal >>> downstream operators with an event >>> * Mini-batch assembling: Flink currently uses special watermarks for >>> indicating the end of each mini-batch, which makes it tricky to deal with >>> event time related computations. >>> * Hive dimension table join: For periodically reloaded hive tables, it >>> would be helpful to have specific events signaling that a reloading is >>> finished. >>> * Bootstrap dimension table join: This is similar to the previous one. >>> In cases where we want to fully load the dimension table before starting >>> joining the mainstream, it would be helpful to have an event signaling the >>> finishing of the bootstrap. >>> >>> ## Dynamic REST controlling >>> Back to the specific feature that Jiangang proposed, I personally think >>> it's quite convenient. Currently, to dynamically change the behavior of an >>> operator, we need to set up a separate source for the control events and >>> leverage broadcast state. Being able to send the events via REST APIs >>> definitely improves the usability. >>> >>> Leveraging dynamic configuration frameworks is for sure one possible >>> approach. The reason we are in favor of introducing the control flow is >>> that: >>> - It benefits not only this specific dynamic controlling feature, but >>> potentially other future features as well. >>> - AFAICS, it's non-trivial to make a 3rd-party dynamic configuration >>> framework work together with Flink's consistency mechanism. >>> >>> Thank you~ >>> >>> Xintong Song >>> >>> >>> >>> On Mon, Jun 7, 2021 at 11:05 AM 刘建刚 <[hidden email] >>> <http:///user/SendEmail.jtp?type=node&node=44245&i=0>> wrote: >>> >>>> Thank you for the reply. I have checked the post you mentioned. The >>>> dynamic config may be useful sometimes. But it is hard to keep data >>>> consistent in flink, for example, what if the dynamic config will take >>>> effect when failover. Since dynamic config is a desire for users, maybe >>>> flink can support it in some way. >>>> >>>> For the control mode, dynamic config is just one of the control modes. >>>> In the google doc, I have list some other cases. For example, control >>>> events are generated in operators or external services. Besides user's >>>> dynamic config, flink system can support some common dynamic configuration, >>>> like qps limit, checkpoint control and so on. >>>> >>>> It needs good design to handle the control mode structure. Based on >>>> that, other control features can be added easily later, like changing log >>>> level when job is running. In the end, flink will not just process data, >>>> but also interact with users to receive control events like a service. >>>> >>>> Steven Wu <[hidden email] >>>> <http:///user/SendEmail.jtp?type=node&node=44245&i=1>> 于2021年6月4日周五 >>>> 下午11:11写道: >>>> >>>>> I am not sure if we should solve this problem in Flink. This is more >>>>> like a dynamic config problem that probably should be solved by some >>>>> configuration framework. Here is one post from google search: >>>>> https://medium.com/twodigits/dynamic-app-configuration-inject-configuration-at-run-time-using-spring-boot-and-docker-ffb42631852a >>>>> >>>>> On Fri, Jun 4, 2021 at 7:09 AM 刘建刚 <[hidden email] >>>>> <http:///user/SendEmail.jtp?type=node&node=44245&i=2>> wrote: >>>>> >>>>>> Hi everyone, >>>>>> >>>>>> Flink jobs are always long-running. When the job is running, >>>>>> users may want to control the job but not stop it. The control reasons >>>>>> can >>>>>> be different as following: >>>>>> >>>>>> 1. >>>>>> >>>>>> Change data processing’ logic, such as filter condition. >>>>>> 2. >>>>>> >>>>>> Send trigger events to make the progress forward. >>>>>> 3. >>>>>> >>>>>> Define some tools to degrade the job, such as limit input qps, >>>>>> sampling data. >>>>>> 4. >>>>>> >>>>>> Change log level to debug current problem. >>>>>> >>>>>> The common way to do this is to stop the job, do modifications >>>>>> and start the job. It may take a long time to recover. In some >>>>>> situations, >>>>>> stopping jobs is intolerable, for example, the job is related to money or >>>>>> important activities.So we need some technologies to control the >>>>>> running job without stopping the job. >>>>>> >>>>>> >>>>>> We propose to add control mode for flink. A control mode based on the >>>>>> restful interface is first introduced. It works by these steps: >>>>>> >>>>>> >>>>>> 1. The user can predefine some logic which supports config >>>>>> control, such as filter condition. >>>>>> 2. Run the job. >>>>>> 3. If the user wants to change the job's running logic, just send >>>>>> a restful request with the responding config. >>>>>> >>>>>> Other control modes will also be considered in the future. More >>>>>> introduction can refer to the doc >>>>>> https://docs.google.com/document/d/1WSU3Tw-pSOcblm3vhKFYApzVkb-UQ3kxso8c8jEzIuA/edit?usp=sharing >>>>>> . If the community likes the proposal, more discussion is needed and a >>>>>> more >>>>>> detailed design will be given later. Any suggestions and ideas are >>>>>> welcome. >>>>>> >>>>>> >>> >>> ------------------------------ >>> If you reply to this email, your message will be added to the discussion >>> below: >>> >>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Add-control-mode-for-flink-tp44203p44245.html >>> To start a new topic under Apache Flink User Mailing List archive., >>> email ml+s2336050n1...@n4.nabble.com >>> To unsubscribe from Apache Flink User Mailing List archive., click here >>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=1&code=bGl1amlhbmdhbmdwZW5nQGdtYWlsLmNvbXwxfC0xMTYwNzM3MjI=> >>> . >>> NAML >>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml> >>> >>