Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-19 Thread Yun Gao
Hi, Thanks Xuannan for the clarification, I also have no other issues~ Best, Yun -- Original Mail -- Sender: Xuannan Su Send Date: Wed Jan 19 11:35:13 2022 Recipients: Flink Dev Subject: Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing Hi

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-18 Thread Xuannan Su
Hi devs, Thank you all for the discussion. If there are no objections or feedback, I would like to start the vote thread tomorrow. Best, Xuannan On Tue, Jan 18, 2022 at 8:12 PM Xuannan Su wrote: > > Hi Yun, > > Thanks for your questions. > > 1. I think the execute_and_collect is an API on the D

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-18 Thread Xuannan Su
Hi Yun, Thanks for your questions. 1. I think the execute_and_collect is an API on the DataStream, which adds a collect sink to the DataStream and invokes StreamExecutionEnvironment#execute. It is a convenient method to execute a job and get an iterator of the result. 2. As our offline discussio
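The message above describes execute_and_collect as a convenience that appends a collect sink to the stream and then submits the job, returning an iterator over the results. Below is a toy, Flink-free Python sketch of that idea; every class and method name here is an illustrative stand-in, not the real PyFlink API.

```python
# Toy model of "execute_and_collect": register a collecting sink on the
# stream, submit the job via the environment, and return an iterator
# over the collected results. Purely illustrative, not Flink code.

class ToyStream:
    def __init__(self, env, elements):
        self.env = env
        self.elements = elements
        self.sinks = []

    def map(self, fn):
        return ToyStream(self.env, [fn(e) for e in self.elements])

    def execute_and_collect(self):
        collected = []
        self.sinks.append(collected.append)  # the implicit "collect sink"
        self.env.execute(self)               # submit the job as usual
        return iter(collected)


class ToyEnv:
    def from_collection(self, elements):
        return ToyStream(self, list(elements))

    def execute(self, stream):
        # A real runtime would schedule tasks; here we just push every
        # element of the final stream into its registered sinks.
        for sink in stream.sinks:
            for e in stream.elements:
                sink(e)


env = ToyEnv()
result = list(env.from_collection([1, 2, 3]).map(lambda x: x * 2).execute_and_collect())
print(result)  # [2, 4, 6]
```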

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-17 Thread David Morávek
+1 Thanks Xuannan, no more objections from my side On Mon, Jan 17, 2022 at 7:14 AM Yun Gao wrote: > Hi Xuannan, > > Very thanks for the detailed explanation and sorry for the very > late response. > > For cached result partition v.s. managed table, I also agree with > the current conclusion that

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-16 Thread Yun Gao
Hi Xuannan, Many thanks for the detailed explanation and sorry for the very late response. For cached result partition vs. managed table, I also agree with the current conclusion that they can be differentiated at the moment: a cached result partition could be viewed as an internal, lightweig

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-16 Thread Xuannan Su
Hi David, Thanks for pointing out FLIP-187. After reading the FLIP, I think it can solve the problem of choosing the proper parallelism, and thus it should be fine not to provide a method to set the parallelism of the cache. And your understanding of the outcome of this FLIP is correct. If the

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-14 Thread David Morávek
Hi Xuannan, I think this already looks really good. The whole discussion is pretty long, so I'll just summarize my current understanding of the outcome: - This only aims at the DataStream API for now, but it can be used as a building block for higher-level abstractions (Table API). - We're p

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-10 Thread Xuannan Su
Hi Gen, Thanks for your feedback. I think you are talking about how we are going to store the cached data. The first option is to write the data with a sink to an external file system, much like the file store of the Dynamic Table. If I understand correctly, it requires a distributed file system

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-06 Thread Gen Luo
Hi Xuannan, Thanks for the reply. I do agree that dynamic tables and cached partitions are similar features aimed at different use cases. In my opinion, the main difference between the implementations is whether to cache only the data or the whole result partition. To cache only the data, we can translate the Cac

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-05 Thread Xuannan Su
Hi Yun, Thanks for your feedback! 1. With a cached stream, compilation and job submission happen as in a regular job submission. And a job with multiple concurrent cached DataStreams is supported. For your example, a and b are run in the same job. Thus, all the cached DataStreams are created when th
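The lifecycle described above, where a cached DataStream is materialized when the job that produces it first runs and is then reused instead of being recomputed, can be modeled with a small Flink-free Python sketch. All names here are illustrative assumptions, not the proposed API.

```python
# Toy model of the cache lifecycle: the upstream pipeline runs once to
# populate the cache; subsequent executions read the cached result
# instead of recomputing it. Purely illustrative, not Flink code.

class CachedStream:
    def __init__(self, compute):
        self._compute = compute   # the (expensive) upstream pipeline
        self._cache = None        # filled after the first execution
        self.recomputations = 0

    def execute(self):
        if self._cache is None:
            self.recomputations += 1
            self._cache = list(self._compute())  # materialize the partition
        return list(self._cache)                 # later jobs read the cache


expensive = CachedStream(lambda: (x * x for x in range(5)))
a = expensive.execute()  # first job: computes and caches
b = expensive.execute()  # second job: served from the cache
print(a, expensive.recomputations)  # [0, 1, 4, 9, 16] 1
```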

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-05 Thread Zhipeng Zhang
Hi Xuannan, Thanks for the reply. Regarding whether and how to support caching a side output, I agree that the second option might be better if there does exist a use case where users need to cache only certain side outputs. Xuannan Su wrote on Tue, Jan 4, 2022 at 15:50: > Hi Zhipeng and Gen, > > Thanks for

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-05 Thread Yun Gao
Hi Xuannan, Many thanks for drafting the FLIP and initiating the discussion! I have several questions, sorry if I have misunderstandings: 1. With the cached stream, when would the compilation and job submission happen? Does it happen on calling execute_and_cache()? If so, could we support the job wit

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-04 Thread Xuannan Su
Hi David, We have a description in the FLIP about the case of TM failure without the remote shuffle service. Basically, since the partitions are stored at the TM, a TM failure requires recomputing the intermediate result. If a Flink job uses the remote shuffle service, the partitions are stored a
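The failure handling described above, where a TM failure drops the locally stored partition and forces recomputation of the intermediate result, can be illustrated with a small stand-alone sketch. The class and method names are hypothetical, chosen only for illustration.

```python
# Toy illustration of cache invalidation on TaskManager failure:
# without a remote shuffle service, the cached partition lives on the
# TM, so losing the TM means the result must be recomputed on next
# access. Purely illustrative, not Flink code.

class CachedPartition:
    def __init__(self, compute):
        self._compute = compute
        self._data = None
        self.recomputations = 0

    def read(self):
        if self._data is None:  # cache missing or lost
            self.recomputations += 1
            self._data = list(self._compute())
        return list(self._data)

    def on_tm_failure(self):
        # The partition was stored on the failed TM, so invalidate it.
        self._data = None


p = CachedPartition(lambda: range(3))
first = p.read()     # computes once
p.on_tm_failure()    # simulated TaskManager loss
second = p.read()    # recomputed from the original pipeline
print(p.recomputations)  # 2
```

With a remote shuffle service, on_tm_failure would be a no-op for reads, since the partition outlives the TM.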

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-03 Thread Xuannan Su
Hi Zhipeng and Gen, Thanks for joining the discussion. For Zhipeng: - Can we support side output Caching the side output is indeed a valid use case. However, with the current API, it is not straightforward to cache the side output. You can apply an identity map function to the DataStream returne
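The workaround sketched above, applying an identity map to the stream returned by getSideOutput so that the resulting regular stream can be cached, looks roughly like the following toy sketch. The classes here are illustrative stand-ins for Flink's SingleOutputStreamOperator and DataStream, not the real APIs.

```python
# Toy sketch of caching a side output via an identity map: cache() is
# assumed to apply only to a regular stream, so the side output is
# first routed through map(identity). Purely illustrative, not Flink.

class Stream:
    def __init__(self, elements):
        self.elements = list(elements)
        self.cached = False

    def map(self, fn):
        return Stream(fn(e) for e in self.elements)

    def cache(self):
        self.cached = True  # pretend this marks the partition for caching
        return self


class TwoOutputOperator:
    """Stand-in for an operator with one main and one side output."""
    def __init__(self, elements, to_side):
        self.main = Stream(e for e in elements if not to_side(e))
        self._side = Stream(e for e in elements if to_side(e))

    def get_side_output(self):
        return self._side


op = TwoOutputOperator(range(6), lambda x: x % 2 == 1)
# The identity map yields a plain stream, which can then be cached.
cached_side = op.get_side_output().map(lambda x: x).cache()
print(cached_side.elements, cached_side.cached)  # [1, 3, 5] True
```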

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2022-01-03 Thread David Morávek
One more question from my side, should we make sure this plays well with the remote shuffle service [1] in case of TM failure? [1] https://github.com/flink-extended/flink-remote-shuffle D. On Thu, Dec 30, 2021 at 11:59 AM Gen Luo wrote: > Hi Xuannan, > > I found FLIP-188[1] that is aiming to i

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2021-12-30 Thread Gen Luo
Hi Xuannan, I found FLIP-188 [1], which aims to introduce built-in dynamic table storage providing a unified changelog & table representation. Tables stored there can be used in further ad-hoc queries. To my understanding, it's quite like an implementation of caching in the Table API, and th

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2021-12-30 Thread Zhipeng Zhang
Hi Xuannan, Thanks for starting the discussion. This would certainly help a lot with both efficiency and reproducibility in machine learning cases :) I have a few questions as follows: 1. Can we support caching both the output and side outputs of a SingleOutputStreamOperator (which I believe is a r

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2021-12-29 Thread Xuannan Su
Hi David, Thanks for sharing your thoughts. You are right that most people tend to use the high-level API for interactive data exploration. Actually, FLIP-36 [1] covers the cache API at the Table/SQL API level. As far as I know, it has been accepted but hasn't been implemented. At the time when

Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing

2021-12-29 Thread David Morávek
Hi Xuannan, thanks for drafting this FLIP. One immediate thought: from what I've seen of interactive data exploration with Spark, most people tend to use the higher-level APIs that allow for faster prototyping (the Table API in Flink's case). Should the Table API also be covered by this FLIP? Best