Hi,
Thanks, Xuannan, for the clarification. I also have no other issues.
Best,
Yun
-- Original Mail --
Sender: Xuannan Su
Send Date: Wed Jan 19 11:35:13 2022
Recipients: Flink Dev
Subject:Re: [DISCUSS] FLIP-205: Support cache in DataStream for Batch Processing
Hi devs,
Thank you all for the discussion.
If there are no objections or feedback, I would like to start the vote
thread tomorrow.
Best,
Xuannan
On Tue, Jan 18, 2022 at 8:12 PM Xuannan Su wrote:
Hi Yun,
Thanks for your questions.
1. I think the execute_and_collect is an API on the DataStream, which
adds a collect sink to the DataStream and invokes
StreamExecutionEnvironment#execute. It is a convenient method to
execute a job and get an iterator of the result.
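For illustration, here is a plain-Java sketch of that "collect sink plus iterator" pattern — not Flink code; the method name, the source list, and the transform are all invented for this example. "Executing" the pipeline materializes every result, and the caller then consumes them through an iterator:

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class CollectSketch {
    // Hypothetical stand-in for a pipeline: a source plus one transformation.
    static <I, O> Iterator<O> executeAndCollect(List<I> source, Function<I, O> transform) {
        // "Executing" the job materializes every result into a collection,
        // then exposes it through an iterator, as a collect sink would.
        List<O> results = source.stream().map(transform).collect(Collectors.toList());
        return results.iterator();
    }

    public static void main(String[] args) {
        Iterator<Integer> it = executeAndCollect(List.of(1, 2, 3), x -> x * 2);
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```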
2. As our offline discussio
+1 Thanks Xuannan, no more objections from my side
On Mon, Jan 17, 2022 at 7:14 AM Yun Gao wrote:
Hi Xuannan,
Many thanks for the detailed explanation, and sorry for the very
late response.
For cached result partition vs. managed table, I also agree with
the current conclusion that they could be differentiated at the moment:
cached result partition could be viewed as an internal, lightweig
Hi David,
Thanks for pointing out FLIP-187. After reading the FLIP, I think it
can solve the problem of choosing the proper parallelism, and thus it
should be fine not to provide a method to set the parallelism of the
cache.
And your understanding of the outcome of this FLIP is correct.
If the
Hi Xuannan,
I think this already looks really good. The whole discussion is pretty
long, so I'll just summarize my current understanding of the outcome:
- This only aims at the DataStream API for now, but can be used as a
building block for higher-level abstractions (Table API).
- We're p
Hi Gen,
Thanks for your feedback.
I think you are talking about how we are going to store the cached
data. The first option is to write the data with a sink to an external
file system, much like the file store of the Dynamic Table. If I
understand correctly, it requires a distributed file system
Hi Xuannan,
Thanks for the reply.
I do agree that dynamic tables and cached partitions are similar features
aimed at different cases. In my opinion, the main difference between the
implementations is whether to cache only the data or the whole result partition.
To cache only the data, we can translate the Cac
Hi Yun,
Thanks for your feedback!
1. With the cached stream, compilation and job submission happen as in a
regular job submission. And a job with multiple concurrent cached
DataStreams is supported. For your example, a and b are run in the same
job. Thus, all the cached DataStreams are created when th
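To make the recomputation-avoidance point concrete, here is a plain-Java analogy — not the proposed Flink API; `expensive()` and the run counter are invented for this sketch. Without caching, every consumer of an intermediate result re-runs the producing transformation; caching materializes it once and lets both consumers reuse it:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

public class CacheSketch {
    static final AtomicInteger runs = new AtomicInteger();

    // Stand-in for an expensive intermediate transformation.
    static List<Integer> expensive(List<Integer> input) {
        runs.incrementAndGet();
        return input.stream().map(x -> x * x).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> source = List.of(1, 2, 3);

        // Without caching: two consumers trigger two runs of the transformation.
        int sum = expensive(source).stream().mapToInt(Integer::intValue).sum();
        long count = expensive(source).stream().filter(x -> x > 1).count();

        // With caching: materialize once, and both consumers reuse the result.
        List<Integer> cached = expensive(source);
        int sum2 = cached.stream().mapToInt(Integer::intValue).sum();
        long count2 = cached.stream().filter(x -> x > 1).count();

        System.out.println(runs.get()); // 3 runs total: 2 uncached + 1 cached
    }
}
```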
Hi Xuannan,
Thanks for the reply.
Regarding whether and how to support caching side outputs, I agree that the
second option might be better if there does exist a use case where users need
to cache only certain side outputs.
Xuannan Su wrote on Tue, Jan 4, 2022, at 15:50:
Hi Xuannan,
Many thanks for drafting the FLIP and initiating the discussion!
I have several questions; sorry if I have any misunderstandings:
1. With the cached stream, when would the compilation and job submission
happen? Does it happen on calling execute_and_cache()? If so, could
we support the job wit
Hi David,
We have a description in the FLIP about the case of TM failure without
the remote shuffle service. Basically, since the partitions are stored
at the TM, a TM failure requires recomputing the intermediate result.
If a Flink job uses the remote shuffle service, the partitions are
stored a
Hi Zhipeng and Gen,
Thanks for joining the discussion.
For Zhipeng:
- Can we support side output
Caching the side output is indeed a valid use case. However, with the
current API, it is not straightforward to cache the side output. You
can apply an identity map function to the DataStream returne
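The identity-map workaround can be sketched with invented stand-in types — none of these are real Flink classes, and `cache()` here is only a placeholder for the proposed API. The point is that a plain stream (think the result of getSideOutput) exposes no cache(), but mapping it, even with identity, yields an operator-style stream that does:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Minimal stand-in types, invented for this sketch; not Flink classes.
public class SideOutputCacheSketch {

    // A plain stream of elements (think DataStream): no cache() available.
    static class Stream<T> {
        final List<T> elements;
        Stream(List<T> elements) { this.elements = elements; }

        // Applying any map, including identity, yields an "operator" stream
        // (think SingleOutputStreamOperator), which does expose cache().
        <R> OperatorStream<R> map(Function<T, R> f) {
            return new OperatorStream<>(elements.stream().map(f).collect(Collectors.toList()));
        }
    }

    static class OperatorStream<T> extends Stream<T> {
        OperatorStream(List<T> elements) { super(elements); }
        OperatorStream<T> cache() { return this; } // placeholder: would mark the stream as cached
    }

    public static void main(String[] args) {
        // A side output comes back as a plain Stream, which cannot be cached directly...
        Stream<String> sideOutput = new Stream<>(List.of("a", "b"));
        // ...so wrap it in an identity map to obtain a cacheable operator stream.
        OperatorStream<String> cacheable = sideOutput.map(x -> x).cache();
        System.out.println(cacheable.elements);
    }
}
```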
One more question from my side: should we make sure this plays well with
the remote shuffle service [1] in case of TM failure?
[1] https://github.com/flink-extended/flink-remote-shuffle
D.
On Thu, Dec 30, 2021 at 11:59 AM Gen Luo wrote:
Hi Xuannan,
I found FLIP-188[1] that is aiming to introduce a built-in dynamic table
storage, which provides a unified changelog & table representation. Tables
stored there can be used in further ad-hoc queries. To my understanding,
it's quite like an implementation of caching in Table API, and th
Hi Xuannan,
Thanks for starting the discussion. This would certainly help a lot with both
efficiency and reproducibility in machine learning cases :)
I have a few questions as follows:
1. Can we support caching both the output and side outputs of a
SingleOutputStreamOperator (which I believe is a r
Hi David,
Thanks for sharing your thoughts.
You are right that most people tend to use high-level APIs for
interactive data exploration. Actually, FLIP-36 [1] already covers
a cache API for the Table/SQL API. As far as I
know, it has been accepted but hasn’t been implemented. At the time
when
Hi Xuannan,
thanks for drafting this FLIP.
One immediate thought: from what I've seen of interactive data exploration
with Spark, most people tend to use the higher-level APIs that allow for
faster prototyping (the Table API in Flink's case). Should the Table API also
be covered by this FLIP?
Best