Re: Re: Questions Flink DataStream in BATCH execution mode scalability advice

2021-05-19 Thread Yun Gao
Hi Marco, I think Flink does not need 500GB for the source; the source should be able to read from S3 in a streaming pattern (namely, open the file, create an input stream, and fetch data as required). But it might indeed need disk space for intermediate data between operators and the sort operator ...
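
A minimal sketch of the streaming-read pattern described above, assuming a Flink 1.13-era job with the S3 filesystem plugin available; the bucket path and class names are placeholders, not taken from the original thread:

    import org.apache.flink.api.common.RuntimeExecutionMode;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.connector.file.src.FileSource;
    import org.apache.flink.connector.file.src.reader.TextLineFormat;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class S3BatchReadSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Bounded input + BATCH mode: intermediate data between stages may be
            // spilled to local disk instead of being held in memory.
            env.setRuntimeMode(RuntimeExecutionMode.BATCH);

            // The FileSource opens each file and reads records as a stream,
            // so the 500GB input never has to fit in memory at once.
            FileSource<String> source = FileSource
                    .forRecordStreamFormat(new TextLineFormat(), new Path("s3://my-bucket/time-series/"))
                    .build();

            DataStream<String> lines =
                    env.fromSource(source, WatermarkStrategy.noWatermarks(), "s3-lines");

            lines.print();
            env.execute("s3-batch-read-sketch");
        }
    }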

Re: Questions Flink DataStream in BATCH execution mode scalability advice

2021-05-19 Thread Marco Villalobos
> On May 19, 2021, at 7:26 AM, Yun Gao wrote: Hi Marco, For the remaining issues: 1. For the aggregation, the 500GB of files are not required to fit into memory. Roughly speaking, for the keyed().window().reduce(), the input records would first be sorted according to the ...

Re: Questions Flink DataStream in BATCH execution mode scalability advice

2021-05-19 Thread Yun Gao
Hi Marco, For the remaining issues: 1. For the aggregation, the 500GB of files are not required to fit into memory. Roughly speaking, for the keyed().window().reduce(), the input records would first be sorted according to the key (time_series.name) via external sorts, which only consumes a fixed ...
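
A minimal sketch of the keyed().window().reduce() shape being discussed, assuming a hypothetical TimeSeriesPoint POJO with name, timestamp and value fields; the field names, window size and aggregation are placeholders for illustration only:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class KeyedWindowReduceSketch {

        // Hypothetical record type, used only for illustration.
        public static class TimeSeriesPoint {
            public String name;      // corresponds to time_series.name, the grouping key
            public long timestamp;   // event time in epoch millis
            public double value;
        }

        public static DataStream<TimeSeriesPoint> sumPerNamePerHour(DataStream<TimeSeriesPoint> points) {
            return points
                    .assignTimestampsAndWatermarks(
                            WatermarkStrategy.<TimeSeriesPoint>forMonotonousTimestamps()
                                    .withTimestampAssigner((point, ts) -> point.timestamp))
                    // In BATCH mode the records are grouped by this key via an external,
                    // disk-backed sort, so memory usage stays bounded regardless of input size.
                    .keyBy(point -> point.name)
                    .window(TumblingEventTimeWindows.of(Time.hours(1)))
                    // reduce() combines records incrementally, one pair at a time,
                    // so a window never needs to buffer all of its input.
                    .reduce((a, b) -> {
                        TimeSeriesPoint out = new TimeSeriesPoint();
                        out.name = a.name;
                        out.timestamp = Math.max(a.timestamp, b.timestamp);
                        out.value = a.value + b.value;
                        return out;
                    });
        }
    }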