Thanks, I'll check more about job tuning.

On Mon, 16 Aug 2021 at 06:28, Caizhi Weng <tsreape...@gmail.com> wrote:
> Hi!
>
>> if I use parallelism of 2 or 4 - it takes the same time.
>
> It might be that there is no data in some parallel instances. You can
> click on the nodes in the Flink web UI and see whether this is the case
> for each parallel instance, or you can check the metrics of each operator.
>
>> if I don't increase parallelism and just run the job on a fixed number
>> of task slots, the job will fail (due to lack of memory on the task
>> manager) or it will just take longer time to process the data?
>
> It depends on many aspects, such as the type of source you are using,
> the type of operators you are running, etc. Ideally we hope it will just
> take longer, but some specific operators or connectors might fail. This
> is where users have to tune their jobs.
>
> Gorjan Todorovski <gor...@gmail.com> wrote on Fri, 13 Aug 2021 at 18:48:
>
>> Hi!
>>
>> I want to set up a Flink cluster as a native Kubernetes session
>> cluster, with the intention of executing Apache Beam jobs that process
>> only batch data, but I am not sure I understand how I would scale the
>> cluster if I need to process large datasets.
>>
>> My understanding is that to process a bigger dataset, you could run the
>> job with higher parallelism, so the processing is spread over multiple
>> task slots, which may span multiple nodes. But the Beam jobs in my case
>> execute TensorFlow Extended pipelines, so I have no control over
>> partitioning by keys, and I see no difference in throughput (the time
>> it takes to process a specific dataset) between a parallelism of 2 and
>> 4 - it takes the same time.
>>
>> Also, since the execution is of type "PIPELINED", does this mean that
>> if I don't increase parallelism and just run the job on a fixed number
>> of task slots, the job will fail (due to lack of memory on the task
>> manager), or will it just take longer to process the data?
>>
>> Thanks,
>> Gorjan
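A side note on Caizhi's point about empty parallel instances: keyed records are routed to parallel subtasks by hashing the key, so a dataset with only a few distinct keys can leave some subtasks with no data at all, no matter how high the parallelism is set. Here is a minimal illustrative sketch in plain Python, using simple modulo hashing as a stand-in for Flink's actual key-group assignment (the real mechanism differs, but the skew effect is the same):

```python
# Sketch only: simulate hash-partitioning keyed records across subtasks
# to show why raising parallelism may not help when keys are few/skewed.
from collections import Counter

def partition(keys, parallelism):
    """Assign each key to a subtask index by hash(key) mod parallelism."""
    counts = Counter(hash(k) % parallelism for k in keys)
    # Subtasks that received no records at all:
    idle = [i for i in range(parallelism) if counts[i] == 0]
    return counts, idle

# Only two distinct keys: with parallelism 4, at least two subtasks
# must stay idle, so going from parallelism 2 to 4 cannot speed things up.
keys = [1, 2] * 1000
counts, idle = partition(keys, 4)
print(sorted(idle))  # [0, 3]
```

If the web UI shows a pattern like this (some subtasks processing zero records), the bottleneck is key distribution, not the number of task slots.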