Hi! I want to deploy a Flink cluster as a native Kubernetes session cluster, with the intention of executing Apache Beam jobs that process only batch data, but I am not sure I understand how I would scale the cluster when I need to process large datasets.
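For reference, I submit the Beam jobs from Python along these lines (a simplified sketch; the endpoint and environment settings are placeholders and depend on how the Beam SDK harness is actually deployed, e.g. as a worker-pool sidecar on the task manager pods):

from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values: the REST endpoint is whatever service the native
# Kubernetes session deployment exposes, and the environment settings
# assume a Beam SDK worker pool running next to each task manager.
options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=flink-session-rest:8081",
    "--environment_type=EXTERNAL",
    "--environment_config=localhost:50000",
])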
My understanding is that to process a bigger dataset you would run the job with higher parallelism, so that the processing is spread across multiple task slots, which might live on multiple nodes. However, the Beam jobs in my case actually execute TensorFlow Extended (TFX) pipelines, so I have no control over partitioning by keys, and I see no difference in throughput (the time it takes to process a given dataset) between a parallelism of 2 and 4: it takes the same time.

Also, since the execution type is "PIPELINED": if I don't increase the parallelism and just run the job on a fixed number of task slots, can I still process a dataset of any size? Will the job fail (due to lack of memory on the task manager), or will it simply take longer to process the data?
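To make the parallelism part concrete, the only thing I change between runs is the parallelism flag, roughly like this (the tiny pipeline below is just a stand-in to illustrate the question, not my actual TFX code):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Same submission options as in the sketch above; only --parallelism
# changes between runs (2 vs 4) for the same input dataset.
options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=flink-session-rest:8081",  # placeholder endpoint
    "--parallelism=4",
])

# Toy stand-in for the TFX pipeline, just to show what I mean by
# "running the same job with different parallelism".
with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.Create(range(1000))
        | "Work" >> beam.Map(lambda x: x * x)
        | "Sum" >> beam.CombineGlobally(sum)
    )

Thanks,
Gorjan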