Hi!

I want to deploy a Flink cluster as a native Kubernetes session cluster, with
the intention of executing Apache Beam jobs that process only batch data, but
I am not sure I understand how I would scale the cluster when I need to
process large datasets.
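
For reference, this is roughly how I submit a Beam batch pipeline to the
session cluster - a minimal sketch only, where the JobManager address is a
placeholder from my setup and I am assuming the standard Python FlinkRunner
options (runner, flink_master, parallelism, environment_type):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: submit a Beam batch pipeline to an existing Flink session cluster.
options = PipelineOptions([
    "--runner=FlinkRunner",
    # REST endpoint of the JobManager service of the native Kubernetes
    # session cluster (placeholder address).
    "--flink_master=flink-session-jobmanager:8081",
    # Default parallelism for the whole job.
    "--parallelism=4",
    # SDK harness environment for the Python workers (assumed Docker-based).
    "--environment_type=DOCKER",
])

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.Create(["a", "b", "c"])      # stand-in for the real source
     | "Process" >> beam.Map(lambda x: x.upper())
     | "Write" >> beam.Map(print))                 # stand-in for the real sink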

My understanding is that to process a bigger dataset, you can run the job
with higher parallelism, so the processing is spread over multiple task
slots, which may in turn run on multiple nodes.
However, my Beam jobs actually execute TensorFlow Extended pipelines, so I
have no control over how the data is partitioned by key, and I see no
difference in throughput (the time it takes to process a given dataset)
between a parallelism of 2 and a parallelism of 4 - both take the same time.
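
To illustrate what I mean by not having control: in a hand-written Beam
pipeline I could at least force a redistribution of elements with a
Reshuffle, roughly as in the sketch below (the transform names and input are
only illustrative), but in the pipeline generated by TFX I cannot insert
such a step myself:

import apache_beam as beam

def expensive_step(element):
    # Placeholder for the actual per-element work.
    return element

with beam.Pipeline() as p:
    (p
     | "Read" >> beam.Create(range(1000))        # stand-in for the real input
     # Reshuffle prevents fusion and redistributes elements across the
     # available task slots before the expensive step runs.
     | "Redistribute" >> beam.Reshuffle()
     | "Process" >> beam.Map(expensive_step))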

Also, since the execution mode is "PIPELINED": if I want to process a
dataset of arbitrary size without increasing parallelism, i.e. just run the
job on a fixed number of task slots, will the job fail (due to lack of
memory on the task managers), or will it simply take longer to process the
data?

Thanks,
Gorjan
