Thanks Reuven, what I mean is setting the parallelism at the operator level. And the input size of an operator is unknown at compile time if it is not a source operator.
Here's an example from Flink:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#operator-level

Spark also supports setting operator-level parallelism (see groupByKey and reduceByKey):
https://spark.apache.org/docs/latest/rdd-programming-guide.html

On Sun, Apr 16, 2023 at 1:42 AM Reuven Lax via user <user@beam.apache.org> wrote:

> The maximum parallelism is always determined by the parallelism of your
> data. If you do a GroupByKey, for example, the number of keys in your data
> determines the maximum parallelism.
>
> Beyond the limitations in your data, it depends on your execution engine.
> If you're using Dataflow, Dataflow is designed to automatically determine
> the parallelism (e.g. work will be dynamically split and moved around
> between workers, the number of workers will autoscale, etc.), so there's no
> need to explicitly set the parallelism of the execution.
>
> On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang <zjf...@gmail.com> wrote:
>
>> Besides the global parallelism of a Beam job, is there any way to set
>> parallelism for individual operators like group by and join? I
>> understand that the parallelism setting depends on the underlying
>> execution engine, but it is very common to set parallelism for
>> operations like group by and join in both Spark and Flink.
>>
>> --
>> Best Regards
>>
>> Jeff Zhang

--
Best Regards

Jeff Zhang
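[Editor's note: for readers following the Flink link above, a minimal sketch of what operator-level parallelism looks like in Flink's DataStream API. This assumes a Flink dependency on the classpath; the job name and elements are illustrative only.]

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OperatorParallelismExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4); // job-level default parallelism

        DataStream<String> words = env.fromElements("a", "b", "a");

        words
            .map((MapFunction<String, String>) String::toUpperCase)
            .setParallelism(2)   // overrides the default for this operator only
            .print()
            .setParallelism(1);  // run the sink as a single instance

        env.execute("operator-level parallelism sketch");
    }
}
```

In Spark, the equivalent knob is the optional `numPartitions` argument on shuffle operators, e.g. `pairs.reduceByKey((a, b) -> a + b, 10)` produces an RDD with 10 partitions regardless of the default parallelism.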