Thanks Reuven, what I mean is setting the parallelism at the operator level. And the input size of an operator is unknown at compile time if it is not a source operator.
Here's an example from Flink:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#operator-level

Spark also supports setting operator-level parallelism (see groupByKey and reduceByKey):
https://spark.apache.org/docs/latest/rdd-programming-guide.html

On Sun, Apr 16, 2023 at 1:42 AM Reuven Lax via user <user@beam.apache.org> wrote:

> The maximum parallelism is always determined by the parallelism of your
> data. If you do a GroupByKey, for example, the number of keys in your data
> determines the maximum parallelism.
>
> Beyond the limitations in your data, it depends on your execution engine.
> If you're using Dataflow, Dataflow is designed to automatically determine
> the parallelism (e.g. work will be dynamically split and moved around
> between workers, the number of workers will autoscale, etc.), so there's no
> need to explicitly set the parallelism of the execution.
>
> On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang <zjf...@gmail.com> wrote:
>
>> Besides the global parallelism of a Beam job, is there any way to set
>> parallelism for individual operators like group by and join? I
>> understand that the parallelism setting depends on the underlying
>> execution engine, but it is very common to set parallelism for
>> operations like group by and join in both Spark and Flink.
>>
>> --
>> Best Regards
>>
>> Jeff Zhang

--
Best Regards

Jeff Zhang
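[Editor's note: for readers following the Flink link above, a minimal sketch of what operator-level parallelism looks like in Flink's DataStream API. This assumes a Flink dependency on the classpath; the job name and elements are illustrative only.]

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OperatorParallelismExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4); // job-level default parallelism

        DataStream<String> words = env.fromElements("a", "b", "a");

        words
            .map((MapFunction<String, String>) String::toUpperCase)
            .setParallelism(2)   // overrides the default for this operator only
            .print()
            .setParallelism(1);  // run the sink as a single instance

        env.execute("operator-level parallelism sketch");
    }
}
```

In Spark, the equivalent knob is the optional `numPartitions` argument on shuffle operators, e.g. `pairs.reduceByKey((a, b) -> a + b, 10)` produces an RDD with 10 partitions regardless of the default parallelism.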