Same need here, using the Flink runner. We are processing a PCollection (extracting features per element), then combining these into groups of features and running the next operator on those groups. Each group contains ~50 elements, so the parallelism of the operator upstream of the groupby should be higher than that of the downstream operator to keep the two balanced.
On Tue, Apr 18, 2023 at 19:17 Jeff Zhang <zjf...@gmail.com> wrote:
Hi Reuven,
It would be better to set parallelism per operator. As I mentioned before, there may be multiple groupby and join operators in one pipeline, and their parallelism can differ because their input data sizes differ.
On Wed, Apr 19, 2023 at 3:59 AM Reuven Lax <re...@google.com> wrote:
Jeff - does setting the global default work for you, or do you need per-operator control? Seems like it would be easy to add this to ResourceHints.
On Tue, Apr 18, 2023 at 12:35 PM Robert Bradshaw
<rober...@google.com> wrote:
Yeah, I don't think we have a good
per-operator API for this. If we were to add
it, it probably belongs in ResourceHints.
On Sun, Apr 16, 2023 at 11:28 PM Reuven Lax <re...@google.com> wrote:
Looking at FlinkPipelineOptions, there is a parallelism option you can set. I believe this sets the default parallelism for all Flink operators.
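For concreteness, a sketch of how that option might be passed when launching a pipeline on the Flink runner (the pipeline file and master address are placeholders; --parallelism is the FlinkPipelineOptions flag in question):

```shell
# Sketch: set the default parallelism for every Flink operator in the job.
# my_pipeline.py and localhost:8081 are placeholders for your setup.
python my_pipeline.py \
  --runner=FlinkRunner \
  --flink_master=localhost:8081 \
  --parallelism=8
```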
On Sun, Apr 16, 2023 at 7:20 PM Jeff Zhang <zjf...@gmail.com> wrote:
Thanks Holden, this would work for Spark, but Flink doesn't have such a mechanism, so I am looking for a general solution on the Beam side.
On Mon, Apr 17, 2023 at 10:08 AM Holden Karau <hol...@pigscanfly.ca> wrote:
To a (small) degree Spark's "new" AQE might be able to help, depending on what kind of operations Beam is compiling the pipeline down to.
Have you tried setting spark.sql.adaptive.enabled & spark.sql.adaptive.coalescePartitions.enabled?
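For reference, a sketch of how those two settings could be supplied at submit time (the job file is a placeholder; the configuration keys are the standard Spark 3.x adaptive query execution settings):

```shell
# Sketch: enable AQE so Spark can coalesce shuffle partitions at runtime
# based on actual data sizes. my_job.py is a placeholder.
spark-submit \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.coalescePartitions.enabled=true \
  my_job.py
```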
On Mon, Apr 17, 2023 at 10:34 AM Reuven Lax via user <u...@beam.apache.org> wrote:
I see. Robert - what is the story for parallelism controls on GBK with the Spark or Flink runners?
On Sun, Apr 16, 2023 at 6:24 PM Jeff Zhang <zjf...@gmail.com> wrote:
No, I don't use Dataflow, I use Spark & Flink.
On Mon, Apr 17, 2023 at 8:08 AM Reuven Lax <re...@google.com> wrote:
Are you running on the Dataflow runner? If so, Dataflow - unlike Spark and Flink - dynamically modifies the parallelism as the operator runs, so there is no need to have such controls. In fact, these specific controls wouldn't make much sense for the way Dataflow implements these operators.
On Sun, Apr 16, 2023 at 12:25 AM Jeff Zhang <zjf...@gmail.com> wrote:
Just for performance tuning, like in Spark and Flink.
On Sun, Apr 16, 2023 at 1:10 PM Robert Bradshaw via user <u...@beam.apache.org> wrote:
What are you trying to achieve by setting the parallelism?
On Sat, Apr 15, 2023 at 5:13 PM Jeff Zhang <zjf...@gmail.com> wrote:
Thanks Reuven, what I mean is to set the parallelism at the operator level. And the input size of an operator is unknown at compile time if it is not a source operator. Here's an example from Flink:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/parallel/#operator-level
Spark also supports setting operator-level parallelism (see groupByKey and reduceByKey):
https://spark.apache.org/docs/latest/rdd-programming-guide.html
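To illustrate what the numPartitions argument to Spark's groupByKey controls, here is a hypothetical pure-Python sketch (partition_for and group_by_key are illustrative names, not Spark APIs): keys are hashed into a fixed number of buckets, so the partition count bounds the post-shuffle parallelism.

```python
def partition_for(key, num_partitions):
    # Mimics Spark's default HashPartitioner: hash(key) mod numPartitions.
    return hash(key) % num_partitions

def group_by_key(pairs, num_partitions):
    # One dict per partition; every value for a given key lands in the
    # same partition, so num_partitions caps the downstream parallelism.
    partitions = [{} for _ in range(num_partitions)]
    for key, value in pairs:
        p = partition_for(key, num_partitions)
        partitions[p].setdefault(key, []).append(value)
    return partitions

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = group_by_key(pairs, num_partitions=2)
```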
On Sun, Apr 16, 2023 at 1:42 AM Reuven Lax via user <u...@beam.apache.org> wrote:
The maximum parallelism is always determined by the parallelism of your data. If you do a GroupByKey, for example, the number of keys in your data determines the maximum parallelism. Beyond the limitations in your data, it depends on your execution engine. If you're using Dataflow, Dataflow is designed to automatically determine the parallelism (e.g. work will be dynamically split and moved around between workers, the number of workers will autoscale, etc.), so there's no need to explicitly set the parallelism of the execution.
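The point about keys bounding parallelism can be sketched in plain Python (a sketch, not runner code): after a GroupByKey there is one work item per distinct key, so no more than that many tasks can usefully run in parallel, no matter how many workers are available.

```python
from collections import defaultdict

records = [("user1", 10), ("user2", 20), ("user1", 30), ("user3", 40)]

# Group all values by key, as a GroupByKey would.
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# One group per distinct key: 3 distinct keys -> at most 3 parallel
# work items downstream, regardless of worker count.
max_useful_parallelism = len(groups)
```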
On Sat, Apr 15, 2023 at 1:12 AM Jeff Zhang <zjf...@gmail.com> wrote:
Besides the global parallelism of a Beam job, is there any way to set the parallelism of individual operators like group by and join? I understand that the parallelism setting depends on the underlying execution engine, but it is very common to set the parallelism of operators like group by and join in both Spark & Flink.
--
Best Regards
Jeff Zhang
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau