Re: Using resource hints or annotations for transform expansion

Danny McCormick via dev Tue, 14 Jan 2025 07:36:40 -0800

In my opinion, what you are describing fits the intention/current behavior
of resource hints. Resource hints are just hints which allow the runner to
optimize the execution environment where possible, so it should be legal
for any runner to ignore any hints; as long as we're maintaining that
behavior, I think it is ok.

> Should we introduce some runner-specific way of creating hints applicable
> only to a specific runner?

IMO this just makes the pipeline less portable and doesn't really do
anything to make switching runners easier. Ideally I could have a pipeline
with a set of hints, some of which apply to only Spark, some of which apply
to only Flink, and some of which apply only to Dataflow, and the pipeline
should be fully portable across those environments without making
modifications. Your use case fits this paradigm well since running
input.apply(GroupByKey.create().setResourceHints(ResourceHints.huge())) on
any non-Spark runner should still work fine (assuming the runner's default
GBK implementation does not need the values to fit in memory).
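
To make the "hints are ignorable" contract concrete, here is a toy sketch in
plain Java. It is not Beam code and does not use the Beam API: HintSketch,
ToyRunner, the hint key, and the strategy names are all made up for
illustration. One runner understands the hint and switches to a sort-based
grouping, another has never heard of it and keeps its default, and both
produce the same grouped result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy model only: every name in this file is hypothetical, none is Beam API. */
class HintSketch {

  /** Hypothetical hint key a user might attach to a grouping step. */
  static final String LARGE_GROUPS = "example:hints:large_groups";

  /** A toy runner that knows some hint keys and silently ignores the rest. */
  static final class ToyRunner {
    final String name;
    final Set<String> understoodHints;

    ToyRunner(String name, Set<String> understoodHints) {
      this.name = name;
      this.understoodHints = understoodHints;
    }

    /** Picks a strategy; a hint the runner does not understand changes nothing. */
    String chooseGrouping(Set<String> hints) {
      boolean large =
          understoodHints.contains(LARGE_GROUPS) && hints.contains(LARGE_GROUPS);
      return large ? "sort-based" : "default";
    }
  }

  /** Default strategy: accumulate every value per key in a map. */
  static Map<String, List<Integer>> groupInMemory(List<Map.Entry<String, Integer>> input) {
    Map<String, List<Integer>> out = new TreeMap<>();
    for (Map.Entry<String, Integer> e : input) {
      out.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
    }
    return out;
  }

  /** Alternative strategy: sort by key, then emit each run of equal keys. */
  static Map<String, List<Integer>> groupBySorting(List<Map.Entry<String, Integer>> input) {
    List<Map.Entry<String, Integer>> sorted = new ArrayList<>(input);
    sorted.sort(Map.Entry.comparingByKey()); // List.sort is stable
    Map<String, List<Integer>> out = new TreeMap<>();
    String currentKey = null;
    List<Integer> run = null;
    for (Map.Entry<String, Integer> e : sorted) {
      if (!e.getKey().equals(currentKey)) {
        currentKey = e.getKey();
        run = new ArrayList<>();
        out.put(currentKey, run);
      }
      run.add(e.getValue());
    }
    return out;
  }

  /** Runs the grouping with whichever strategy the runner selected. */
  static Map<String, List<Integer>> run(
      List<Map.Entry<String, Integer>> input, String strategy) {
    return strategy.equals("sort-based") ? groupBySorting(input) : groupInMemory(input);
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> input =
        List.of(Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3));
    Set<String> hints = Set.of(LARGE_GROUPS);

    ToyRunner sparkLike = new ToyRunner("spark-like", Set.of(LARGE_GROUPS));
    ToyRunner other = new ToyRunner("other", Set.of());

    for (ToyRunner r : List.of(sparkLike, other)) {
      String strategy = r.chooseGrouping(hints);
      System.out.println(r.name + " -> " + strategy + " -> " + run(input, strategy));
    }
  }
}
```

The hint only changes how the grouping is executed, never what it returns,
which is why the same pipeline stays portable across both toy runners.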

It would, however, be nice to at least have a matrix where we document
which resource hints impact which runners.

Thanks,
Danny

On Tue, Jan 14, 2025 at 6:02 AM Jan Lukavský <[email protected]> wrote:

> Hi,
>
> as part of reviewing [1], I came across a question, which might be
> solved using resource hints. Seems the usage of these hints is currently
> limited, though. I'll explain the case in a few points:
>
>   a) a generic implementation of GBK on Spark assumes that all values
> fit into memory
>
>   b) this can be changed to an implementation which uses Spark's internal
> sorting mechanism to group by key and window without the need for the
> data to fit into memory
>
>   c) this optimization can be more expensive for cases where a) is
> sufficient
>
> There is currently no simple way of knowing whether a GBK fits into memory
> or not. This could be solved using ResourceHints, e.g.:
>
> input.apply(GroupByKey.create().setResourceHints(ResourceHints.huge()))
>
> The expansion could then pick only the appropriate transforms, but it
> requires changing the generic ResourceHints class. Is this intentional
> and the expected approach? We can create pipeline-level hints, but this
> seems not correct in this situation. Should we introduce some
> runner-specific way of creating hints applicable only to a specific runner?
>
> An alternative option seems to be the somewhat similar concept of
> "annotations", which seems to have been introduced and is currently used
> only for error handlers.
>
> Thanks for any opinions!
>   Jan
>
> [1] https://github.com/apache/beam/pull/33521
>
>
