Hi,
while reviewing [1], I ran into a problem that might be solvable using
resource hints. The current usage of these hints seems rather limited,
though. I'll explain the case in a few points:
a) the generic implementation of GBK on Spark assumes that all values
for a key fit into memory
b) this can be changed to an implementation that uses Spark's internal
sorting mechanism to group by key and window without requiring the data
to fit into memory (a rough sketch of both approaches follows below)
c) the sort-based variant can be more expensive in cases where a) is
sufficient
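To make the difference concrete, here is a minimal sketch of the two
strategies on a plain JavaPairRDD. This is not the Spark runner's actual
code; the class and method names below are made up for illustration:

import java.util.Comparator;
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;

class GbkStrategies {

  // (a) generic grouping: all values of a key are materialized into a single
  // Iterable, so a hot key whose values do not fit on the heap will fail
  static <K, V> JavaPairRDD<K, Iterable<V>> inMemory(JavaPairRDD<K, V> input) {
    return input.groupByKey();
  }

  // (b) sort-based grouping: repartition by key and sort within partitions, so
  // records of the same key arrive consecutively and a downstream mapPartitions
  // can stream them without holding a whole group in memory; the price is a
  // full sort even when every group is small, which is the concern in c)
  static <K extends Comparable<K>, V> JavaPairRDD<K, V> sorted(
      JavaPairRDD<K, V> input, int parallelism) {
    return input.repartitionAndSortWithinPartitions(
        new HashPartitioner(parallelism), Comparator.<K>naturalOrder());
  }
}

Variant b) is what I mean by "Spark's internal sorting mechanism"; the
question is how to decide, per transform, which of the two to expand to.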
There is currently no simple way of knowing whether a particular GBK fits
into memory or not. This could be solved using ResourceHints, e.g.:
input.apply(GroupByKey.create().setResourceHints(ResourceHints.huge()))
The expansion could then pick only the appropriate transforms, but this
requires changing the generic ResourceHints class. Is that intentional
and the expected approach? We can create pipeline-level hints, but that
does not seem correct in this situation. Should we introduce some
runner-specific way of creating hints that apply only to a particular runner?
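One workaround I can think of (purely a sketch; the urn, the hint class
and the translator-side check are all made up) would be to attach a
runner-specific hint through the existing generic withHint(...) API
instead of extending ResourceHints itself:

import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.resourcehints.ResourceHint;
import org.apache.beam.sdk.transforms.resourcehints.ResourceHints;

class HugeGbkHint {

  // hypothetical urn, not defined anywhere in Beam today
  static final String URN = "beam:resources:spark:group_may_not_fit_in_memory:v1";

  // minimal custom hint; toBytes() is the payload that ends up in the pipeline proto
  static class GroupMayNotFitInMemory extends ResourceHint {
    @Override
    public byte[] toBytes() {
      return "true".getBytes(StandardCharsets.UTF_8);
    }
  }

  static <K, V> GroupByKey<K, V> hugeGbk() {
    GroupByKey<K, V> gbk = GroupByKey.create();
    gbk.setResourceHints(
        ResourceHints.create().withHint(URN, new GroupMayNotFitInMemory()));
    return gbk;
  }
}

// a (hypothetical) Spark GBK translator could then branch on
//   transform.getResourceHints().hints().containsKey(HugeGbkHint.URN)
// to choose between the in-memory and the sort-based expansion.

Even if that is viable, it still leaks a runner-specific concern into user
code, which is why I'm asking whether there should be a more structured
mechanism.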
An alternative option seems to be the somewhat similar concept of
"annotations", which appears to have been introduced for, and is currently
used only by, error handlers.
Thanks for any opinions!
Jan
[1] https://github.com/apache/beam/pull/33521