Hi,
while reviewing [1], I ran into a problem that might be solvable using
resource hints. The current usage of these hints seems rather limited,
though. I'll explain the case in a few points:
a) the generic implementation of GBK on Spark assumes that all values
for a key fit into memory
b) this can be changed to an implementation that uses Spark's internal
sorting mechanism to group by key and window without requiring the data
to fit into memory (a rough sketch of both approaches follows below)
c) the sort-based variant can be more expensive in cases where a) is
sufficient
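To make the difference concrete, here is a minimal sketch of the two
strategies on a plain JavaPairRDD. This is not the Spark runner's actual
code; the class and method names below are made up for illustration:

import java.util.Comparator;
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;

class GbkStrategies {

  // (a) generic grouping: all values of a key are materialized into a single
  // Iterable, so a hot key whose values do not fit on the heap will fail
  static <K, V> JavaPairRDD<K, Iterable<V>> inMemory(JavaPairRDD<K, V> input) {
    return input.groupByKey();
  }

  // (b) sort-based grouping: repartition by key and sort within partitions, so
  // records of the same key arrive consecutively and a downstream mapPartitions
  // can stream them without holding a whole group in memory; the price is a
  // full sort even when every group is small, which is the concern in c)
  static <K extends Comparable<K>, V> JavaPairRDD<K, V> sorted(
      JavaPairRDD<K, V> input, int parallelism) {
    return input.repartitionAndSortWithinPartitions(
        new HashPartitioner(parallelism), Comparator.<K>naturalOrder());
  }
}

Variant b) is what I mean by "Spark's internal sorting mechanism"; the
question is how to decide, per transform, which of the two to expand to.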
There is currently no simple way of knowing whether a particular GBK fits
into memory or not. This could be solved using ResourceHints, e.g.:
input.apply(GroupByKey.create().setResourceHints(ResourceHints.huge()))
The expansion could then pick only the appropriate transforms, but this
requires changing the generic ResourceHints class. Is that intentional
and the expected approach? We can create pipeline-level hints, but that
does not seem correct in this situation. Should we introduce some
runner-specific way of creating hints that apply only to a particular runner?
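One workaround I can think of (purely a sketch; the urn, the hint class
and the translator-side check are all made up) would be to attach a
runner-specific hint through the existing generic withHint(...) API
instead of extending ResourceHints itself:

import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.resourcehints.ResourceHint;
import org.apache.beam.sdk.transforms.resourcehints.ResourceHints;

class HugeGbkHint {

  // hypothetical urn, not defined anywhere in Beam today
  static final String URN = "beam:resources:spark:group_may_not_fit_in_memory:v1";

  // minimal custom hint; toBytes() is the payload that ends up in the pipeline proto
  static class GroupMayNotFitInMemory extends ResourceHint {
    @Override
    public byte[] toBytes() {
      return "true".getBytes(StandardCharsets.UTF_8);
    }
  }

  static <K, V> GroupByKey<K, V> hugeGbk() {
    GroupByKey<K, V> gbk = GroupByKey.create();
    gbk.setResourceHints(
        ResourceHints.create().withHint(URN, new GroupMayNotFitInMemory()));
    return gbk;
  }
}

// a (hypothetical) Spark GBK translator could then branch on
//   transform.getResourceHints().hints().containsKey(HugeGbkHint.URN)
// to choose between the in-memory and the sort-based expansion.

Even if that is viable, it still leaks a runner-specific concern into user
code, which is why I'm asking whether there should be a more structured
mechanism.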
An alternative option seems to be the somewhat similar concept of
"annotations", which appears to have been introduced for, and is currently
used only by, error handlers.
Thanks for any opinions!
Jan
[1] https://github.com/apache/beam/pull/33521