Hi,

as part of reviewing [1], I came across a question that might be solvable using resource hints. It seems the usage of these hints is currently quite limited, though. I'll explain the case in a few points:

 a) a generic implementation of GBK on Spark assumes that all values for a single key (and window) fit into memory

 b) this can be changed to an implementation that uses Spark's internal sorting mechanism to group by key and window, without the need for the data to fit into memory (a rough sketch of the difference follows below)

 c) this sort-based implementation can be more expensive in cases where a) is sufficient
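
To illustrate the difference between a) and b) in plain Spark terms (a rough sketch only, not the actual runner code; pairs stands for an existing JavaPairRDD<String, Long> and the partitioner choice is arbitrary):

  import org.apache.spark.HashPartitioner;
  import org.apache.spark.api.java.JavaPairRDD;

  // a) in-memory grouping: all values of a key are materialized as one Iterable
  JavaPairRDD<String, Iterable<Long>> grouped = pairs.groupByKey();

  // b) sort-based grouping: values are sorted within partitions, so the values
  // of a key can be iterated without holding the whole group in memory
  JavaPairRDD<String, Long> sorted =
      pairs.repartitionAndSortWithinPartitions(new HashPartitioner(16));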

There is currently no simple way of knowing whether the data in a GBK fits into memory or not. This could be solved using ResourceHints, e.g.:

  input.apply(GroupByKey.create().setResourceHints(ResourceHints.huge()))

The expansion could then pick only the appropriate transforms, but adding such a hint requires changing the generic ResourceHints class. Is this intentional and the expected approach? We can create pipeline-level hints, but that does not seem right in this situation. Should we introduce some runner-specific way of creating hints applicable only to a specific runner?
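
To be concrete about "pick only the appropriate transforms", the runner-side check during translation could look roughly as follows (a hypothetical sketch; the urn is made up, I'm only relying on the existing PTransform#getResourceHints and ResourceHints#hints accessors, and transform stands for the transform being translated):

  // hypothetical urn for the proposed hint
  String hugeGbkUrn = "beam:resources:huge_group:v1";

  ResourceHints hints = transform.getResourceHints();
  if (hints.hints().containsKey(hugeGbkUrn)) {
    // translate the GBK using the sort-based, non-in-memory implementation
  } else {
    // translate the GBK using the default in-memory implementation
  }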

An alternative option seems to be the somewhat similar concept of "annotations", which appears to have been introduced for, and is currently used only by, error handlers.
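
For completeness, the annotation-based alternative I have in mind would be something along these lines (purely illustrative; withAnnotation and the key below are made up, I have not checked what the actual user-facing API for attaching annotations looks like):

  input.apply(
      GroupByKey.<String, Long>create()
          // made-up method and key, mirroring the ResourceHints example above
          .withAnnotation("spark:requires_sorted_gbk", new byte[0]));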

Thanks for any opinions!
 Jan

[1] https://github.com/apache/beam/pull/33521
