On Tue, Jan 21, 2025 at 4:51 PM Robert Bradshaw via dev <dev@beam.apache.org>
wrote:

> On Tue, Jan 21, 2025 at 7:26 AM Kenneth Knowles <k...@apache.org> wrote:
> >
> > On Tue, Jan 21, 2025 at 2:35 AM Jan Lukavský <je...@seznam.cz> wrote:
> >>
> >>
> >> On 1/20/25 18:18, Kenneth Knowles wrote:
> >>
> >> This all sounds good. I will add my standard comment that this hint is
> a property of the data, not the pipeline logic. So it is a different type
> of hint than key invariance and fanout ratio).
> >>
> >> This is not a problem for the proposed approach, in my opinion.
> Obviously, almost always there will be some pipeline code that is written
> specifically for the data in mind.
> >>
> >> A couple other examples of "hints" that you can keep in mind are
> Combine.withHotkeyFanout, Redistribute (both variants), and
> GroupByKey.fewKeys. These were chosen to be expressed as transforms, even
> though they are more like hints. I bring up these examples to say that we
> don't have to be too pedantic here, because it is already too late :-)
> >>
> >> And anyhow a runner is always allowed to implement any piece of a
> pipeline with anything that has the same "behavior", whether or not it is
> expressed as a hint or some other way (that's the whole point of Beam, and
> how we have fusion, combiner lifting, flatten unzipping, multiple runners,
> etc).
> >>
> >> It is probably less appropriate for reusable transforms that are
> expected to be used in more than one context. That can be up to
> transform/pipeline authors.
> >>
> >> And to bring it back around and connect to the above: having an API
> like GroupByKey.biggerThanMemory() as an API choice is just as fine with me
> as GroupByKey.fewKeys() and it can just be a composite that adds the
> hint/annotation to the primitive node. No need to combine API design with
> model design / no need to force users to express things in terms of the
> lowest level parts of the model.
> >>
> >> I was thinking about that as well. But there is a problem. The GBK is
> often part of some other transform (e.g. FileIO, but can be any other). We
> need a way to (optionally) change the behavior of a transform that is part
> of some outer composite. Therefore this should work for
> >>
> >> FileIO.write(...).addAnnotation(GroupByKey.HUGE)
> >
> > It is a very good point that library transforms should probably not be
> annotated but they do need to be adjusted when executed. FWIW this is also
> why windowing strategy is on PCollection and automatically propagated.
> >
> > But also another good example: FileIO has GBKs that are small even if
> the data incoming is huge. In the analogy with windowing strategy, the
> library transform has to own the re-windowing / re-sizing.
> >
> > So maybe PCollection.addAnnotation(SizeEstimate.HUGE) could make more
> sense.
>
> This doesn't solve the problem, as the operation you're trying to
> modify may be entirely internal to the composite. (Unless we have
> annotations that get attached to inputs and "follow" through like
> windowing, with operations that can add/remove/modify these
> annotations.)
>
> Being able to annotate a composite and having it apply (per runner
> semantics) to all subtransforms doesn't seem too bad. If you really
> need to have part of the transform be executed one way, and part
> another, that feels like you need to break apart (re-implement) the
> transform itself.
>

Or a transform (GBK in this case) should be able to perform a call to get
the aggregated set of annotations of composites that surround it. That way
you can just add the annotation to the outermost transform (FileIO in this
case) and the GBK should be able to change the behavior based on that.

- Cham


>
> > But then it starts to look like a lot of manual propagation of
> annotations (if we don't make it default) or a lot of manually undoing
> annotations (if we do make it default).
>
> And that too.
>

Reply via email to