On Tue, Jan 21, 2025 at 7:26 AM Kenneth Knowles <k...@apache.org> wrote:
>
> On Tue, Jan 21, 2025 at 2:35 AM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>
>> On 1/20/25 18:18, Kenneth Knowles wrote:
>>
>> This all sounds good. I will add my standard comment that this hint is a 
>> property of the data, not the pipeline logic. So it is a different type of 
>> hint than key invariance and fanout ratio).
>>
>> This is not a problem for the proposed approach, in my opinion.  Obviously, 
>> almost always there will be some pipeline code that is written specifically 
>> for the data in mind.
>>
>> A couple other examples of "hints" that you can keep in mind are 
>> Combine.withHotkeyFanout, Redistribute (both variants), and 
>> GroupByKey.fewKeys. These were chosen to be expressed as transforms, even 
>> though they are more like hints. I bring up these examples to say that we 
>> don't have to be too pedantic here, because it is already too late :-)
>>
>> And anyhow a runner is always allowed to implement any piece of a pipeline 
>> with anything that has the same "behavior", whether or not it is expressed 
>> as a hint or some other way (that's the whole point of Beam, and how we have 
>> fusion, combiner lifting, flatten unzipping, multiple runners, etc).
>>
>> It is probably less appropriate for reusable transforms that are expected to 
>> be used in more than one context. That can be up to transform/pipeline 
>> authors.
>>
>> And to bring it back around and connect to the above: having an API like 
>> GroupByKey.biggerThanMemory() as an API choice is just as fine with me as 
>> GroupByKey.fewKeys() and it can just be a composite that adds the 
>> hint/annotation to the primitive node. No need to combine API design with 
>> model design / no need to force users to express things in terms of the 
>> lowest level parts of the model.
>>
>> I was thinking about that as well. But there is a problem. The GBK is often 
>> part of some other transform (e.g. FileIO, but can be any other). We need a 
>> way to (optionally) change the behavior of a transform that is part of some 
>> outer composite. Therefore this should work for
>>
>> FileIO.write(...).addAnnotation(GroupByKey.HUGE)
>
> It is a very good point that library transforms should probably not be 
> annotated but they do need to be adjusted when executed. FWIW this is also 
> why windowing strategy is on PCollection and automatically propagated.
>
> But also another good example: FileIO has GBKs that are small even if the 
> data incoming is huge. In the analogy with windowing strategy, the library 
> transform has to own the re-windowing / re-sizing.
>
> So maybe PCollection.addAnnotation(SizeEstimate.HUGE) could make more sense.

This doesn't solve the problem, as the operation you're trying to
modify may be entirely internal to the composite. (Unless we have
annotations that get attached to inputs and "follow" through like
windowing, with operations that can add/remove/modify these
annotations.)

Being able to annotate a composite and having it apply (per runner
semantics) to all subtransforms doesn't seem too bad. If you really
need to have part of the transform be executed one way, and part
another, that feels like you need to break apart (re-implement) the
transform itself.

> But then it starts to look like a lot of manual propagation of annotations 
> (if we don't make it default) or a lot of manually undoing annotations (if we 
> do make it default).

And that too.

Reply via email to