On Thu, Jan 23, 2025 at 10:39 AM Kenneth Knowles <k...@apache.org> wrote:
>
> On Tue, Jan 21, 2025 at 7:51 PM Robert Bradshaw via dev <dev@beam.apache.org> 
> wrote:
>>
>> On Tue, Jan 21, 2025 at 7:26 AM Kenneth Knowles <k...@apache.org> wrote:
>> >
>> > On Tue, Jan 21, 2025 at 2:35 AM Jan Lukavský <je...@seznam.cz> wrote:
>> >>
>> >>
>> >> On 1/20/25 18:18, Kenneth Knowles wrote:
>> >>
>> >> This all sounds good. I will add my standard comment that this hint is a 
>> >> property of the data, not the pipeline logic. So it is a different type 
>> >> of hint than key invariance and fanout ratio).
>> >>
>> >> This is not a problem for the proposed approach, in my opinion.  
>> >> Obviously, almost always there will be some pipeline code that is written 
>> >> specifically for the data in mind.
>> >>
>> >> A couple other examples of "hints" that you can keep in mind are 
>> >> Combine.withHotkeyFanout, Redistribute (both variants), and 
>> >> GroupByKey.fewKeys. These were chosen to be expressed as transforms, even 
>> >> though they are more like hints. I bring up these examples to say that we 
>> >> don't have to be too pedantic here, because it is already too late :-)
>> >>
>> >> And anyhow a runner is always allowed to implement any piece of a 
>> >> pipeline with anything that has the same "behavior", whether or not it is 
>> >> expressed as a hint or some other way (that's the whole point of Beam, 
>> >> and how we have fusion, combiner lifting, flatten unzipping, multiple 
>> >> runners, etc).
>> >>
>> >> It is probably less appropriate for reusable transforms that are expected 
>> >> to be used in more than one context. That can be up to transform/pipeline 
>> >> authors.
>> >>
>> >> And to bring it back around and connect to the above: having an API like 
>> >> GroupByKey.biggerThanMemory() as an API choice is just as fine with me as 
>> >> GroupByKey.fewKeys() and it can just be a composite that adds the 
>> >> hint/annotation to the primitive node. No need to combine API design with 
>> >> model design / no need to force users to express things in terms of the 
>> >> lowest level parts of the model.
>> >>
>> >> I was thinking about that as well. But there is a problem. The GBK is 
>> >> often part of some other transform (e.g. FileIO, but can be any other). 
>> >> We need a way to (optionally) change the behavior of a transform that is 
>> >> part of some outer composite. Therefore this should work for
>> >>
>> >> FileIO.write(...).addAnnotation(GroupByKey.HUGE)
>> >
>> > It is a very good point that library transforms should probably not be 
>> > annotated but they do need to be adjusted when executed. FWIW this is also 
>> > why windowing strategy is on PCollection and automatically propagated.
>> >
>> > But also another good example: FileIO has GBKs that are small even if the 
>> > data incoming is huge. In the analogy with windowing strategy, the library 
>> > transform has to own the re-windowing / re-sizing.
>> >
>> > So maybe PCollection.addAnnotation(SizeEstimate.HUGE) could make more 
>> > sense.
>>
>> This doesn't solve the problem, as the operation you're trying to
>> modify may be entirely internal to the composite. (Unless we have
>> annotations that get attached to inputs and "follow" through like
>> windowing, with operations that can add/remove/modify these
>> annotations.)
>>
>>
>>
>> Being able to annotate a composite and having it apply (per runner
>> semantics) to all subtransforms doesn't seem too bad. If you really
>> need to have part of the transform be executed one way, and part
>> another, that feels like you need to break apart (re-implement) the
>> transform itself.
>
>
> Are you saying that the overall solution is that the top-level pipeline 
> author will put the "huge" annotation on whatever level they have authored 
> and it'll be carried in and make all GBK's within have the hint? My main 
> point, other than the fact that size is a property of data not transforms, is 
> that most often "you" will not have authored any transform or GBK. You will 
> be stringing together pre-written transforms that are intended for use on 
> small, medium, large, and huge data. Could be fine - I suspect it ends up 
> being a bit of a whole pipeline (or very large fraction thereof) flag in 
> practice. Which might be fine, since so many pipelines are just a few stages.

I think that's less fragile than saying "inside this transform, you'll
find a Composite/Path/GBK, execute that in this way" and definitely
preferable to "I'll just make my own copy of Composite so I can stick
the annotation in the right place" but still more flexible than making
it all-or-none for your entire pipeline. Obviously the ideal is for
runners to not need these kinds of hints, but sometimes that's not the
case.

Reply via email to