On Thu, Jan 23, 2025 at 10:39 AM Kenneth Knowles <k...@apache.org> wrote: > > On Tue, Jan 21, 2025 at 7:51 PM Robert Bradshaw via dev <dev@beam.apache.org> > wrote: >> >> On Tue, Jan 21, 2025 at 7:26 AM Kenneth Knowles <k...@apache.org> wrote: >> > >> > On Tue, Jan 21, 2025 at 2:35 AM Jan Lukavský <je...@seznam.cz> wrote: >> >> >> >> >> >> On 1/20/25 18:18, Kenneth Knowles wrote: >> >> >> >> This all sounds good. I will add my standard comment that this hint is a >> >> property of the data, not the pipeline logic. So it is a different type >> >> of hint than key invariance and fanout ratio). >> >> >> >> This is not a problem for the proposed approach, in my opinion. >> >> Obviously, almost always there will be some pipeline code that is written >> >> specifically for the data in mind. >> >> >> >> A couple other examples of "hints" that you can keep in mind are >> >> Combine.withHotkeyFanout, Redistribute (both variants), and >> >> GroupByKey.fewKeys. These were chosen to be expressed as transforms, even >> >> though they are more like hints. I bring up these examples to say that we >> >> don't have to be too pedantic here, because it is already too late :-) >> >> >> >> And anyhow a runner is always allowed to implement any piece of a >> >> pipeline with anything that has the same "behavior", whether or not it is >> >> expressed as a hint or some other way (that's the whole point of Beam, >> >> and how we have fusion, combiner lifting, flatten unzipping, multiple >> >> runners, etc). >> >> >> >> It is probably less appropriate for reusable transforms that are expected >> >> to be used in more than one context. That can be up to transform/pipeline >> >> authors. >> >> >> >> And to bring it back around and connect to the above: having an API like >> >> GroupByKey.biggerThanMemory() as an API choice is just as fine with me as >> >> GroupByKey.fewKeys() and it can just be a composite that adds the >> >> hint/annotation to the primitive node. No need to combine API design with >> >> model design / no need to force users to express things in terms of the >> >> lowest level parts of the model. >> >> >> >> I was thinking about that as well. But there is a problem. The GBK is >> >> often part of some other transform (e.g. FileIO, but can be any other). >> >> We need a way to (optionally) change the behavior of a transform that is >> >> part of some outer composite. Therefore this should work for >> >> >> >> FileIO.write(...).addAnnotation(GroupByKey.HUGE) >> > >> > It is a very good point that library transforms should probably not be >> > annotated but they do need to be adjusted when executed. FWIW this is also >> > why windowing strategy is on PCollection and automatically propagated. >> > >> > But also another good example: FileIO has GBKs that are small even if the >> > data incoming is huge. In the analogy with windowing strategy, the library >> > transform has to own the re-windowing / re-sizing. >> > >> > So maybe PCollection.addAnnotation(SizeEstimate.HUGE) could make more >> > sense. >> >> This doesn't solve the problem, as the operation you're trying to >> modify may be entirely internal to the composite. (Unless we have >> annotations that get attached to inputs and "follow" through like >> windowing, with operations that can add/remove/modify these >> annotations.) >> >> >> >> Being able to annotate a composite and having it apply (per runner >> semantics) to all subtransforms doesn't seem too bad. If you really >> need to have part of the transform be executed one way, and part >> another, that feels like you need to break apart (re-implement) the >> transform itself. > > > Are you saying that the overall solution is that the top-level pipeline > author will put the "huge" annotation on whatever level they have authored > and it'll be carried in and make all GBK's within have the hint? My main > point, other than the fact that size is a property of data not transforms, is > that most often "you" will not have authored any transform or GBK. You will > be stringing together pre-written transforms that are intended for use on > small, medium, large, and huge data. Could be fine - I suspect it ends up > being a bit of a whole pipeline (or very large fraction thereof) flag in > practice. Which might be fine, since so many pipelines are just a few stages.
I think that's less fragile than saying "inside this transform, you'll find a Composite/Path/GBK, execute that in this way" and definitely preferable to "I'll just make my own copy of Composite so I can stick the annotation in the right place" but still more flexible than making it all-or-none for your entire pipeline. Obviously the ideal is for runners to not need these kinds of hints, but sometimes that's not the case.