After thinking about this for the past few days, I've concluded that shared transforms should generally also expose their DoFn classes, to make this kind of pattern easier to accommodate. Then, with a utility decorator/class that takes a DoFn, we can wrap that DoFn so it operates on `KV`s and leaves the keys alone.
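To make the idea concrete, here is a rough sketch of such a wrapper in plain Python. It deliberately does not import Beam: `DoFn` below is a stand-in base class with only a `process` method, and `SquareRootDoFn` and `Keyed` are hypothetical names, not existing Beam or third-party APIs. A real version would also have to forward lifecycle methods and any extra `process` parameters, which this sketch ignores.

```python
import math
from typing import Any, Iterable, Iterator, Optional, Tuple


class DoFn:
    """Stand-in for a Beam-style DoFn (assumption: only process() matters here)."""

    def process(self, element: Any) -> Optional[Iterable[Any]]:
        raise NotImplementedError


class SquareRootDoFn(DoFn):
    """Hypothetical DoFn the other team would expose alongside `SquareRoot`."""

    def process(self, element: float) -> Iterator[float]:
        yield math.sqrt(element)


class Keyed(DoFn):
    """Hypothetical wrapper: applies the inner DoFn to the value of each
    (key, value) pair and re-attaches the untouched key to every output."""

    def __init__(self, inner: DoFn) -> None:
        self._inner = inner

    def process(self, element: Tuple[Any, Any]) -> Iterator[Tuple[Any, Any]]:
        key, value = element
        # The inner DoFn may emit zero, one, or many outputs (or return None).
        for output in self._inner.process(value) or ():
            yield key, output


# Usage: records keep their IDs while the shared logic only sees the float.
records = [("id-1", 4.0), ("id-2", 9.0)]
keyed_sqrt = Keyed(SquareRootDoFn())
results = [out for record in records for out in keyed_sqrt.process(record)]
```

This only works when the owning team exposes the DoFn itself; a wrapper like this cannot reach inside an opaque PTransform.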
It's not the most ergonomic pattern imo, since it requires more thought about PTransforms vs. DoFns and which abstraction level is right for your needs, and it also requires knowing about this `Keyed[DoFn]` decorator, but it seems unavoidable.

On Sat, Oct 12, 2024 at 4:38 PM Henry Tremblay <[email protected]> wrote:

> We have a similar question/issue at my work. 2 solutions come to mind:
>
> 1. Wrap your inputs, transforms, etc. in functions that you can call and
> then chain together
>
> 2. Use external libraries that a ParDo class can call. Then you can make
> these external libraries flexible and testable.
>
> On Sat, Oct 12, 2024, 12:31 PM Joey Tran <[email protected]>
> wrote:
>
>> Yes. But this is a hypothetical; there could also be many operations you
>> might want to do with the initial data.
>>
>> On Sat, Oct 12, 2024, 1:47 PM Henry Tremblay <[email protected]>
>> wrote:
>>
>>> So the only part of the pipeline you need to change is the
>>> transformation in the middle, after the read from the DB and before some
>>> type of write?
>>>
>>> On Sat, Oct 12, 2024 at 3:29 AM <[email protected]> wrote:
>>>
>>>> Sounds like you want a monad, heh.
>>>>
>>>> It would be nice if their DoFn took a generic type and you could pass
>>>> it a selector func to pick out what they need.
>>>> If you can access their DoFn and it is not too complex, perhaps you
>>>> could just use their processElement implementation directly?
>>>>
>>>> e.g.
>>>>
>>>> class TheirDoFn ..{ void processElement(...){...} }
>>>>
>>>> class YourDoFn .. {
>>>>   void processElement() {
>>>>     TheirDoFn().processElement(...)
>>>>   }
>>>> }
>>>>
>>>> Depending on what annotations they're using in their processElement
>>>> func, it could be trickier or not. You could pass in a mock
>>>> OutputReceiver implementation, so you can wrap the results and delegate.
>>>>
>>>> On Sat, 12 Oct 2024 at 08:51, XQ Hu via user <[email protected]>
>>>> wrote:
>>>>
>>>>> This sounds like what CDC (Change Data Capture) typically does, which
>>>>> usually runs as a streaming pipeline.
>>>>>
>>>>> On Fri, Oct 11, 2024 at 3:51 PM Joey Tran <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Another basic pattern question for the user group.
>>>>>>
>>>>>> Say I have a database of records with an ID and some float property.
>>>>>> Another team has written and published a transform `SquareRoot`. I
>>>>>> want to write a pipeline that reads this database and outputs extended
>>>>>> records that have (ID, foo_prop, squareroot(foo)_prop). How do I do this?
>>>>>>
>>>>>> Of course I can strip my records of their ID and then pass the
>>>>>> properties straight into `SquareRoot`, but then I have no way to link a
>>>>>> result back to the record it corresponds to. Do I just need to ask
>>>>>> the other team to make their SquareRootDoFn public? Should they have
>>>>>> included a `SquareRoot.WithKey()` transform that ignores a key?
>>>>>>
>>>>>> This feels like it'd be a common pattern, but how to approach it feels
>>>>>> awkward. Not sure if I'm missing something obvious, so I thought I'd ask
>>>>>> the group.
>>>>>>
>>>>>> Cheers,
>>>>>> Joey
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Joey Tran | Staff Developer | AutoDesigner TL
>>>>>>
>>>>>> *he/him*
>>>>>>
>>>
>>> --
>>> Henry Tremblay
>>> Data Engineer, Best Buy
