Ah, that is a good point: being element-wise would make managing windows and timestamps easier for the user. Fortunately it's a fairly easy change to make, and it may even mean less typing for the user. I was originally thinking side inputs and metrics would be handled outside the loop, but at that point I think you want a class rather than a closure, for sanity's sake.
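
To make the element-wise idea concrete, here is a minimal sketch. It is not the SDK's actual API: `Window`, `Stamped`, `Output`, and `elementwise` are hypothetical names made up for illustration, and `AsyncStream` merely stands in for a PCollectionStream. The only point it tries to show is that an element-wise wrapper can carry each input's timestamp and window over to the output, so the user's closure never touches them.

```swift
import Foundation

// Hypothetical stand-ins for illustration; not the Swift SDK's actual types.
struct Window: Equatable { let name: String }
typealias Stamped<T> = (value: T, timestamp: Date, window: Window)

// Minimal output sink that records what was emitted.
final class Output<T> {
    private(set) var elements: [Stamped<T>] = []
    func emit(_ value: T, timestamp: Date, window: Window) {
        elements.append((value: value, timestamp: timestamp, window: window))
    }
}

// Element-wise adapter: the user's closure sees only the value, and the
// wrapper copies each input's timestamp and window onto its output.
func elementwise<In, Out>(
    _ input: AsyncStream<Stamped<In>>,
    into output: Output<Out>,
    _ fn: (In) -> Out
) async {
    for await (value, timestamp, window) in input {
        output.emit(fn(value), timestamp: timestamp, window: window)
    }
}

// Small driver: two stamped elements go in, two come out with the same metadata.
func demo() async -> [Stamped<String>] {
    let window = Window(name: "global")
    let input = AsyncStream<Stamped<String>> { continuation in
        continuation.yield((value: "to be", timestamp: Date(timeIntervalSince1970: 1), window: window))
        continuation.yield((value: "or not", timestamp: Date(timeIntervalSince1970: 2), window: window))
        continuation.finish()
    }
    let output = Output<String>()
    await elementwise(input, into: output) { $0.uppercased() }
    return output.elements
}

let results = await demo()
for r in results {
    print(r.value, r.timestamp.timeIntervalSince1970, r.window.name)
}
```

Because the closure here is strictly one-in, one-out, the 1:1 relationship between input and output metadata holds by construction. Anything that buffers or emits after the loop would still need the stream form, or a class with explicit bundle hooks.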

On Thu, Aug 24, 2023 at 12:02 PM Robert Bradshaw <rober...@google.com> wrote:

> Ah, I see.
>
> Yeah, I've thought about using an iterable for the whole bundle rather
> than start/finish bundle callbacks, but one of the questions is how that
> would impact implicit passing of the timestamp (and other) metadata from
> input elements to output elements. (You can of course attach the metadata
> to any output that happens in the loop body, but it's very easy to
> implicitly break the 1:1 relationship here (e.g. by doing buffering or
> otherwise modifying local state) and this would be hard to detect.) (I
> suppose trying to output after the loop finishes could require
> something more explicit.)
>
>
> On Wed, Aug 23, 2023 at 6:56 PM Byron Ellis <byronel...@google.com> wrote:
>
>> Oh, I also forgot to mention that I included element-wise collection
>> operations like "map" that eliminate the need for pardo in many cases. The
>> groupBy command is actually a map + groupByKey under the hood. That was to
>> be more consistent with Swift's collection protocol (and is also why
>> PCollection and PCollectionStream are different types... PCollection
>> implements map and friends as pipeline construction operations whereas
>> PCollectionStream is an actual stream).
>>
>> I just happened to push some "IO primitives" that use map rather than
>> pardo in a couple of places to do a true wordcount using good ol'
>> Shakespeare and very very primitive GCS IO.
>>
>> Best,
>> B
>>
>> On Wed, Aug 23, 2023 at 6:08 PM Byron Ellis <byronel...@google.com>
>> wrote:
>>
>>> Indeed :-) Yeah, I went back and forth on the pardo syntax quite a bit
>>> before settling on where I ended up. Ultimately I decided to go with
>>> something that felt more Swift-y than anything else, which means that rather
>>> than dealing with a single element like you do in the other SDKs you're
>>> dealing with a stream of elements (which of course will often be of size
>>> 1).
>>> That's a really natural paradigm in the Swift world, especially with the
>>> async / await structures. So when you see something like:
>>>
>>>     pardo(name: "Read Files") { filenames, output, errors in
>>>         for try await (filename, _, _) in filenames {
>>>             ...
>>>             output.emit(data)
>>>         }
>>>     }
>>>
>>> filenames is the input stream and then output and errors are both output
>>> streams. In theory you can have as many output streams as you like, though
>>> at the moment there's a compiler bug in the new type pack feature that
>>> limits it to "as many as I felt like supporting". Presumably this will get
>>> fixed before the official 5.9 release, which will probably be in the October
>>> timeframe if history is any guide.
>>>
>>> If you had parameterization you wanted to send, that would look like
>>> pardo("Parameter") { param,filenames,output,error in ... } where "param"
>>> would take on the value of "Parameter." All of this is being typechecked at
>>> compile time BTW.
>>>
>>>
>>> The (filename,_,_) is a tuple spreading construct like you have in ES6
>>> and other things, where "_" is Swift for "ignore." In this case
>>> PCollectionStreams have an element signature of (Of,Date,Window) so you can
>>> optionally extract the timestamp and the window if you want to manipulate
>>> it somehow.
>>>
>>> That said, it would also be natural to provide elementwise pardos---that
>>> would probably mean having explicit type signatures in the closure. I had
>>> that at one point, but it felt less natural the more I used it. I'm also
>>> slowly working towards adding a more "traditional" DoFn implementation
>>> approach where you implement the DoFn as an object type. In that case it
>>> would be very very easy to support both by having a default stream
>>> implementation call the equivalent of processElement. To make that
>>> performant I need to implement an @DoFn macro and I just haven't gotten to
>>> it yet.
>>>
>>> It's a bit more work and I've been prioritizing implementing composite
>>> and external transforms for the reasons you suggest. :-) I've got the
>>> basics of a composite transform (there's an equivalent wordcount example)
>>> and am hooking it into the pipeline generation, which should also give me
>>> everything I need to successfully hook in external transforms as well. That
>>> will give me the jump on IOs as you say. I can also treat the pipeline
>>> itself as a composite transform, which lets me get rid of the Pipeline {
>>> pipeline in ... } and just instead have things attach themselves to the
>>> pipeline implicitly.
>>>
>>> That said, there are some interesting IO possibilities that would be
>>> Swift native. In particular, I've been looking at the native Swift
>>> binding for DuckDB (which is C++ based). DuckDB is SQL based but not
>>> distributed in the same way as, say, Beam SQL... but it would allow for SQL
>>> statements on individual files with projection pushdown supported for
>>> things like Parquet, which could have some cool and performant data lake
>>> applications. I'll probably do a couple of the simpler IOs as
>>> well---there's a Swift AWS SDK binding that's pretty good that would give
>>> me S3 and there's a Cloud auth library as well that makes it pretty easy to
>>> work with GCS.
>>>
>>> In any case, I'm updating the branch as I find a minute here and there.
>>>
>>> Best,
>>> B
>>>
>>>
>>> On Wed, Aug 23, 2023 at 5:02 PM Robert Bradshaw <rober...@google.com>
>>> wrote:
>>>
>>>> Neat.
>>>>
>>>> Nothing like writing an SDK to actually understand how the FnAPI works
>>>> :). I like the use of groupBy. I have to admit I'm a bit mystified by the
>>>> syntax for parDo (I don't know Swift at all, which is probably tripping me
>>>> up). The addition of external (cross-language) transforms could let you
>>>> steal everything (e.g. IOs) pretty quickly from other SDKs.
>>>>
>>>> On Fri, Aug 18, 2023 at 7:55 AM Byron Ellis via user <
>>>> user@beam.apache.org> wrote:
>>>>
>>>>> For everyone who is interested, here's the draft PR:
>>>>>
>>>>> https://github.com/apache/beam/pull/28062
>>>>>
>>>>> I haven't had a chance to test it on my M1 machine yet though (there's
>>>>> a good chance there are a few places that need to properly address
>>>>> endianness, specifically timestamps in windowed values and length in
>>>>> iterable coders, as those both use specifically big-endian representations).
>>>>>
>>>>>
>>>>> On Thu, Aug 17, 2023 at 8:57 PM Byron Ellis <byronel...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Cham,
>>>>>>
>>>>>> Definitely happy to open a draft PR so folks can comment---there's
>>>>>> not as much code as it looks like since most of the LOC is just generated
>>>>>> protobuf. As for the support, I definitely want to add external
>>>>>> transforms and may actually add that support before adding the ability
>>>>>> to make composites in the language itself. With the way the SDK is laid
>>>>>> out, adding composites to the pipeline graph is a separate operation
>>>>>> from defining a composite.
>>>>>>
>>>>>> On Thu, Aug 17, 2023 at 4:28 PM Chamikara Jayalath <
>>>>>> chamik...@google.com> wrote:
>>>>>>
>>>>>>> Thanks Byron. This sounds great. I wonder if there is interest in
>>>>>>> Swift SDK from folks currently subscribed to the +user
>>>>>>> <user@beam.apache.org> list.
>>>>>>>
>>>>>>> On Wed, Aug 16, 2023 at 6:53 PM Byron Ellis via dev <
>>>>>>> d...@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> Hello everyone,
>>>>>>>>
>>>>>>>> A couple of months ago I decided that I wanted to really understand
>>>>>>>> how the Beam FnApi works and how it interacts with the Portable
>>>>>>>> Runner.
>>>>>>>> For me at least that usually means I need to write some code so I
>>>>>>>> can see things happening in a debugger, and to really prove to
>>>>>>>> myself I understood what was going on I decided I couldn't use an
>>>>>>>> existing SDK language to do it, since there would be the temptation
>>>>>>>> to read some code and convince myself that I actually understood
>>>>>>>> what was going on.
>>>>>>>>
>>>>>>>> One thing led to another and it turns out that to get a minimal
>>>>>>>> FnApi integration going you end up writing a fair bit of an SDK. So
>>>>>>>> I decided to take things to a point where I had an SDK that could
>>>>>>>> execute a word count example via a portable runner backend. I've now
>>>>>>>> reached that point and would like to submit my prototype SDK to the
>>>>>>>> list for feedback.
>>>>>>>>
>>>>>>>> It's currently living in a branch on my fork here:
>>>>>>>>
>>>>>>>> https://github.com/byronellis/beam/tree/swift-sdk/sdks/swift
>>>>>>>>
>>>>>>>> At the moment it runs via the most recent Xcode Beta using Swift
>>>>>>>> 5.9 on Intel Macs, but should also work using beta builds of 5.9 for
>>>>>>>> Linux running on Intel hardware. I haven't had a chance to try it on
>>>>>>>> ARM hardware and make sure all of the endian checks are complete.
>>>>>>>> The "IntegrationTests.swift" file contains a word count example that
>>>>>>>> reads some local files (as well as a missing file to exercise DLQ
>>>>>>>> functionality) and outputs counts through two separate group by
>>>>>>>> operations to get it past the "map reduce" size of pipeline. I've
>>>>>>>> tested it against the Python Portable Runner. Since my goal was to
>>>>>>>> learn FnApi there is no Direct Runner at this time.
>>>>>>>>
>>>>>>>> I've shown it to a couple of folks already and incorporated some of
>>>>>>>> that feedback (for example pardo was originally called dofn when
>>>>>>>> defining pipelines).
>>>>>>>> In general I've tried to make the API as "Swift-y" as possible,
>>>>>>>> hence the heavy reliance on closures, and while there aren't yet
>>>>>>>> composite PTransforms there's the beginnings of what would be needed
>>>>>>>> for a SwiftUI-like declarative API for creating them.
>>>>>>>>
>>>>>>>> There are of course a ton of missing bits still to be implemented,
>>>>>>>> like counters, metrics, windowing, state, timers, etc.
>>>>>>>>
>>>>>>>
>>>>>>> This should be fine and we can get the code documented without these
>>>>>>> features. I think support for composites and adding an external
>>>>>>> transform (see Java
>>>>>>> <https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/External.java>,
>>>>>>> Python
>>>>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/python/apache_beam/transforms/external.py#L556>,
>>>>>>> Go
>>>>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/go/pkg/beam/xlang.go#L155>,
>>>>>>> TypeScript
>>>>>>> <https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/external.ts>)
>>>>>>> to add support for multi-lang will bring in a lot of features (for
>>>>>>> example, I/O connectors) for free.
>>>>>>>
>>>>>>>>
>>>>>>>> Any and all feedback welcome and happy to submit a PR if folks are
>>>>>>>> interested, though the "Swift Way" would be to have it in its own
>>>>>>>> repo so that it can easily be used from the Swift Package Manager.
>>>>>>>>
>>>>>>>
>>>>>>> +1 for creating a PR (maybe as a draft initially). Also it'll be
>>>>>>> easier to comment on a PR :)
>>>>>>>
>>>>>>> - Cham
>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> B
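
On the endianness concern raised earlier in the thread (timestamps in windowed values and lengths in iterable coders being big-endian), one way to sidestep the question is to write the conversions so they never depend on host byte order. The sketch below is illustrative only: `bigEndianBytes` and `fromBigEndian` are hypothetical helper names, and the exact wire layout of each Beam coder should be checked against the Beam coder definitions rather than inferred from this example.

```swift
// Encode any fixed-width integer as big-endian bytes, whatever the host order.
// `bigEndian` byte-swaps on little-endian hosts and is a no-op on big-endian ones.
func bigEndianBytes<T: FixedWidthInteger>(_ value: T) -> [UInt8] {
    withUnsafeBytes(of: value.bigEndian) { Array($0) }
}

// Decode big-endian bytes by shifting and OR-ing; this arithmetic form never
// looks at memory layout, so it behaves the same on every host.
func fromBigEndian<T: FixedWidthInteger>(_ bytes: [UInt8], as type: T.Type) -> T? {
    guard bytes.count == MemoryLayout<T>.size else { return nil }
    var result: T = 0
    for byte in bytes {
        result = (result << 8) | T(truncatingIfNeeded: byte)
    }
    return result
}

// Round-trip a 64-bit value (e.g. a millisecond timestamp) and a 32-bit length.
let timestampMillis: Int64 = 1_692_900_000_000
let encodedTimestamp = bigEndianBytes(timestampMillis)
let decodedTimestamp = fromBigEndian(encodedTimestamp, as: Int64.self)

let length: Int32 = 3
let encodedLength = bigEndianBytes(length)

print(encodedTimestamp.count, decodedTimestamp == timestampMillis, encodedLength)
```

Keeping both directions in small helpers like these means a test run on a single machine exercises the same code path that any other hardware would use.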