On Wed, Sep 20, 2023 at 10:48 AM Danny McCormick <dannymccorm...@google.com> wrote:
> > I think the process should be similar to other code/design reviews for > large contributions. I don't think you need a PMC involvement here. > > I think it does require PMC involvement to create the actual repo once we > have public consensus. I tried the flow at > https://infra.apache.org/version-control.html#create but it seems like > it's PMC-only. It's unclear to me if consensus has been achieved, maybe a > dedicated voting thread with implied lazy consensus would help here. > Yeah, it seems like a PMC member needs to create the repo. > > > Sure, we could definitely include things as a submodule for stuff like > testing multi-language, though I think there's actually a cleaner way just > using the Swift package manager's test facilities to access the swift sdk > repo. > > +1 on avoiding submodules. If needed we could also use multi-repo checkout > with GitHub Actions. I think my biggest question is what we'd actually be > enforcing though. In general, I'd expect the normal update flow to be > > 1) Update Beam protos and/or multi-lang components (though the set of > things that needs to be updated for multi-lang is unclear to me) > Regarding multi-lang, the protocol does not require consistent versioning but we may need testing to make sure things work consistently/correctly when used from a released Swift SDK. For example, Python multi-lang wrappers look for a Java version with the same version number as the Python SDK being used. > 2) Mirror those changes to the Swift SDK. > > The thing that is most likely to be forgotten is the 2nd step, and that is > hard to enforce with automation since the automation would either be on the > first step which doesn't have anything to enforce or on some sort of > schedule in the swift repo, which is less likely to be visible. I'm a > little worried we wouldn't notice breakages until release time. > > I wonder how much stuff happens outside of the proto directory that needs > to be mirrored. 
Could we just create scheduled automation to exactly copy > changes in the proto directory and version changes for multi-lang stuff to > the swift SDK repo? > > --------------------------------------------------------------------- > > Regardless, I'm +1 on a dedicated repo; I'd rather we take on some > organizational weirdness than push that pain to users. > > Thanks, > Danny > > On Wed, Sep 20, 2023 at 1:38 PM Byron Ellis via user <user@beam.apache.org> > wrote: > >> Sure, we could definitely include things as a submodule for stuff like >> testing multi-language, though I think there's actually a cleaner way just >> using the Swift package manager's test facilities to access the swift sdk >> repo. >> >> That would also be consistent with the user-side experience and let us >> test things like build-time integrations with multi-language as well (which >> is possible in Swift through compiler plugins) in the same way as a >> pipeline author would. You also maybe get backwards compatibility testing >> as a side effect in that case as well. >> >> >> >> >> >> >> On Wed, Sep 20, 2023 at 10:20 AM Chamikara Jayalath <chamik...@google.com> >> wrote: >> >>> >>> >>> >>> On Wed, Sep 20, 2023 at 9:54 AM Byron Ellis <byronel...@google.com> >>> wrote: >>> >>>> Hi all, >>>> >>>> I've chatted with a couple of people offline about this and my >>>> impression is that folks are generally amenable to a separate repo to match >>>> the target community? I have no idea what the next steps would be though >>>> other than guessing that there's probably some sort of PMC thing involved? >>>> Should I write something up somewhere? >>>> >>> >>> I think the process should be similar to other code/design reviews for >>> large contributions. I don't think you need a PMC involvement here. 
>>> >>> >>>> >>>> Best, >>>> B >>>> >>>> On Thu, Sep 14, 2023 at 9:00 AM Byron Ellis <byronel...@google.com> >>>> wrote: >>>> >>>>> Hi all, >>>>> >>>>> I've been on vacation, but mostly working on getting External >>>>> Transform support going (which in turn basically requires Schema support >>>>> as >>>>> well). It also looks like macros landed in Swift 5.9 for Linux so we'll be >>>>> able to use those to do some compile-time automation. In particular, this >>>>> lets us do something similar to what Java does with ByteBuddy for >>>>> generating schema coders though it has to be ahead of time so not quite >>>>> the >>>>> same. (As far as I can tell this is a reason why macros got added to the >>>>> language in the first place---Apple's SwiftData library makes heavy use of >>>>> the feature). >>>>> >>>>> I do have one question for the group though: should the Swift SDK >>>>> distribution take on Beam community properties or Swift community >>>>> properties? Specifically, in the Swift world the Swift SDK would live in >>>>> its own repo (beam-swift for example), which allows it to be most easily >>>>> consumed and keeps the checkout size under control for users. "Releases" >>>>> in >>>>> the Swift world (much like Go) are just repo tags. The downside here is >>>>> that there's overhead in setting up the various github actions and other >>>>> CI/CD bits and bobs. >>>>> >>>>> >>> >>>> The alternative would be to keep it in the beam repo itself like it is >>>>> now, but we'd probably want to move Package.swift to the root since for >>>>> whatever reason the Swift community (much to some people's annoyance) has >>>>> chosen to have packages only really able to live at the top of a repo. >>>>> This >>>>> has less overhead from a CI/CD perspective, but lots of overhead for users >>>>> as they'd be checking out the entire Beam repo to use the SDK, which >>>>> happens a lot. 
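To make the user-overhead comparison concrete: with a standalone repo whose releases are plain tags, the entire consumption story is one dependency line in a user's manifest, rather than a clone of the whole Beam repo. A sketch of such a manifest — the repo URL, product name, and version below are illustrative assumptions, not settled decisions:

```swift
// swift-tools-version:5.9
// Hypothetical user-side Package.swift, assuming a standalone
// "beam-swift" repo whose releases are plain repo tags.
import PackageDescription

let package = Package(
    name: "MyPipeline",
    dependencies: [
        // URL and version are placeholders for illustration only.
        .package(url: "https://github.com/apache/beam-swift.git", from: "0.1.0")
    ],
    targets: [
        .executableTarget(
            name: "MyPipeline",
            dependencies: [.product(name: "ApacheBeam", package: "beam-swift")]
        )
    ]
)
```

Because SwiftPM only resolves packages that live at the root of a repository, keeping the SDK inside the main beam repo would instead force users to check out the entire repository, which is the overhead described above.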
>>>>> >>>>> There's a third option which is basically "do both" but honestly that >>>>> just seems like the worst of both worlds as it would require constant >>>>> syncing if we wanted to make it possible for Swift users to target >>>>> unreleased SDKs for development and testing. >>>>> >>>>> Personally, I would lean towards the former option (and would >>>>> volunteer to set up & document the various automations) as it is lighter >>>>> for the actual users of the SDK and more consistent with the community >>>>> experience they expect. The CI/CD stuff is mostly a "do it once" whereas >>>>> checking out the entire repo with many updates the user doesn't care about >>>>> is something they will be doing all the time. FWIW some of our >>>>> dependencies >>>>> also chose this route---most notably GRPC which started with the latter >>>>> approach and has moved to the former. >>>>> >>>> >>> I believe existing SDKs benefit from living in the same repo. For >>> example, it's easier to keep them consistent with any model/proto changes >>> and it's easier to manage distributions/tags. Also it's easier to keep >>> components consistent for multi-lang. If we add Swift to a separate repo, >>> we'll probably have to add tooling/scripts to keep things consistent. >>> Is it possible to create a separate repo, but also add a reference (and >>> Gradle tasks) under "beam/sdks/swift" so that we can add Beam tests to make >>> sure that things stay consistent ? >>> >>> Thanks, >>> Cham >>> >>> >>>> >>>>> Interested to hear any feedback on the subject since I'm guessing it >>>>> probably came up with the Go SDK back in the day? >>>>> >>>>> Best, >>>>> B >>>>> >>>>> >>>>> >>>>> On Tue, Aug 29, 2023 at 7:59 AM Byron Ellis <byronel...@google.com> >>>>> wrote: >>>>> >>>>>> After a couple of iterations (thanks rebo!) we've also gotten the >>>>>> Swift SDK working with the new Prism runner. The fact that it doesn't do >>>>>> fusion caught a couple of configuration bugs (e.g. 
that the grpc message >>>>>> receiver buffer should be fairly large). It would seem that at the moment >>>>>> Prism and the Flink runner have similar orders of strictness when >>>>>> interpreting the pipeline graph while the Python portable runner is far >>>>>> more forgiving. >>>>>> >>>>>> Also added support for bounded vs unbounded pcollections through the >>>>>> "type" parameter when adding a pardo. Impulse is a bounded pcollection I >>>>>> believe? >>>>>> >>>>>> On Fri, Aug 25, 2023 at 2:04 PM Byron Ellis <byronel...@google.com> >>>>>> wrote: >>>>>> >>>>>>> Okay, after a brief detour through "get this working in the Flink >>>>>>> Portable Runner" I think I have something pretty workable. >>>>>>> >>>>>>> PInput and POutput can actually be structs rather than protocols, >>>>>>> which simplifies things quite a bit. It also allows us to use them with >>>>>>> property wrappers for a SwiftUI-like experience if we want when defining >>>>>>> DoFns (which is what I was originally intending to use them for). That >>>>>>> also >>>>>>> means the function signature you use for closures would match >>>>>>> full-fledged >>>>>>> DoFn definitions for the most part which is satisfying. >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Aug 24, 2023 at 5:55 PM Byron Ellis <byronel...@google.com> >>>>>>> wrote: >>>>>>>> >>>>>>>> Okay, I tried a couple of different things. >>>>>>>> >>>>>>>> Implicitly passing the timestamp and window during iteration did >>>>>>>> not go well. While physically possible it introduces an invisible side >>>>>>>> effect into loop iteration which confused me when I tried to use it >>>>>>>> even though I >>>>>>>> implemented it. Also, I'm pretty sure there'd end up being some sort of >>>>>>>> race condition nightmare continuing down that path. >>>>>>>> >>>>>>>> What I decided to do instead was the following: >>>>>>>> >>>>>>>> 1. Rename the existing "pardo" functions to "pstream" and require >>>>>>>> that they always emit a window and timestamp along with their value. 
>>>>>>>> This >>>>>>>> eliminates the side effect but lets us keep iteration in a bundle where >>>>>>>> that might be convenient. For example, in my cheesy GCS implementation >>>>>>>> it >>>>>>>> means that I can keep an OAuth token around for the lifetime of the >>>>>>>> bundle >>>>>>>> as a local variable, which is convenient. It's a bit more typing for >>>>>>>> users >>>>>>>> of pstream, but the expectation here is that if you're using pstream >>>>>>>> functions You Know What You Are Doing and most people won't be using it >>>>>>>> directly. >>>>>>>> >>>>>>>> 2. Introduce a new set of pardo functions (I didn't do all of them >>>>>>>> yet, but enough to test the functionality and decide I liked it) which >>>>>>>> take >>>>>>>> a function signature of (any PInput<InputType>,any >>>>>>>> POutput<OutputType>). >>>>>>>> PInput takes the (InputType,Date,Window) tuple and converts it into a >>>>>>>> struct with friendlier names. Not strictly necessary, but makes the >>>>>>>> code >>>>>>>> nicer to read I think. POutput introduces emit functions that >>>>>>>> optionally >>>>>>>> allow you to specify a timestamp and a window. If you don't for either >>>>>>>> one >>>>>>>> it will take the timestamp and/or window of the input. >>>>>>>> >>>>>>>> Trying to use that was pretty pleasant to use so I think we should >>>>>>>> continue down that path. If you'd like to see it in use, I >>>>>>>> reimplemented >>>>>>>> map() and flatMap() in terms of this new pardo functionality. >>>>>>>> >>>>>>>> Code has been pushed to the branch/PR if you're interested in >>>>>>>> taking a look. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Aug 24, 2023 at 2:15 PM Byron Ellis <byronel...@google.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Gotcha, I think there's a fairly easy solution to link input and >>>>>>>>> output streams.... Let me try it out... might even be possible to >>>>>>>>> have both >>>>>>>>> element and stream-wise closure pardos. 
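The element-wise pardo variant described above might look like this in use — the `PInput`/`POutput` names and the default-metadata `emit` behavior come from the description in this thread, but the exact spellings and signatures are my assumptions, not the committed API:

```swift
// Sketch against the prototype API described above; not runnable
// standalone. PInput wraps the (value, timestamp, window) tuple with
// friendlier names, and POutput's emit defaults to the input's
// timestamp and window when they are omitted.
let words = lines.pardo(name: "SplitWords") {
    (input: any PInput<String>, output: any POutput<String>) in
    for word in input.value.split(separator: " ") {
        // No timestamp/window given: inherits both from `input`.
        output.emit(String(word))
        // Overriding metadata explicitly would look something like:
        // output.emit(String(word), timestamp: input.timestamp, window: input.window)
    }
}
```

The contrast with `pstream` is that here the 1:1 input/output metadata propagation is handled for you, whereas a `pstream` function must always emit a window and timestamp itself.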
Definitely possible to have >>>>>>>>> that at >>>>>>>>> the DoFn level (called SerializableFn in the SDK because I want to >>>>>>>>> use @DoFn as a macro) >>>>>>>>> >>>>>>>>> On Thu, Aug 24, 2023 at 1:09 PM Robert Bradshaw < >>>>>>>>> rober...@google.com> wrote: >>>>>>>>> >>>>>>>>>> On Thu, Aug 24, 2023 at 12:58 PM Chamikara Jayalath < >>>>>>>>>> chamik...@google.com> wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Aug 24, 2023 at 12:27 PM Robert Bradshaw < >>>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> I would like to figure out a way to get the stream-y interface >>>>>>>>>>>> to work, as I think it's more natural overall. >>>>>>>>>>>> >>>>>>>>>>>> One hypothesis is that if any elements are carried over loop >>>>>>>>>>>> iterations, there will likely be some that are carried over beyond >>>>>>>>>>>> the loop >>>>>>>>>>>> (after all the callee doesn't know when the loop is supposed to >>>>>>>>>>>> end). We >>>>>>>>>>>> could reject "plain" elements that are emitted after this point, >>>>>>>>>>>> requiring >>>>>>>>>>>> one to emit timestamp-windowed-values. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Are you assuming that the same stream (or overlapping sets of >>>>>>>>>>> data) are pushed to multiple workers ? I thought that the set of >>>>>>>>>>> data >>>>>>>>>>> streamed here are the data that belong to the current bundle (hence >>>>>>>>>>> already >>>>>>>>>>> assigned to the current worker) so any output from the current >>>>>>>>>>> bundle >>>>>>>>>>> invocation would be a valid output of that bundle. >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> Yes, the content of the stream is exactly the contents of the >>>>>>>>>> bundle. The question is how to do the input_element:output_element >>>>>>>>>> correlation for automatically propagating metadata. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> Related to this, we could enforce that the only >>>>>>>>>>>> (user-accessible) way to get such a timestamped value is to start >>>>>>>>>>>> with one, >>>>>>>>>>>> e.g. 
a WindowedValue<T>.withValue(O) produces a WindowedValue<O> >>>>>>>>>>>> with the >>>>>>>>>>>> same metadata but a new value. Thus a user wanting to do anything >>>>>>>>>>>> "fancy" >>>>>>>>>>>> would have to explicitly request iteration over these windowed >>>>>>>>>>>> values >>>>>>>>>>>> rather than over the raw elements. (This is also forward >>>>>>>>>>>> compatible with >>>>>>>>>>>> expanding the metadata that can get attached, e.g. pane infos, and >>>>>>>>>>>> makes >>>>>>>>>>>> the right thing the easiest/most natural.) >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Aug 24, 2023 at 12:10 PM Byron Ellis < >>>>>>>>>>>> byronel...@google.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Ah, that is a good point—being element-wise would make >>>>>>>>>>>>> managing windows and time stamps easier for the user. Fortunately >>>>>>>>>>>>> it’s a >>>>>>>>>>>>> fairly easy change to make and maybe even less typing for the >>>>>>>>>>>>> user. I was >>>>>>>>>>>>> originally thinking side inputs and metrics would happen outside >>>>>>>>>>>>> the loop, >>>>>>>>>>>>> but I think you want a class and not a closure at that point for >>>>>>>>>>>>> sanity. >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Aug 24, 2023 at 12:02 PM Robert Bradshaw < >>>>>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Ah, I see. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Yeah, I've thought about using an iterable for the whole >>>>>>>>>>>>>> bundle rather than start/finish bundle callbacks, but one of the >>>>>>>>>>>>>> questions >>>>>>>>>>>>>> is how that would impact implicit passing of the timestamp (and >>>>>>>>>>>>>> other) >>>>>>>>>>>>>> metadata from input elements to output elements. (You can of >>>>>>>>>>>>>> course attach >>>>>>>>>>>>>> the metadata to any output that happens in the loop body, but >>>>>>>>>>>>>> it's very >>>>>>>>>>>>>> easy to implicitly break the 1:1 relationship here (e.g. 
by >>>>>>>>>>>>>> doing >>>>>>>>>>>>>> buffering or otherwise modifying local state) and this would be >>>>>>>>>>>>>> hard to >>>>>>>>>>>>>> detect. (I suppose trying to output after the loop finishes >>>>>>>>>>>>>> could require >>>>>>>>>>>>>> something more explicit). >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 6:56 PM Byron Ellis < >>>>>>>>>>>>>> byronel...@google.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Oh, I also forgot to mention that I included element-wise >>>>>>>>>>>>>>> collection operations like "map" that eliminate the need for >>>>>>>>>>>>>>> pardo in many >>>>>>>>>>>>>>> cases. the groupBy command is actually a map + groupByKey under >>>>>>>>>>>>>>> the hood. >>>>>>>>>>>>>>> That was to be more consistent with Swift's collection protocol >>>>>>>>>>>>>>> (and is >>>>>>>>>>>>>>> also why PCollection and PCollectionStream are different >>>>>>>>>>>>>>> types... >>>>>>>>>>>>>>> PCollection implements map and friends as pipeline construction >>>>>>>>>>>>>>> operations >>>>>>>>>>>>>>> whereas PCollectionStream is an actual stream) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I just happened to push some "IO primitives" that uses map >>>>>>>>>>>>>>> rather than pardo in a couple of places to do a true wordcount >>>>>>>>>>>>>>> using good >>>>>>>>>>>>>>> ol' Shakespeare and very very primitive GCS IO. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>> B >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 6:08 PM Byron Ellis < >>>>>>>>>>>>>>> byronel...@google.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Indeed :-) Yeah, I went back and forth on the pardo syntax >>>>>>>>>>>>>>>> quite a bit before settling on where I ended up. 
Ultimately I >>>>>>>>>>>>>>>> decided to go >>>>>>>>>>>>>>>> with something that felt more Swift-y than anything else which >>>>>>>>>>>>>>>> means that >>>>>>>>>>>>>>>> rather than dealing with a single element like you do in the >>>>>>>>>>>>>>>> other SDKs >>>>>>>>>>>>>>>> you're dealing with a stream of elements (which of course will >>>>>>>>>>>>>>>> often be of >>>>>>>>>>>>>>>> size 1). That's a really natural paradigm in the Swift world >>>>>>>>>>>>>>>> especially >>>>>>>>>>>>>>>> with the async / await structures. So when you see something >>>>>>>>>>>>>>>> like: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> pardo(name:"Read Files") { filenames,output,errors in >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> for try await (filename,_,_) in filenames { >>>>>>>>>>>>>>>> ... >>>>>>>>>>>>>>>> output.emit(data) >>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> filenames is the input stream and then output and errors >>>>>>>>>>>>>>>> are both output streams. In theory you can have as many output >>>>>>>>>>>>>>>> streams as >>>>>>>>>>>>>>>> you like though at the moment there's a compiler bug in the >>>>>>>>>>>>>>>> new type pack >>>>>>>>>>>>>>>> feature that limits it to "as many as I felt like supporting". >>>>>>>>>>>>>>>> Presumably >>>>>>>>>>>>>>>> this will get fixed before the official 5.9 release which will >>>>>>>>>>>>>>>> probably be >>>>>>>>>>>>>>>> in the October timeframe (if history is any guide) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> If you had parameterization you wanted to send that would >>>>>>>>>>>>>>>> look like pardo("Parameter") { param,filenames,output,error in >>>>>>>>>>>>>>>> ... } where >>>>>>>>>>>>>>>> "param" would take on the value of "Parameter." All of this is >>>>>>>>>>>>>>>> being >>>>>>>>>>>>>>>> typechecked at compile time BTW. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> the (filename,_,_) is a tuple spreading construct like you >>>>>>>>>>>>>>>> have in ES6 and other things where "_" is Swift for "ignore." 
>>>>>>>>>>>>>>>> In this case >>>>>>>>>>>>>>>> PCollectionStreams have an element signature of >>>>>>>>>>>>>>>> (Of,Date,Window) so you can >>>>>>>>>>>>>>>> optionally extract the timestamp and the window if you want to >>>>>>>>>>>>>>>> manipulate >>>>>>>>>>>>>>>> it somehow. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> That said it would also be natural to provide elementwise >>>>>>>>>>>>>>>> pardos--- that would probably mean having explicit type >>>>>>>>>>>>>>>> signatures in the >>>>>>>>>>>>>>>> closure. I had that at one point, but it felt less natural the >>>>>>>>>>>>>>>> more I used >>>>>>>>>>>>>>>> it. I'm also slowly working towards adding a more >>>>>>>>>>>>>>>> "traditional" DoFn >>>>>>>>>>>>>>>> implementation approach where you implement the DoFn as an >>>>>>>>>>>>>>>> object type. In >>>>>>>>>>>>>>>> that case it would be very very easy to support both by having >>>>>>>>>>>>>>>> a default >>>>>>>>>>>>>>>> stream implementation call the equivalent of processElement. >>>>>>>>>>>>>>>> To make that >>>>>>>>>>>>>>>> performant I need to implement an @DoFn macro and I just >>>>>>>>>>>>>>>> haven't gotten to >>>>>>>>>>>>>>>> it yet. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> It's a bit more work and I've been prioritizing >>>>>>>>>>>>>>>> implementing composite and external transforms for the reasons >>>>>>>>>>>>>>>> you suggest. >>>>>>>>>>>>>>>> :-) I've got the basics of a composite transform (there's an >>>>>>>>>>>>>>>> equivalent >>>>>>>>>>>>>>>> wordcount example) and am hooking it into the pipeline >>>>>>>>>>>>>>>> generation, which >>>>>>>>>>>>>>>> should also give me everything I need to successfully hook in >>>>>>>>>>>>>>>> external >>>>>>>>>>>>>>>> transforms as well. That will give me the jump on IOs as you >>>>>>>>>>>>>>>> say. I can >>>>>>>>>>>>>>>> also treat the pipeline itself as a composite transform which >>>>>>>>>>>>>>>> lets me get >>>>>>>>>>>>>>>> rid of the Pipeline { pipeline in ... 
} and just instead have >>>>>>>>>>>>>>>> things attach >>>>>>>>>>>>>>>> themselves to the pipeline implicitly. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> That said, there are some interesting IO possibilities that >>>>>>>>>>>>>>>> would be Swift native. In particular, I've been looking at >>>>>>>>>>>>>>>> the native >>>>>>>>>>>>>>>> Swift binding for DuckDB (which is C++ based). DuckDB is SQL >>>>>>>>>>>>>>>> based but not >>>>>>>>>>>>>>>> distributed in the same way as, say, Beam SQL... but it would >>>>>>>>>>>>>>>> allow for SQL >>>>>>>>>>>>>>>> statements on individual files with projection pushdown >>>>>>>>>>>>>>>> supported for >>>>>>>>>>>>>>>> things like Parquet which could have some cool and performant >>>>>>>>>>>>>>>> data lake >>>>>>>>>>>>>>>> applications. I'll probably do a couple of the simpler IOs as >>>>>>>>>>>>>>>> well---there's a Swift AWS SDK binding that's pretty good that >>>>>>>>>>>>>>>> would give >>>>>>>>>>>>>>>> me S3 and there's a Cloud auth library as well that makes it >>>>>>>>>>>>>>>> pretty easy to >>>>>>>>>>>>>>>> work with GCS. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> In any case, I'm updating the branch as I find a minute >>>>>>>>>>>>>>>> here and there. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>> B >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 5:02 PM Robert Bradshaw < >>>>>>>>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Neat. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Nothing like writing an SDK to actually understand how >>>>>>>>>>>>>>>>> the FnAPI works :). I like the use of groupBy. I have to >>>>>>>>>>>>>>>>> admit I'm a bit >>>>>>>>>>>>>>>>> mystified by the syntax for parDo (I don't know swift at all >>>>>>>>>>>>>>>>> which is >>>>>>>>>>>>>>>>> probably tripping me up). 
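For anyone else puzzled by the parDo closure syntax: the `(filename,_,_)` pattern in the earlier example is ordinary Swift tuple destructuring inside a `for await` loop, where `_` discards a field. A self-contained illustration using only the standard library (the window is just a `String` here for illustration, standing in for whatever window type the SDK uses):

```swift
import Foundation

// Each element carries (value, timestamp, window), mirroring the
// PCollectionStream element signature described in this thread.
let stream = AsyncStream<(String, Date, String)> { continuation in
    continuation.yield(("kinglear.txt", Date(), "GlobalWindow"))
    continuation.yield(("hamlet.txt", Date(), "GlobalWindow"))
    continuation.finish()
}

// "_" ignores the timestamp and window, binding only the value.
for await (filename, _, _) in stream {
    print(filename)
}
```

So a pardo body is just an async loop over such a stream; pulling out the timestamp or window is a matter of naming those tuple positions instead of writing `_`.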
The addition of external >>>>>>>>>>>>>>>>> (cross-language) >>>>>>>>>>>>>>>>> transforms could let you steal everything (e.g. IOs) pretty >>>>>>>>>>>>>>>>> quickly from >>>>>>>>>>>>>>>>> other SDKs. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Fri, Aug 18, 2023 at 7:55 AM Byron Ellis via user < >>>>>>>>>>>>>>>>> user@beam.apache.org> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> For everyone who is interested, here's the draft PR: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> https://github.com/apache/beam/pull/28062 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I haven't had a chance to test it on my M1 machine yet >>>>>>>>>>>>>>>>>> though (there's a good chance there are a few places that >>>>>>>>>>>>>>>>>> need to properly >>>>>>>>>>>>>>>>>> address endianness. Specifically timestamps in windowed >>>>>>>>>>>>>>>>>> values and length >>>>>>>>>>>>>>>>>> in iterable coders as those both use specifically bigendian >>>>>>>>>>>>>>>>>> representations) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Thu, Aug 17, 2023 at 8:57 PM Byron Ellis < >>>>>>>>>>>>>>>>>> byronel...@google.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks Cham, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Definitely happy to open a draft PR so folks can >>>>>>>>>>>>>>>>>>> comment---there's not as much code as it looks like since >>>>>>>>>>>>>>>>>>> most of the LOC >>>>>>>>>>>>>>>>>>> is just generated protobuf. As for the support, I >>>>>>>>>>>>>>>>>>> definitely want to add >>>>>>>>>>>>>>>>>>> external transforms and may actually add that support >>>>>>>>>>>>>>>>>>> before adding the >>>>>>>>>>>>>>>>>>> ability to make composites in the language itself. With the >>>>>>>>>>>>>>>>>>> way the SDK is >>>>>>>>>>>>>>>>>>> laid out adding composites to the pipeline graph is a >>>>>>>>>>>>>>>>>>> separate operation >>>>>>>>>>>>>>>>>>> than defining a composite. 
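No external-transform API exists in the Swift prototype yet, so purely as an illustration of the shape such a call might take — modeled on the External transforms linked for Java, Python, Go, and TypeScript later in this thread, with every name, URN, and endpoint below being a hypothetical placeholder:

```swift
// Illustrative sketch only: a cross-language transform is expanded by
// a separate expansion service at pipeline-construction time, so the
// Swift SDK would mostly need to ship the payload and splice the
// returned subgraph into its pipeline proto.
let records = pipeline.external(
    urn: "beam:external:some:transform:v1",  // hypothetical URN
    payload: encodedConfig,                  // hypothetical encoded config
    expansionService: "localhost:8097"       // hypothetical endpoint
)
```

This is the mechanism by which the SDK would pick up existing I/O connectors "for free" without reimplementing them in Swift.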
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Thu, Aug 17, 2023 at 4:28 PM Chamikara Jayalath < >>>>>>>>>>>>>>>>>>> chamik...@google.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks Byron. This sounds great. I wonder if there is >>>>>>>>>>>>>>>>>>>> interest in Swift SDK from folks currently subscribed to >>>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>> +user <user@beam.apache.org> list. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Wed, Aug 16, 2023 at 6:53 PM Byron Ellis via dev < >>>>>>>>>>>>>>>>>>>> d...@beam.apache.org> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Hello everyone, >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> A couple of months ago I decided that I wanted to >>>>>>>>>>>>>>>>>>>>> really understand how the Beam FnApi works and how it >>>>>>>>>>>>>>>>>>>>> interacts with the >>>>>>>>>>>>>>>>>>>>> Portable Runner. For me at least that usually means I >>>>>>>>>>>>>>>>>>>>> need to write some >>>>>>>>>>>>>>>>>>>>> code so I can see things happening in a debugger and to >>>>>>>>>>>>>>>>>>>>> really prove to >>>>>>>>>>>>>>>>>>>>> myself I understood what was going on I decided I >>>>>>>>>>>>>>>>>>>>> couldn't use an existing >>>>>>>>>>>>>>>>>>>>> SDK language to do it since there would be the temptation >>>>>>>>>>>>>>>>>>>>> to read some code >>>>>>>>>>>>>>>>>>>>> and convince myself that I actually understood what was >>>>>>>>>>>>>>>>>>>>> going on. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> One thing led to another and it turns out that to get >>>>>>>>>>>>>>>>>>>>> a minimal FnApi integration going you end up writing a >>>>>>>>>>>>>>>>>>>>> fair bit of an SDK. >>>>>>>>>>>>>>>>>>>>> So I decided to take things to a point where I had an SDK >>>>>>>>>>>>>>>>>>>>> that could >>>>>>>>>>>>>>>>>>>>> execute a word count example via a portable runner >>>>>>>>>>>>>>>>>>>>> backend. I've now >>>>>>>>>>>>>>>>>>>>> reached that point and would like to submit my prototype >>>>>>>>>>>>>>>>>>>>> SDK to the list >>>>>>>>>>>>>>>>>>>>> for feedback. 
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> It's currently living in a branch on my fork here: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> https://github.com/byronellis/beam/tree/swift-sdk/sdks/swift >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> At the moment it runs via the most recent XCode Beta >>>>>>>>>>>>>>>>>>>>> using Swift 5.9 on Intel Macs, but should also work using >>>>>>>>>>>>>>>>>>>>> beta builds of >>>>>>>>>>>>>>>>>>>>> 5.9 for Linux running on Intel hardware. I haven't had a >>>>>>>>>>>>>>>>>>>>> chance to try it >>>>>>>>>>>>>>>>>>>>> on ARM hardware and make sure all of the endian checks >>>>>>>>>>>>>>>>>>>>> are complete. The >>>>>>>>>>>>>>>>>>>>> "IntegrationTests.swift" file contains a word count >>>>>>>>>>>>>>>>>>>>> example that reads some >>>>>>>>>>>>>>>>>>>>> local files (as well as a missing file to exercise DLQ >>>>>>>>>>>>>>>>>>>>> functionality) and >>>>>>>>>>>>>>>>>>>>> output counts through two separate group by operations to >>>>>>>>>>>>>>>>>>>>> get it past the >>>>>>>>>>>>>>>>>>>>> "map reduce" size of pipeline. I've tested it against the >>>>>>>>>>>>>>>>>>>>> Python Portable >>>>>>>>>>>>>>>>>>>>> Runner. Since my goal was to learn FnApi there is no >>>>>>>>>>>>>>>>>>>>> Direct Runner at this >>>>>>>>>>>>>>>>>>>>> time. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I've shown it to a couple of folks already and >>>>>>>>>>>>>>>>>>>>> incorporated some of that feedback already (for example >>>>>>>>>>>>>>>>>>>>> pardo was >>>>>>>>>>>>>>>>>>>>> originally called dofn when defining pipelines). In >>>>>>>>>>>>>>>>>>>>> general I've tried to >>>>>>>>>>>>>>>>>>>>> make the API as "Swift-y" as possible, hence the heavy >>>>>>>>>>>>>>>>>>>>> reliance on closures >>>>>>>>>>>>>>>>>>>>> and while there aren't yet composite PTransforms there's >>>>>>>>>>>>>>>>>>>>> the beginnings of >>>>>>>>>>>>>>>>>>>>> what would be needed for a SwiftUI-like declarative API >>>>>>>>>>>>>>>>>>>>> for creating them. 
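For readers who want a feel for the prototype before opening the branch, a rough sketch of the kind of wordcount described above, written in the pardo/map/groupBy vocabulary from this thread — the exact names and signatures (in particular `readFiles` and the runner invocation) are my assumptions, so see the branch for the real API:

```swift
// Rough sketch in the prototype's style, not the actual API.
// groupBy is described in this thread as map + groupByKey under the
// hood; readFiles is a hypothetical stand-in for the simple file IO.
try Pipeline { pipeline in
    let words = pipeline
        .readFiles(["kinglear.txt"])   // hypothetical file source
        .flatMap { line in line.split(separator: " ").map(String.init) }
    words
        .groupBy { word in word }
        .map { (word, occurrences) in "\(word): \(occurrences.count)" }
        .pardo(name: "Print") { lines, output, errors in
            for try await (line, _, _) in lines { print(line) }
        }
}.run()
```

The closure-heavy style and the stream-based pardo at the end are the "Swift-y" choices discussed earlier in the thread.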
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> There are of course a ton of missing bits still to be >>>>>>>>>>>>>>>>>>>>> implemented, like counters, metrics, windowing, state, >>>>>>>>>>>>>>>>>>>>> timers, etc. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> This should be fine and we can get the code documented >>>>>>>>>>>>>>>>>>>> without these features. I think support for composites and >>>>>>>>>>>>>>>>>>>> adding an >>>>>>>>>>>>>>>>>>>> external transform (see, Java >>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/External.java>, >>>>>>>>>>>>>>>>>>>> Python >>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/python/apache_beam/transforms/external.py#L556>, >>>>>>>>>>>>>>>>>>>> Go >>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/go/pkg/beam/xlang.go#L155>, >>>>>>>>>>>>>>>>>>>> TypeScript >>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/external.ts>) >>>>>>>>>>>>>>>>>>>> to add support for multi-lang will bring in a lot of >>>>>>>>>>>>>>>>>>>> features (for example, >>>>>>>>>>>>>>>>>>>> I/O connectors) for free. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Any and all feedback welcome and happy to submit a PR >>>>>>>>>>>>>>>>>>>>> if folks are interested, though the "Swift Way" would be >>>>>>>>>>>>>>>>>>>>> to have it in its >>>>>>>>>>>>>>>>>>>>> own repo so that it can easily be used from the Swift >>>>>>>>>>>>>>>>>>>>> Package Manager. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> +1 for creating a PR (may be as a draft initially). 
>>>>>>>>>>>>>>>>>>>> Also it'll be easier to comment on a PR :) >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> - Cham >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>> B >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>