Do we anticipate any short-term changes to the release process to start releasing Swift SDK artifacts, or can we hold that off for a while as long as the SDK is in active development?
On Mon, Sep 25, 2023 at 9:56 AM Robert Burke <rob...@frantil.com> wrote: > I lost this thread for a bit. I'm glad Prism showed some use while it's > doing unfused stages! > > I have no objections to a separate repo, and in a "Beam Go SDK V3" world > that's what I'd want as well, because it works better for the Go usage > patterns and is more natural for the tooling. And it would be a cleaner way > to do a full overhaul of the user API given the way Go has evolved since > its initial design, and our own experience with it. But that's a very > different topic for when I have a real proposal around it. > > I do see the clean thread Kenn started, but since I have no objections, > I'll leave it to silent consensus. > > I agree that copying/building the protos isn't a burden, since that's > entirely what protos are for. We're already treating them as properly > stable and not making breaking proto changes, so compatibility is maintained by > normal proto behavior. > > Robert Burke > Beam Go Busybody > > On Thu, Sep 21, 2023, 9:52 AM Byron Ellis via user <user@beam.apache.org> > wrote: > >> Also, it seems like we're getting something like a consensus? Once the repo >> exists I'm happy to do the slog work of moving everything around (though >> I'm not a committer so somebody else actually has to do the pushes). We can >> do that in chunks to make life easier on people and I'm not super concerned >> with losing the commit history on my current branch. >> >> On Wed, Sep 20, 2023 at 11:10 AM Byron Ellis <byronel...@google.com> >> wrote: >> >>> I actually don't think we'll need any of the multi-repo GitHub Actions; >>> Swift packages are basically 1:1 with repos, so the build process will >>> actually do all the checkouts. What we'd do is put a test package in >>> sdks/swift (which works fine since it doesn't ever get used as a dependency) >>> that depends on the Swift SDK with the appropriate dependencies we want to >>> make sure we're testing. 
This should also catch breaking changes to the >>> protos (which in theory proto is helping us avoid). >>> >>> Syncing the protos hasn't been a huge deal and it's already scripted, so >>> it's definitely easy to automate. I also don't think we would want to do that >>> all the time anyway, as that would require pipeline authors to install >>> protoc for something that doesn't happen all that often. We can take care >>> of that for users. >>> >>> >>> On Wed, Sep 20, 2023 at 10:48 AM Danny McCormick < >>> dannymccorm...@google.com> wrote: >>> >>>> > I think the process should be similar to other code/design reviews >>>> for large contributions. I don't think you need a PMC involvement here. >>>> >>>> I think it does require PMC involvement to create the actual repo once >>>> we have public consensus. I tried the flow at >>>> https://infra.apache.org/version-control.html#create but it seems like >>>> it's PMC-only. It's unclear to me whether consensus has been achieved; maybe a >>>> dedicated voting thread with implied lazy consensus would help here. >>>> >>>> > Sure, we could definitely include things as a submodule for stuff >>>> like testing multi-language, though I think there's actually a cleaner way >>>> just using the Swift package manager's test facilities to access the swift >>>> sdk repo. >>>> >>>> +1 on avoiding submodules. If needed we could also use multi-repo >>>> checkout with GitHub Actions. I think my biggest question is what we'd >>>> actually be enforcing though. In general, I'd expect the normal update flow >>>> to be >>>> >>>> 1) Update Beam protos and/or multi-lang components (though the set of >>>> things that needs to be updated for multi-lang is unclear to me) >>>> 2) Mirror those changes to the Swift SDK. 
>>>> >>>> The thing that is most likely to be forgotten is the 2nd step, and that >>>> is hard to enforce with automation since the automation would either be on >>>> the first step which doesn't have anything to enforce or on some sort of >>>> schedule in the swift repo, which is less likely to be visible. I'm a >>>> little worried we wouldn't notice breakages until release time. >>>> >>>> I wonder how much stuff happens outside of the proto directory that >>>> needs to be mirrored. Could we just create scheduled automation to exactly >>>> copy changes in the proto directory and version changes for multi-lang >>>> stuff to the swift SDK repo? >>>> >>>> --------------------------------------------------------------------- >>>> >>>> Regardless, I'm +1 on a dedicated repo; I'd rather we take on some >>>> organizational weirdness than push that pain to users. >>>> >>>> Thanks, >>>> Danny >>>> >>>> On Wed, Sep 20, 2023 at 1:38 PM Byron Ellis via user < >>>> user@beam.apache.org> wrote: >>>> >>>>> Sure, we could definitely include things as a submodule for stuff like >>>>> testing multi-language, though I think there's actually a cleaner way just >>>>> using the Swift package manager's test facilities to access the swift sdk >>>>> repo. >>>>> >>>>> That would also be consistent with the user-side experience and let >>>>> us test things like build-time integrations with multi-language as well >>>>> (which is possible in Swift through compiler plugins) in the same way as a >>>>> pipeline author would. You also maybe get backwards compatibility testing >>>>> as a side effect in that case as well. 
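[For illustration, the SwiftPM-based testing setup described above might look roughly like the following manifest sketch. Everything here is an assumption: the beam-swift repo URL, the ApacheBeam product name, and the target names don't exist yet.]

```swift
// swift-tools-version:5.9
// Hypothetical Package.swift for a test-only package kept under sdks/swift in
// the main Beam repo. It is never consumed as a dependency itself; it exists
// only so CI can build and test against the separate beam-swift repo and
// surface breaking proto / multi-language changes. Repo URL and product
// names below are made up for illustration.
import PackageDescription

let package = Package(
    name: "BeamSwiftCompatTests",
    platforms: [.macOS(.v13)],
    dependencies: [
        // Since Swift "releases" are just repo tags, CI could pin a tag here
        // or track a branch to test against unreleased SDK commits.
        .package(url: "https://github.com/apache/beam-swift.git", branch: "main")
    ],
    targets: [
        .testTarget(
            name: "CompatTests",
            dependencies: [.product(name: "ApacheBeam", package: "beam-swift")]
        )
    ]
)
```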
>>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Wed, Sep 20, 2023 at 10:20 AM Chamikara Jayalath < >>>>> chamik...@google.com> wrote: >>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Sep 20, 2023 at 9:54 AM Byron Ellis <byronel...@google.com> >>>>>> wrote: >>>>>> >>>>>>> Hi all, >>>>>>> >>>>>>> I've chatted with a couple of people offline about this and my >>>>>>> impression is that folks are generally amenable to a separate repo to >>>>>>> match >>>>>>> the target community? I have no idea what the next steps would be though >>>>>>> other than guessing that there's probably some sort of PMC thing >>>>>>> involved? >>>>>>> Should I write something up somewhere? >>>>>>> >>>>>> >>>>>> I think the process should be similar to other code/design reviews >>>>>> for large contributions. I don't think you need a PMC involvement here. >>>>>> >>>>>> >>>>>>> >>>>>>> Best, >>>>>>> B >>>>>>> >>>>>>> On Thu, Sep 14, 2023 at 9:00 AM Byron Ellis <byronel...@google.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi all, >>>>>>>> >>>>>>>> I've been on vacation, but mostly working on getting External >>>>>>>> Transform support going (which in turn basically requires Schema >>>>>>>> support as >>>>>>>> well). It also looks like macros landed in Swift 5.9 for Linux so >>>>>>>> we'll be >>>>>>>> able to use those to do some compile-time automation. In particular, >>>>>>>> this >>>>>>>> lets us do something similar to what Java does with ByteBuddy for >>>>>>>> generating schema coders though it has to be ahead of time so not >>>>>>>> quite the >>>>>>>> same. (As far as I can tell this is a reason why macros got added to >>>>>>>> the >>>>>>>> language in the first place---Apple's SwiftData library makes heavy >>>>>>>> use of >>>>>>>> the feature). >>>>>>>> >>>>>>>> I do have one question for the group though: should the Swift SDK >>>>>>>> distribution take on Beam community properties or Swift community >>>>>>>> properties? 
Specifically, in the Swift world the Swift SDK would live >>>>>>>> in >>>>>>>> its own repo (beam-swift for example), which allows it to be most >>>>>>>> easily >>>>>>>> consumed and keeps the checkout size under control for users. >>>>>>>> "Releases" in >>>>>>>> the Swift world (much like Go) are just repo tags. The downside here is >>>>>>>> that there's overhead in setting up the various github actions and >>>>>>>> other >>>>>>>> CI/CD bits and bobs. >>>>>>>> >>>>>>>> >>>>>> >>>>>>> The alternative would be to keep it in the beam repo itself like it >>>>>>>> is now, but we'd probably want to move Package.swift to the root since >>>>>>>> for >>>>>>>> whatever reason the Swift community (much to some people's annoyance) >>>>>>>> has >>>>>>>> chosen to have packages only really able to live at the top of a repo. >>>>>>>> This >>>>>>>> has less overhead from a CI/CD perspective, but lots of overhead for >>>>>>>> users >>>>>>>> as they'd be checking out the entire Beam repo to use the SDK, which >>>>>>>> happens a lot. >>>>>>>> >>>>>>>> There's a third option which is basically "do both" but honestly >>>>>>>> that just seems like the worst of both worlds as it would require >>>>>>>> constant >>>>>>>> syncing if we wanted to make it possible for Swift users to target >>>>>>>> unreleased SDKs for development and testing. >>>>>>>> >>>>>>>> Personally, I would lean towards the former option (and would >>>>>>>> volunteer to set up & document the various automations) as it is >>>>>>>> lighter >>>>>>>> for the actual users of the SDK and more consistent with the community >>>>>>>> experience they expect. The CI/CD stuff is mostly a "do it once" >>>>>>>> whereas >>>>>>>> checking out the entire repo with many updates the user doesn't care >>>>>>>> about >>>>>>>> is something they will be doing all the time. FWIW some of our >>>>>>>> dependencies >>>>>>>> also chose this route---most notably GRPC which started with the latter >>>>>>>> approach and has moved to the former. 
>>>>>>>> >>>>>>> >>>>>> I believe existing SDKs benefit from living in the same repo. For >>>>>> example, it's easier to keep them consistent with any model/proto changes >>>>>> and it's easier to manage distributions/tags. Also it's easier to keep >>>>>> components consistent for multi-lang. If we add Swift to a separate repo, >>>>>> we'll probably have to add tooling/scripts to keep things consistent. >>>>>> Is it possible to create a separate repo, but also add a reference >>>>>> (and Gradle tasks) under "beam/sdks/swift" so that we can add Beam tests >>>>>> to >>>>>> make sure that things stay consistent ? >>>>>> >>>>>> Thanks, >>>>>> Cham >>>>>> >>>>>> >>>>>>> >>>>>>>> Interested to hear any feedback on the subject since I'm guessing >>>>>>>> it probably came up with the Go SDK back in the day? >>>>>>>> >>>>>>>> Best, >>>>>>>> B >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Aug 29, 2023 at 7:59 AM Byron Ellis <byronel...@google.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> After a couple of iterations (thanks rebo!) we've also gotten the >>>>>>>>> Swift SDK working with the new Prism runner. The fact that it doesn't >>>>>>>>> do >>>>>>>>> fusion caught a couple of configuration bugs (e.g. that the grpc >>>>>>>>> message >>>>>>>>> receiver buffer should be fairly large). It would seem that at the >>>>>>>>> moment >>>>>>>>> Prism and the Flink runner have similar orders of strictness when >>>>>>>>> interpreting the pipeline graph while the Python portable runner is >>>>>>>>> far >>>>>>>>> more forgiving. >>>>>>>>> >>>>>>>>> Also added support for bounded vs unbounded pcollections through >>>>>>>>> the "type" parameter when adding a pardo. Impulse is a bounded >>>>>>>>> pcollection >>>>>>>>> I believe? >>>>>>>>> >>>>>>>>> On Fri, Aug 25, 2023 at 2:04 PM Byron Ellis <byronel...@google.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Okay, after a brief detour through "get this working in the Flink >>>>>>>>>> Portable Runner" I think I have something pretty workable. 
>>>>>>>>>> >>>>>>>>>> PInput and POutput can actually be structs rather than protocols, >>>>>>>>>> which simplifies things quite a bit. It also allows us to use them >>>>>>>>>> with >>>>>>>>>> property wrappers for a SwiftUI-like experience if we want when >>>>>>>>>> defining >>>>>>>>>> DoFns (which is what I was originally intending to use them for). >>>>>>>>>> That also >>>>>>>>>> means the function signature you use for closures would match >>>>>>>>>> full-fledged >>>>>>>>>> DoFn definitions for the most part which is satisfying. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Aug 24, 2023 at 5:55 PM Byron Ellis < >>>>>>>>>> byronel...@google.com> wrote: >>>>>>>>>> >>>>>>>>>>> Okay, I tried a couple of different things. >>>>>>>>>>> >>>>>>>>>>> Implicitly passing the timestamp and window during iteration did >>>>>>>>>>> not go well. While physically possible it introduces an invisible >>>>>>>>>>> side >>>>>>>>>>> effect into loop iteration which confused me when I tried to use it >>>>>>>>>>> and I >>>>>>>>>>> implemented it. Also, I'm pretty sure there'd end up being some >>>>>>>>>>> sort of >>>>>>>>>>> race condition nightmare continuing down that path. >>>>>>>>>>> >>>>>>>>>>> What I decided to do instead was the following: >>>>>>>>>>> >>>>>>>>>>> 1. Rename the existing "pardo" functions to "pstream" and >>>>>>>>>>> require that they always emit a window and timestamp along with >>>>>>>>>>> their >>>>>>>>>>> value. This eliminates the side effect but lets us keep iteration >>>>>>>>>>> in a >>>>>>>>>>> bundle where that might be convenient. For example, in my cheesy GCS >>>>>>>>>>> implementation it means that I can keep an OAuth token around for >>>>>>>>>>> the >>>>>>>>>>> lifetime of the bundle as a local variable, which is convenient. 
>>>>>>>>>>> It's a bit >>>>>>>>>>> more typing for users of pstream, but the expectation here is that >>>>>>>>>>> if >>>>>>>>>>> you're using pstream functions You Know What You Are Doing and most >>>>>>>>>>> people >>>>>>>>>>> won't be using it directly. >>>>>>>>>>> >>>>>>>>>>> 2. Introduce a new set of pardo functions (I didn't do all of >>>>>>>>>>> them yet, but enough to test the functionality and decide I liked >>>>>>>>>>> it) which >>>>>>>>>>> take a function signature of (any PInput<InputType>,any >>>>>>>>>>> POutput<OutputType>). PInput takes the (InputType,Date,Window) >>>>>>>>>>> tuple and >>>>>>>>>>> converts it into a struct with friendlier names. Not strictly >>>>>>>>>>> necessary, >>>>>>>>>>> but makes the code nicer to read I think. POutput introduces emit >>>>>>>>>>> functions >>>>>>>>>>> that optionally allow you to specify a timestamp and a window. If >>>>>>>>>>> you don't >>>>>>>>>>> for either one it will take the timestamp and/or window of the >>>>>>>>>>> input. >>>>>>>>>>> >>>>>>>>>>> Trying to use that was pretty pleasant to use so I think we >>>>>>>>>>> should continue down that path. If you'd like to see it in use, I >>>>>>>>>>> reimplemented map() and flatMap() in terms of this new pardo >>>>>>>>>>> functionality. >>>>>>>>>>> >>>>>>>>>>> Code has been pushed to the branch/PR if you're interested in >>>>>>>>>>> taking a look. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Aug 24, 2023 at 2:15 PM Byron Ellis < >>>>>>>>>>> byronel...@google.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Gotcha, I think there's a fairly easy solution to link input >>>>>>>>>>>> and output streams.... Let me try it out... might even be possible >>>>>>>>>>>> to have >>>>>>>>>>>> both element and stream-wise closure pardos. 
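[A self-contained sketch of the PInput/POutput shape described in point (2): names and fields here are guesses based on this thread, not the actual SDK code, and the `from:` parameter is invented so the example compiles standalone. The point is just the defaulting behavior, where emit() inherits the input's timestamp and window unless the caller overrides them.]

```swift
import Foundation

// Toy stand-in for illustration; the real SDK window type is richer.
struct Window: Equatable { let maxTimestamp: Date }

// PInput wraps the (value, timestamp, window) tuple with friendlier names.
struct PInput<Of> {
    let value: Of
    let timestamp: Date
    let window: Window
}

// POutput's emit() optionally takes a timestamp and window; when omitted,
// the metadata of the input element is used, per point (2) above.
struct POutput<Of> {
    var emitted: [(Of, Date, Window)] = []
    mutating func emit<In>(_ value: Of, from input: PInput<In>,
                           timestamp: Date? = nil, window: Window? = nil) {
        emitted.append((value,
                        timestamp ?? input.timestamp,
                        window ?? input.window))
    }
}

let w = Window(maxTimestamp: Date(timeIntervalSince1970: 60))
let input = PInput(value: "hello",
                   timestamp: Date(timeIntervalSince1970: 30),
                   window: w)
var output = POutput<Int>()
output.emit(input.value.count, from: input)  // inherits input metadata
assert(output.emitted[0].1 == input.timestamp)
assert(output.emitted[0].2 == w)
```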
Definitely possible >>>>>>>>>>>> to have >>>>>>>>>>>> that at the DoFn level (called SerializableFn in the SDK because I >>>>>>>>>>>> want to >>>>>>>>>>>> use @DoFn as a macro) >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Aug 24, 2023 at 1:09 PM Robert Bradshaw < >>>>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Aug 24, 2023 at 12:58 PM Chamikara Jayalath < >>>>>>>>>>>>> chamik...@google.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, Aug 24, 2023 at 12:27 PM Robert Bradshaw < >>>>>>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> I would like to figure out a way to get the stream-y >>>>>>>>>>>>>>> interface to work, as I think it's more natural overall. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> One hypothesis is that if any elements are carried over loop >>>>>>>>>>>>>>> iterations, there will likely be some that are carried over >>>>>>>>>>>>>>> beyond the loop >>>>>>>>>>>>>>> (after all the callee doesn't know when the loop is supposed to >>>>>>>>>>>>>>> end). We >>>>>>>>>>>>>>> could reject "plain" elements that are emitted after this >>>>>>>>>>>>>>> point, requiring >>>>>>>>>>>>>>> one to emit timestamp-windowed-values. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Are you assuming that the same stream (or overlapping sets of >>>>>>>>>>>>>> data) are pushed to multiple workers ? I thought that the set of >>>>>>>>>>>>>> data >>>>>>>>>>>>>> streamed here are the data that belong to the current bundle >>>>>>>>>>>>>> (hence already >>>>>>>>>>>>>> assigned to the current worker) so any output from the current >>>>>>>>>>>>>> bundle >>>>>>>>>>>>>> invocation would be a valid output of that bundle. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> Yes, the content of the stream is exactly the contents of the >>>>>>>>>>>>> bundle. The question is how to do the input_element:output_element >>>>>>>>>>>>> correlation for automatically propagating metadata. 
>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> Related to this, we could enforce that the only >>>>>>>>>>>>>>> (user-accessible) way to get such a timestamped value is to >>>>>>>>>>>>>>> start with one, >>>>>>>>>>>>>>> e.g. a WindowedValue<T>.withValue(O) produces a >>>>>>>>>>>>>>> WindowedValue<O> with the >>>>>>>>>>>>>>> same metadata but a new value. Thus a user wanting to do >>>>>>>>>>>>>>> anything "fancy" >>>>>>>>>>>>>>> would have to explicitly request iteration over these windowed >>>>>>>>>>>>>>> values >>>>>>>>>>>>>>> rather than over the raw elements. (This is also forward >>>>>>>>>>>>>>> compatible with >>>>>>>>>>>>>>> expanding the metadata that can get attached, e.g. pane infos, >>>>>>>>>>>>>>> and makes >>>>>>>>>>>>>>> the right thing the easiest/most natural.) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Aug 24, 2023 at 12:10 PM Byron Ellis < >>>>>>>>>>>>>>> byronel...@google.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Ah, that is a good point—being element-wise would make >>>>>>>>>>>>>>>> managing windows and time stamps easier for the user. >>>>>>>>>>>>>>>> Fortunately it’s a >>>>>>>>>>>>>>>> fairly easy change to make and maybe even less typing for the >>>>>>>>>>>>>>>> user. I was >>>>>>>>>>>>>>>> originally thinking side inputs and metrics would happen >>>>>>>>>>>>>>>> outside the loop, >>>>>>>>>>>>>>>> but I think you want a class and not a closure at that point >>>>>>>>>>>>>>>> for sanity. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Thu, Aug 24, 2023 at 12:02 PM Robert Bradshaw < >>>>>>>>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Ah, I see. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Yeah, I've thought about using an iterable for the whole >>>>>>>>>>>>>>>>> bundle rather than start/finish bundle callbacks, but one of >>>>>>>>>>>>>>>>> the questions >>>>>>>>>>>>>>>>> is how that would impact implicit passing of the timestamp >>>>>>>>>>>>>>>>> (and other) >>>>>>>>>>>>>>>>> metadata from input elements to output elements. 
(You can of >>>>>>>>>>>>>>>>> course attach >>>>>>>>>>>>>>>>> the metadata to any output that happens in the loop body, but >>>>>>>>>>>>>>>>> it's very >>>>>>>>>>>>>>>>> easy to implicitly to break the 1:1 relationship here (e.g. >>>>>>>>>>>>>>>>> by doing >>>>>>>>>>>>>>>>> buffering or otherwise modifying local state) and this would >>>>>>>>>>>>>>>>> be hard to >>>>>>>>>>>>>>>>> detect. (I suppose trying to output after the loop finishes >>>>>>>>>>>>>>>>> could require >>>>>>>>>>>>>>>>> something more explicit). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 6:56 PM Byron Ellis < >>>>>>>>>>>>>>>>> byronel...@google.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Oh, I also forgot to mention that I included element-wise >>>>>>>>>>>>>>>>>> collection operations like "map" that eliminate the need for >>>>>>>>>>>>>>>>>> pardo in many >>>>>>>>>>>>>>>>>> cases. the groupBy command is actually a map + groupByKey >>>>>>>>>>>>>>>>>> under the hood. >>>>>>>>>>>>>>>>>> That was to be more consistent with Swift's collection >>>>>>>>>>>>>>>>>> protocol (and is >>>>>>>>>>>>>>>>>> also why PCollection and PCollectionStream are different >>>>>>>>>>>>>>>>>> types... >>>>>>>>>>>>>>>>>> PCollection implements map and friends as pipeline >>>>>>>>>>>>>>>>>> construction operations >>>>>>>>>>>>>>>>>> whereas PCollectionStream is an actual stream) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I just happened to push some "IO primitives" that uses >>>>>>>>>>>>>>>>>> map rather than pardo in a couple of places to do a true >>>>>>>>>>>>>>>>>> wordcount using >>>>>>>>>>>>>>>>>> good ol' Shakespeare and very very primitive GCS IO. 
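[The WindowedValue<T>.withValue idea discussed above could be sketched minimally like this; the type and member names are assumed for illustration, not taken from any SDK.]

```swift
import Foundation

struct Window: Equatable { let maxTimestamp: Date }

// The only user-accessible way to get a new windowed value is to derive it
// from an existing one, so timestamp/window metadata always propagates and
// the 1:1 input:output correlation stays explicit.
struct WindowedValue<T> {
    let value: T
    let timestamp: Date
    let window: Window

    // Same metadata, new value.
    func withValue<O>(_ newValue: O) -> WindowedValue<O> {
        WindowedValue<O>(value: newValue, timestamp: timestamp, window: window)
    }
}

let wv = WindowedValue(value: "shakespeare",
                       timestamp: Date(timeIntervalSince1970: 0),
                       window: Window(maxTimestamp: Date(timeIntervalSince1970: 60)))
let counted = wv.withValue(wv.value.count)  // new value, same metadata
assert(counted.value == 11)
assert(counted.timestamp == wv.timestamp && counted.window == wv.window)
```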
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>> B >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 6:08 PM Byron Ellis < >>>>>>>>>>>>>>>>>> byronel...@google.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Indeed :-) Yeah, I went back and forth on the pardo >>>>>>>>>>>>>>>>>>> syntax quite a bit before settling on where I ended up. >>>>>>>>>>>>>>>>>>> Ultimately I >>>>>>>>>>>>>>>>>>> decided to go with something that felt more Swift-y than >>>>>>>>>>>>>>>>>>> anything else >>>>>>>>>>>>>>>>>>> which means that rather than dealing with a single element >>>>>>>>>>>>>>>>>>> like you do in >>>>>>>>>>>>>>>>>>> the other SDKs you're dealing with a stream of elements >>>>>>>>>>>>>>>>>>> (which of course >>>>>>>>>>>>>>>>>>> will often be of size 1). That's a really natural paradigm >>>>>>>>>>>>>>>>>>> in the Swift >>>>>>>>>>>>>>>>>>> world especially with the async / await structures. So when >>>>>>>>>>>>>>>>>>> you see >>>>>>>>>>>>>>>>>>> something like: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> pardo(name:"Read Files") { filenames,output,errors in >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> for try await (filename,_,_) in filenames { >>>>>>>>>>>>>>>>>>> ... >>>>>>>>>>>>>>>>>>> output.emit(data) >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> filenames is the input stream and then output and errors >>>>>>>>>>>>>>>>>>> are both output streams. In theory you can have as many >>>>>>>>>>>>>>>>>>> output streams as >>>>>>>>>>>>>>>>>>> you like though at the moment there's a compiler bug in the >>>>>>>>>>>>>>>>>>> new type pack >>>>>>>>>>>>>>>>>>> feature that limits it to "as many as I felt like >>>>>>>>>>>>>>>>>>> supporting". 
Presumably >>>>>>>>>>>>>>>>>>> this will get fixed before the official 5.9 release which >>>>>>>>>>>>>>>>>>> will probably be >>>>>>>>>>>>>>>>>>> in the October timeframe if history is any guide) >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> If you had parameterization you wanted to send that >>>>>>>>>>>>>>>>>>> would look like pardo("Parameter") { >>>>>>>>>>>>>>>>>>> param,filenames,output,error in ... } >>>>>>>>>>>>>>>>>>> where "param" would take on the value of "Parameter." All >>>>>>>>>>>>>>>>>>> of this is being >>>>>>>>>>>>>>>>>>> typechecked at compile time BTW. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> the (filename,_,_) is a tuple spreading construct like >>>>>>>>>>>>>>>>>>> you have in ES6 and other things where "_" is Swift for >>>>>>>>>>>>>>>>>>> "ignore." In this >>>>>>>>>>>>>>>>>>> case PCollectionStreams have an element signature of >>>>>>>>>>>>>>>>>>> (Of,Date,Window) so >>>>>>>>>>>>>>>>>>> you can optionally extract the timestamp and the window if >>>>>>>>>>>>>>>>>>> you want to >>>>>>>>>>>>>>>>>>> manipulate it somehow. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> That said it would also be natural to provide >>>>>>>>>>>>>>>>>>> elementwise pardos--- that would probably mean having >>>>>>>>>>>>>>>>>>> explicit type >>>>>>>>>>>>>>>>>>> signatures in the closure. I had that at one point, but it >>>>>>>>>>>>>>>>>>> felt less >>>>>>>>>>>>>>>>>>> natural the more I used it. I'm also slowly working towards >>>>>>>>>>>>>>>>>>> adding a more >>>>>>>>>>>>>>>>>>> "traditional" DoFn implementation approach where you >>>>>>>>>>>>>>>>>>> implement the DoFn as >>>>>>>>>>>>>>>>>>> an object type. In that case it would be very very easy to >>>>>>>>>>>>>>>>>>> support both by >>>>>>>>>>>>>>>>>>> having a default stream implementation call the equivalent >>>>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>>> processElement. To make that performant I need to implement >>>>>>>>>>>>>>>>>>> an @DoFn macro >>>>>>>>>>>>>>>>>>> and I just haven't gotten to it yet. 
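[The stream-of-elements style reads naturally with Swift's async sequences. Here is a toy, SDK-free illustration of the `for try await (value,_,_)` pattern from the pardo example above, using AsyncStream and a plain Int standing in for a real window type.]

```swift
import Foundation

// A bundle sketched as an async stream of (value, timestamp, window) tuples;
// Int stands in for a real window type so this compiles without any SDK.
func toyBundle(_ values: [String]) -> AsyncStream<(String, Date, Int)> {
    AsyncStream { continuation in
        for v in values { continuation.yield((v, Date(), 0)) }
        continuation.finish()
    }
}

// A pardo-like body: iterate the bundle, destructure with `_` to ignore the
// metadata, and collect transformed values.
var out: [String] = []
for await (value, _, _) in toyBundle(["to", "be", "or"]) {
    out.append(value.uppercased())
}
print(out)  // ["TO", "BE", "OR"]
```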
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> It's a bit more work and I've been prioritizing >>>>>>>>>>>>>>>>>>> implementing composite and external transforms for the >>>>>>>>>>>>>>>>>>> reasons you suggest. >>>>>>>>>>>>>>>>>>> :-) I've got the basics of a composite transform (there's >>>>>>>>>>>>>>>>>>> an equivalent >>>>>>>>>>>>>>>>>>> wordcount example) and am hooking it into the pipeline >>>>>>>>>>>>>>>>>>> generation, which >>>>>>>>>>>>>>>>>>> should also give me everything I need to successfully hook >>>>>>>>>>>>>>>>>>> in external >>>>>>>>>>>>>>>>>>> transforms as well. That will give me the jump on IOs as >>>>>>>>>>>>>>>>>>> you say. I can >>>>>>>>>>>>>>>>>>> also treat the pipeline itself as a composite transform >>>>>>>>>>>>>>>>>>> which lets me get >>>>>>>>>>>>>>>>>>> rid of the Pipeline { pipeline in ... } and just instead >>>>>>>>>>>>>>>>>>> have things attach >>>>>>>>>>>>>>>>>>> themselves to the pipeline implicitly. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> That said, there are some interesting IO possibilities >>>>>>>>>>>>>>>>>>> that would be Swift native. In particularly, I've been >>>>>>>>>>>>>>>>>>> looking at the >>>>>>>>>>>>>>>>>>> native Swift binding for DuckDB (which is C++ based). >>>>>>>>>>>>>>>>>>> DuckDB is SQL based >>>>>>>>>>>>>>>>>>> but not distributed in the same was as, say, Beam SQL... >>>>>>>>>>>>>>>>>>> but it would allow >>>>>>>>>>>>>>>>>>> for SQL statements on individual files with projection >>>>>>>>>>>>>>>>>>> pushdown supported >>>>>>>>>>>>>>>>>>> for things like Parquet which could have some cool and >>>>>>>>>>>>>>>>>>> performant data lake >>>>>>>>>>>>>>>>>>> applications. I'll probably do a couple of the simpler IOs >>>>>>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>>>> well---there's a Swift AWS SDK binding that's pretty good >>>>>>>>>>>>>>>>>>> that would give >>>>>>>>>>>>>>>>>>> me S3 and there's a Cloud auth library as well that makes >>>>>>>>>>>>>>>>>>> it pretty easy to >>>>>>>>>>>>>>>>>>> work with GCS. 
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> In any case, I'm updating the branch as I find a minute >>>>>>>>>>>>>>>>>>> here and there. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>> B >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 5:02 PM Robert Bradshaw < >>>>>>>>>>>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Neat. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Nothing like writing and SDK to actually understand how >>>>>>>>>>>>>>>>>>>> the FnAPI works :). I like the use of groupBy. I have to >>>>>>>>>>>>>>>>>>>> admit I'm a bit >>>>>>>>>>>>>>>>>>>> mystified by the syntax for parDo (I don't know swift at >>>>>>>>>>>>>>>>>>>> all which is >>>>>>>>>>>>>>>>>>>> probably tripping me up). The addition of external >>>>>>>>>>>>>>>>>>>> (cross-language) >>>>>>>>>>>>>>>>>>>> transforms could let you steal everything (e.g. IOs) >>>>>>>>>>>>>>>>>>>> pretty quickly from >>>>>>>>>>>>>>>>>>>> other SDKs. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Fri, Aug 18, 2023 at 7:55 AM Byron Ellis via user < >>>>>>>>>>>>>>>>>>>> user@beam.apache.org> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> For everyone who is interested, here's the draft PR: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> https://github.com/apache/beam/pull/28062 >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I haven't had a chance to test it on my M1 machine yet >>>>>>>>>>>>>>>>>>>>> though (there's a good chance there are a few places that >>>>>>>>>>>>>>>>>>>>> need to properly >>>>>>>>>>>>>>>>>>>>> address endianness. 
Specifically timestamps in windowed >>>>>>>>>>>>>>>>>>>>> values and length >>>>>>>>>>>>>>>>>>>>> in iterable coders as those both use specifically >>>>>>>>>>>>>>>>>>>>> bigendian representations) >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Thu, Aug 17, 2023 at 8:57 PM Byron Ellis < >>>>>>>>>>>>>>>>>>>>> byronel...@google.com> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Thanks Cham, >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Definitely happy to open a draft PR so folks can >>>>>>>>>>>>>>>>>>>>>> comment---there's not as much code as it looks like >>>>>>>>>>>>>>>>>>>>>> since most of the LOC >>>>>>>>>>>>>>>>>>>>>> is just generated protobuf. As for the support, I >>>>>>>>>>>>>>>>>>>>>> definitely want to add >>>>>>>>>>>>>>>>>>>>>> external transforms and may actually add that support >>>>>>>>>>>>>>>>>>>>>> before adding the >>>>>>>>>>>>>>>>>>>>>> ability to make composites in the language itself. With >>>>>>>>>>>>>>>>>>>>>> the way the SDK is >>>>>>>>>>>>>>>>>>>>>> laid out adding composites to the pipeline graph is a >>>>>>>>>>>>>>>>>>>>>> separate operation >>>>>>>>>>>>>>>>>>>>>> than defining a composite. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 17, 2023 at 4:28 PM Chamikara Jayalath < >>>>>>>>>>>>>>>>>>>>>> chamik...@google.com> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Thanks Byron. This sounds great. I wonder if there >>>>>>>>>>>>>>>>>>>>>>> is interest in Swift SDK from folks currently >>>>>>>>>>>>>>>>>>>>>>> subscribed to the >>>>>>>>>>>>>>>>>>>>>>> +user <user@beam.apache.org> list. 
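[On the endianness point above: the standard coders write multi-byte integers most-significant-byte first (for example windowed-value timestamps and iterable length prefixes), so on little-endian hosts the bytes must be swapped explicitly rather than dumping the native representation. A small self-contained illustration:]

```swift
import Foundation

// Serialize an Int64 big-endian (most-significant byte first), independent
// of host endianness.
func bigEndianBytes(_ value: Int64) -> [UInt8] {
    withUnsafeBytes(of: value.bigEndian) { Array($0) }
}

let millis: Int64 = 1_692_000_000_000  // an arbitrary epoch-millis timestamp
let bytes = bigEndianBytes(millis)
assert(bytes.count == 8)
// 1_692_000_000_000 is a bit over 2^40, so the two high bytes are zero.
assert(bytes[0] == 0 && bytes[1] == 0 && bytes[2] != 0)
```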
>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 16, 2023 at 6:53 PM Byron Ellis via dev < >>>>>>>>>>>>>>>>>>>>>>> d...@beam.apache.org> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Hello everyone, >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> A couple of months ago I decided that I wanted to >>>>>>>>>>>>>>>>>>>>>>>> really understand how the Beam FnApi works and how it >>>>>>>>>>>>>>>>>>>>>>>> interacts with the >>>>>>>>>>>>>>>>>>>>>>>> Portable Runner. For me at least that usually means I >>>>>>>>>>>>>>>>>>>>>>>> need to write some >>>>>>>>>>>>>>>>>>>>>>>> code so I can see things happening in a debugger and >>>>>>>>>>>>>>>>>>>>>>>> to really prove to >>>>>>>>>>>>>>>>>>>>>>>> myself I understood what was going on I decided I >>>>>>>>>>>>>>>>>>>>>>>> couldn't use an existing >>>>>>>>>>>>>>>>>>>>>>>> SDK language to do it since there would be the >>>>>>>>>>>>>>>>>>>>>>>> temptation to read some code >>>>>>>>>>>>>>>>>>>>>>>> and convince myself that I actually understood what >>>>>>>>>>>>>>>>>>>>>>>> was going on. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> One thing led to another and it turns out that to >>>>>>>>>>>>>>>>>>>>>>>> get a minimal FnApi integration going you end up >>>>>>>>>>>>>>>>>>>>>>>> writing a fair bit of an >>>>>>>>>>>>>>>>>>>>>>>> SDK. So I decided to take things to a point where I >>>>>>>>>>>>>>>>>>>>>>>> had an SDK that could >>>>>>>>>>>>>>>>>>>>>>>> execute a word count example via a portable runner >>>>>>>>>>>>>>>>>>>>>>>> backend. I've now >>>>>>>>>>>>>>>>>>>>>>>> reached that point and would like to submit my >>>>>>>>>>>>>>>>>>>>>>>> prototype SDK to the list >>>>>>>>>>>>>>>>>>>>>>>> for feedback. 
>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> It's currently living in a branch on my fork here: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> https://github.com/byronellis/beam/tree/swift-sdk/sdks/swift >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> At the moment it runs via the most recent XCode >>>>>>>>>>>>>>>>>>>>>>>> Beta using Swift 5.9 on Intel Macs, but should also >>>>>>>>>>>>>>>>>>>>>>>> work using beta builds >>>>>>>>>>>>>>>>>>>>>>>> of 5.9 for Linux running on Intel hardware. I haven't >>>>>>>>>>>>>>>>>>>>>>>> had a chance to try >>>>>>>>>>>>>>>>>>>>>>>> it on ARM hardware and make sure all of the endian >>>>>>>>>>>>>>>>>>>>>>>> checks are complete. The >>>>>>>>>>>>>>>>>>>>>>>> "IntegrationTests.swift" file contains a word count >>>>>>>>>>>>>>>>>>>>>>>> example that reads some >>>>>>>>>>>>>>>>>>>>>>>> local files (as well as a missing file to exercise DLQ >>>>>>>>>>>>>>>>>>>>>>>> functionality) and >>>>>>>>>>>>>>>>>>>>>>>> output counts through two separate group by operations >>>>>>>>>>>>>>>>>>>>>>>> to get it past the >>>>>>>>>>>>>>>>>>>>>>>> "map reduce" size of pipeline. I've tested it against >>>>>>>>>>>>>>>>>>>>>>>> the Python Portable >>>>>>>>>>>>>>>>>>>>>>>> Runner. Since my goal was to learn FnApi there is no >>>>>>>>>>>>>>>>>>>>>>>> Direct Runner at this >>>>>>>>>>>>>>>>>>>>>>>> time. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> I've shown it to a couple of folks already and >>>>>>>>>>>>>>>>>>>>>>>> incorporated some of that feedback already (for >>>>>>>>>>>>>>>>>>>>>>>> example pardo was >>>>>>>>>>>>>>>>>>>>>>>> originally called dofn when defining pipelines). 
In >>>>>>>>>>>>>>>>>>>>>>>> general I've tried to >>>>>>>>>>>>>>>>>>>>>>>> make the API as "Swift-y" as possible, hence the heavy >>>>>>>>>>>>>>>>>>>>>>>> reliance on closures >>>>>>>>>>>>>>>>>>>>>>>> and while there aren't yet composite PTransforms >>>>>>>>>>>>>>>>>>>>>>>> there's the beginnings of >>>>>>>>>>>>>>>>>>>>>>>> what would be needed for a SwiftUI-like declarative >>>>>>>>>>>>>>>>>>>>>>>> API for creating them. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> There are of course a ton of missing bits still to >>>>>>>>>>>>>>>>>>>>>>>> be implemented, like counters, metrics, windowing, >>>>>>>>>>>>>>>>>>>>>>>> state, timers, etc. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> This should be fine and we can get the code >>>>>>>>>>>>>>>>>>>>>>> documented without these features. I think support for >>>>>>>>>>>>>>>>>>>>>>> composites and >>>>>>>>>>>>>>>>>>>>>>> adding an external transform (see, Java >>>>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/External.java>, >>>>>>>>>>>>>>>>>>>>>>> Python >>>>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/python/apache_beam/transforms/external.py#L556>, >>>>>>>>>>>>>>>>>>>>>>> Go >>>>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/go/pkg/beam/xlang.go#L155>, >>>>>>>>>>>>>>>>>>>>>>> TypeScript >>>>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/external.ts>) >>>>>>>>>>>>>>>>>>>>>>> to add support for multi-lang will bring in a lot of >>>>>>>>>>>>>>>>>>>>>>> features (for example, >>>>>>>>>>>>>>>>>>>>>>> I/O connectors) for free. 
>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Any and all feedback welcome and happy to submit a >>>>>>>>>>>>>>>>>>>>>>>> PR if folks are interested, though the "Swift Way" >>>>>>>>>>>>>>>>>>>>>>>> would be to have it in >>>>>>>>>>>>>>>>>>>>>>>> its own repo so that it can easily be used from the >>>>>>>>>>>>>>>>>>>>>>>> Swift Package Manager. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> +1 for creating a PR (may be as a draft initially). >>>>>>>>>>>>>>>>>>>>>>> Also it'll be easier to comment on a PR :) >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> - Cham >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>>>>>>>> [2] >>>>>>>>>>>>>>>>>>>>>>> [3] >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>>>>> B >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>