Do we anticipate any short-term changes to the release process to start releasing Swift SDK artifacts, or can we hold that off for a while as long as the SDK is in active development?
On Mon, Sep 25, 2023 at 9:56 AM Robert Burke <rob...@frantil.com> wrote: > I lost this thread for a bit. I'm glad Prism showed some use while it's > doing unfused stages! > > I have no objections to a separate repo, and in a "Beam Go SDK V3" world > that's what I'd want as well, because it works better for the Go usage > patterns and is more natural for the tooling. And it would be a cleaner way > to do a full overhaul of the user API given the way Go has evolved since > its initial design, and our own experience with it. But that's a very > different topic for when I have a real proposal around it. > > I do see the clean thread Kenn started, but since I have no objections, > I'll leave it to silent consensus. > > I agree that copying/building the protos isn't a burden, since that's > entirely what protos are for. We're already treating them as properly > stable and not making breaking proto changes, so compatibility is maintained by > normal proto behavior. > > Robert Burke > Beam Go Busybody > > On Thu, Sep 21, 2023, 9:52 AM Byron Ellis via user <user@beam.apache.org> > wrote: > >> Also, it seems like we're getting something like a consensus? Once the repo >> exists I'm happy to do the slog work of moving everything around (though >> I'm not a committer so somebody else actually has to do the pushes). We can >> do that in chunks to make life easier on people and I'm not super concerned >> with losing the commit history on my current branch. >> >> On Wed, Sep 20, 2023 at 11:10 AM Byron Ellis <byronel...@google.com> >> wrote: >> >>> I actually don't think we'll need any of the multi-repo GitHub Actions; >>> Swift packages are basically 1:1 with repos, so the build process will >>> actually do all the checkouts. What we'd do is put a test package in >>> sdks/swift (which works fine since it doesn't ever get used as a dependency) >>> that depends on the Swift SDK with the appropriate dependencies we want to >>> make sure we're testing. 
This should also catch breaking changes to the >>> protos (which in theory proto is helping us avoid). >>> >>> Syncing the protos hasn't been a huge deal and it's already scripted, so >>> it's definitely easy to automate. I also don't think we would want to do that >>> all the time anyway, as that would require pipeline authors to install >>> protoc for something that doesn't happen all that often. We can take care >>> of that for users. >>> >>> >>> On Wed, Sep 20, 2023 at 10:48 AM Danny McCormick < >>> dannymccorm...@google.com> wrote: >>> >>>> > I think the process should be similar to other code/design reviews >>>> for large contributions. I don't think you need a PMC involvement here. >>>> >>>> I think it does require PMC involvement to create the actual repo once >>>> we have public consensus. I tried the flow at >>>> https://infra.apache.org/version-control.html#create but it seems like >>>> it's PMC-only. It's unclear to me whether consensus has been achieved; maybe a >>>> dedicated voting thread with implied lazy consensus would help here. >>>> >>>> > Sure, we could definitely include things as a submodule for stuff >>>> like testing multi-language, though I think there's actually a cleaner way >>>> just using the Swift package manager's test facilities to access the swift >>>> sdk repo. >>>> >>>> +1 on avoiding submodules. If needed we could also use multi-repo >>>> checkout with GitHub Actions. I think my biggest question is what we'd >>>> actually be enforcing though. In general, I'd expect the normal update flow >>>> to be >>>> >>>> 1) Update Beam protos and/or multi-lang components (though the set of >>>> things that needs to be updated for multi-lang is unclear to me) >>>> 2) Mirror those changes to the Swift SDK. 
>>>> >>>> The thing that is most likely to be forgotten is the 2nd step, and that >>>> is hard to enforce with automation since the automation would either be on >>>> the first step which doesn't have anything to enforce or on some sort of >>>> schedule in the swift repo, which is less likely to be visible. I'm a >>>> little worried we wouldn't notice breakages until release time. >>>> >>>> I wonder how much stuff happens outside of the proto directory that >>>> needs to be mirrored. Could we just create scheduled automation to exactly >>>> copy changes in the proto directory and version changes for multi-lang >>>> stuff to the swift SDK repo? >>>> >>>> --------------------------------------------------------------------- >>>> >>>> Regardless, I'm +1 on a dedicated repo; I'd rather we take on some >>>> organizational weirdness than push that pain to users. >>>> >>>> Thanks, >>>> Danny >>>> >>>> On Wed, Sep 20, 2023 at 1:38 PM Byron Ellis via user < >>>> user@beam.apache.org> wrote: >>>> >>>>> Sure, we could definitely include things as a submodule for stuff like >>>>> testing multi-language, though I think there's actually a cleaner way just >>>>> using the Swift package manager's test facilities to access the swift sdk >>>>> repo. >>>>> >>>>> That would also be consistent with the user-side experience and let >>>>> us test things like build-time integrations with multi-language as well >>>>> (which is possible in Swift through compiler plugins) in the same way as a >>>>> pipeline author would. You also maybe get backwards compatibility testing >>>>> as a side effect in that case as well. 
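[For illustration, the SwiftPM-based testing setup described above might look roughly like the following manifest sketch. Everything here is an assumption: the beam-swift repo URL, the ApacheBeam product name, and the target names don't exist yet.]

```swift
// swift-tools-version:5.9
// Hypothetical Package.swift for a test-only package kept under sdks/swift in
// the main Beam repo. It is never consumed as a dependency itself; it exists
// only so CI can build and test against the separate beam-swift repo and
// surface breaking proto / multi-language changes. Repo URL and product
// names below are made up for illustration.
import PackageDescription

let package = Package(
    name: "BeamSwiftCompatTests",
    platforms: [.macOS(.v13)],
    dependencies: [
        // Since Swift "releases" are just repo tags, CI could pin a tag here
        // or track a branch to test against unreleased SDK commits.
        .package(url: "https://github.com/apache/beam-swift.git", branch: "main")
    ],
    targets: [
        .testTarget(
            name: "CompatTests",
            dependencies: [.product(name: "ApacheBeam", package: "beam-swift")]
        )
    ]
)
```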
>>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Wed, Sep 20, 2023 at 10:20 AM Chamikara Jayalath < >>>>> chamik...@google.com> wrote: >>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Sep 20, 2023 at 9:54 AM Byron Ellis <byronel...@google.com> >>>>>> wrote: >>>>>> >>>>>>> Hi all, >>>>>>> >>>>>>> I've chatted with a couple of people offline about this and my >>>>>>> impression is that folks are generally amenable to a separate repo to >>>>>>> match >>>>>>> the target community? I have no idea what the next steps would be though >>>>>>> other than guessing that there's probably some sort of PMC thing >>>>>>> involved? >>>>>>> Should I write something up somewhere? >>>>>>> >>>>>> >>>>>> I think the process should be similar to other code/design reviews >>>>>> for large contributions. I don't think you need a PMC involvement here. >>>>>> >>>>>> >>>>>>> >>>>>>> Best, >>>>>>> B >>>>>>> >>>>>>> On Thu, Sep 14, 2023 at 9:00 AM Byron Ellis <byronel...@google.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi all, >>>>>>>> >>>>>>>> I've been on vacation, but mostly working on getting External >>>>>>>> Transform support going (which in turn basically requires Schema >>>>>>>> support as >>>>>>>> well). It also looks like macros landed in Swift 5.9 for Linux so >>>>>>>> we'll be >>>>>>>> able to use those to do some compile-time automation. In particular, >>>>>>>> this >>>>>>>> lets us do something similar to what Java does with ByteBuddy for >>>>>>>> generating schema coders though it has to be ahead of time so not >>>>>>>> quite the >>>>>>>> same. (As far as I can tell this is a reason why macros got added to >>>>>>>> the >>>>>>>> language in the first place---Apple's SwiftData library makes heavy >>>>>>>> use of >>>>>>>> the feature). >>>>>>>> >>>>>>>> I do have one question for the group though: should the Swift SDK >>>>>>>> distribution take on Beam community properties or Swift community >>>>>>>> properties? 
Specifically, in the Swift world the Swift SDK would live >>>>>>>> in >>>>>>>> its own repo (beam-swift for example), which allows it to be most >>>>>>>> easily >>>>>>>> consumed and keeps the checkout size under control for users. >>>>>>>> "Releases" in >>>>>>>> the Swift world (much like Go) are just repo tags. The downside here is >>>>>>>> that there's overhead in setting up the various github actions and >>>>>>>> other >>>>>>>> CI/CD bits and bobs. >>>>>>>> >>>>>>>> >>>>>> >>>>>>> The alternative would be to keep it in the beam repo itself like it >>>>>>>> is now, but we'd probably want to move Package.swift to the root since >>>>>>>> for >>>>>>>> whatever reason the Swift community (much to some people's annoyance) >>>>>>>> has >>>>>>>> chosen to have packages only really able to live at the top of a repo. >>>>>>>> This >>>>>>>> has less overhead from a CI/CD perspective, but lots of overhead for >>>>>>>> users >>>>>>>> as they'd be checking out the entire Beam repo to use the SDK, which >>>>>>>> happens a lot. >>>>>>>> >>>>>>>> There's a third option which is basically "do both" but honestly >>>>>>>> that just seems like the worst of both worlds as it would require >>>>>>>> constant >>>>>>>> syncing if we wanted to make it possible for Swift users to target >>>>>>>> unreleased SDKs for development and testing. >>>>>>>> >>>>>>>> Personally, I would lean towards the former option (and would >>>>>>>> volunteer to set up & document the various automations) as it is >>>>>>>> lighter >>>>>>>> for the actual users of the SDK and more consistent with the community >>>>>>>> experience they expect. The CI/CD stuff is mostly a "do it once" >>>>>>>> whereas >>>>>>>> checking out the entire repo with many updates the user doesn't care >>>>>>>> about >>>>>>>> is something they will be doing all the time. FWIW some of our >>>>>>>> dependencies >>>>>>>> also chose this route---most notably GRPC which started with the latter >>>>>>>> approach and has moved to the former. 
>>>>>>>> >>>>>>> >>>>>> I believe existing SDKs benefit from living in the same repo. For >>>>>> example, it's easier to keep them consistent with any model/proto changes >>>>>> and it's easier to manage distributions/tags. Also it's easier to keep >>>>>> components consistent for multi-lang. If we add Swift to a separate repo, >>>>>> we'll probably have to add tooling/scripts to keep things consistent. >>>>>> Is it possible to create a separate repo, but also add a reference >>>>>> (and Gradle tasks) under "beam/sdks/swift" so that we can add Beam tests >>>>>> to >>>>>> make sure that things stay consistent ? >>>>>> >>>>>> Thanks, >>>>>> Cham >>>>>> >>>>>> >>>>>>> >>>>>>>> Interested to hear any feedback on the subject since I'm guessing >>>>>>>> it probably came up with the Go SDK back in the day? >>>>>>>> >>>>>>>> Best, >>>>>>>> B >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Aug 29, 2023 at 7:59 AM Byron Ellis <byronel...@google.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> After a couple of iterations (thanks rebo!) we've also gotten the >>>>>>>>> Swift SDK working with the new Prism runner. The fact that it doesn't >>>>>>>>> do >>>>>>>>> fusion caught a couple of configuration bugs (e.g. that the grpc >>>>>>>>> message >>>>>>>>> receiver buffer should be fairly large). It would seem that at the >>>>>>>>> moment >>>>>>>>> Prism and the Flink runner have similar orders of strictness when >>>>>>>>> interpreting the pipeline graph while the Python portable runner is >>>>>>>>> far >>>>>>>>> more forgiving. >>>>>>>>> >>>>>>>>> Also added support for bounded vs unbounded pcollections through >>>>>>>>> the "type" parameter when adding a pardo. Impulse is a bounded >>>>>>>>> pcollection >>>>>>>>> I believe? >>>>>>>>> >>>>>>>>> On Fri, Aug 25, 2023 at 2:04 PM Byron Ellis <byronel...@google.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Okay, after a brief detour through "get this working in the Flink >>>>>>>>>> Portable Runner" I think I have something pretty workable. 
>>>>>>>>>> >>>>>>>>>> PInput and POutput can actually be structs rather than protocols, >>>>>>>>>> which simplifies things quite a bit. It also allows us to use them >>>>>>>>>> with >>>>>>>>>> property wrappers for a SwiftUI-like experience if we want when >>>>>>>>>> defining >>>>>>>>>> DoFns (which is what I was originally intending to use them for). >>>>>>>>>> That also >>>>>>>>>> means the function signature you use for closures would match >>>>>>>>>> full-fledged >>>>>>>>>> DoFn definitions for the most part which is satisfying. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Aug 24, 2023 at 5:55 PM Byron Ellis < >>>>>>>>>> byronel...@google.com> wrote: >>>>>>>>>> >>>>>>>>>>> Okay, I tried a couple of different things. >>>>>>>>>>> >>>>>>>>>>> Implicitly passing the timestamp and window during iteration did >>>>>>>>>>> not go well. While physically possible it introduces an invisible >>>>>>>>>>> side >>>>>>>>>>> effect into loop iteration which confused me when I tried to use it >>>>>>>>>>> and I >>>>>>>>>>> implemented it. Also, I'm pretty sure there'd end up being some >>>>>>>>>>> sort of >>>>>>>>>>> race condition nightmare continuing down that path. >>>>>>>>>>> >>>>>>>>>>> What I decided to do instead was the following: >>>>>>>>>>> >>>>>>>>>>> 1. Rename the existing "pardo" functions to "pstream" and >>>>>>>>>>> require that they always emit a window and timestamp along with >>>>>>>>>>> their >>>>>>>>>>> value. This eliminates the side effect but lets us keep iteration >>>>>>>>>>> in a >>>>>>>>>>> bundle where that might be convenient. For example, in my cheesy GCS >>>>>>>>>>> implementation it means that I can keep an OAuth token around for >>>>>>>>>>> the >>>>>>>>>>> lifetime of the bundle as a local variable, which is convenient. 
>>>>>>>>>>> It's a bit >>>>>>>>>>> more typing for users of pstream, but the expectation here is that >>>>>>>>>>> if >>>>>>>>>>> you're using pstream functions You Know What You Are Doing and most >>>>>>>>>>> people >>>>>>>>>>> won't be using it directly. >>>>>>>>>>> >>>>>>>>>>> 2. Introduce a new set of pardo functions (I didn't do all of >>>>>>>>>>> them yet, but enough to test the functionality and decide I liked >>>>>>>>>>> it) which >>>>>>>>>>> take a function signature of (any PInput<InputType>,any >>>>>>>>>>> POutput<OutputType>). PInput takes the (InputType,Date,Window) >>>>>>>>>>> tuple and >>>>>>>>>>> converts it into a struct with friendlier names. Not strictly >>>>>>>>>>> necessary, >>>>>>>>>>> but makes the code nicer to read I think. POutput introduces emit >>>>>>>>>>> functions >>>>>>>>>>> that optionally allow you to specify a timestamp and a window. If >>>>>>>>>>> you don't >>>>>>>>>>> for either one it will take the timestamp and/or window of the >>>>>>>>>>> input. >>>>>>>>>>> >>>>>>>>>>> Trying to use that was pretty pleasant to use so I think we >>>>>>>>>>> should continue down that path. If you'd like to see it in use, I >>>>>>>>>>> reimplemented map() and flatMap() in terms of this new pardo >>>>>>>>>>> functionality. >>>>>>>>>>> >>>>>>>>>>> Code has been pushed to the branch/PR if you're interested in >>>>>>>>>>> taking a look. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Aug 24, 2023 at 2:15 PM Byron Ellis < >>>>>>>>>>> byronel...@google.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Gotcha, I think there's a fairly easy solution to link input >>>>>>>>>>>> and output streams.... Let me try it out... might even be possible >>>>>>>>>>>> to have >>>>>>>>>>>> both element and stream-wise closure pardos. 
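[A self-contained sketch of the PInput/POutput shape described in point (2): names and fields here are guesses based on this thread, not the actual SDK code, and the `from:` parameter is invented so the example compiles standalone. The point is just the defaulting behavior, where emit() inherits the input's timestamp and window unless the caller overrides them.]

```swift
import Foundation

// Toy stand-in for illustration; the real SDK window type is richer.
struct Window: Equatable { let maxTimestamp: Date }

// PInput wraps the (value, timestamp, window) tuple with friendlier names.
struct PInput<Of> {
    let value: Of
    let timestamp: Date
    let window: Window
}

// POutput's emit() optionally takes a timestamp and window; when omitted,
// the metadata of the input element is used, per point (2) above.
struct POutput<Of> {
    var emitted: [(Of, Date, Window)] = []
    mutating func emit<In>(_ value: Of, from input: PInput<In>,
                           timestamp: Date? = nil, window: Window? = nil) {
        emitted.append((value,
                        timestamp ?? input.timestamp,
                        window ?? input.window))
    }
}

let w = Window(maxTimestamp: Date(timeIntervalSince1970: 60))
let input = PInput(value: "hello",
                   timestamp: Date(timeIntervalSince1970: 30),
                   window: w)
var output = POutput<Int>()
output.emit(input.value.count, from: input)  // inherits input metadata
assert(output.emitted[0].1 == input.timestamp)
assert(output.emitted[0].2 == w)
```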
Definitely possible >>>>>>>>>>>> to have >>>>>>>>>>>> that at the DoFn level (called SerializableFn in the SDK because I >>>>>>>>>>>> want to >>>>>>>>>>>> use @DoFn as a macro) >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Aug 24, 2023 at 1:09 PM Robert Bradshaw < >>>>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Aug 24, 2023 at 12:58 PM Chamikara Jayalath < >>>>>>>>>>>>> chamik...@google.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, Aug 24, 2023 at 12:27 PM Robert Bradshaw < >>>>>>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> I would like to figure out a way to get the stream-y >>>>>>>>>>>>>>> interface to work, as I think it's more natural overall. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> One hypothesis is that if any elements are carried over loop >>>>>>>>>>>>>>> iterations, there will likely be some that are carried over >>>>>>>>>>>>>>> beyond the loop >>>>>>>>>>>>>>> (after all the callee doesn't know when the loop is supposed to >>>>>>>>>>>>>>> end). We >>>>>>>>>>>>>>> could reject "plain" elements that are emitted after this >>>>>>>>>>>>>>> point, requiring >>>>>>>>>>>>>>> one to emit timestamp-windowed-values. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Are you assuming that the same stream (or overlapping sets of >>>>>>>>>>>>>> data) are pushed to multiple workers ? I thought that the set of >>>>>>>>>>>>>> data >>>>>>>>>>>>>> streamed here are the data that belong to the current bundle >>>>>>>>>>>>>> (hence already >>>>>>>>>>>>>> assigned to the current worker) so any output from the current >>>>>>>>>>>>>> bundle >>>>>>>>>>>>>> invocation would be a valid output of that bundle. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> Yes, the content of the stream is exactly the contents of the >>>>>>>>>>>>> bundle. The question is how to do the input_element:output_element >>>>>>>>>>>>> correlation for automatically propagating metadata. 
>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> Related to this, we could enforce that the only >>>>>>>>>>>>>>> (user-accessible) way to get such a timestamped value is to >>>>>>>>>>>>>>> start with one, >>>>>>>>>>>>>>> e.g. a WindowedValue<T>.withValue(O) produces a >>>>>>>>>>>>>>> WindowedValue<O> with the >>>>>>>>>>>>>>> same metadata but a new value. Thus a user wanting to do >>>>>>>>>>>>>>> anything "fancy" >>>>>>>>>>>>>>> would have to explicitly request iteration over these windowed >>>>>>>>>>>>>>> values >>>>>>>>>>>>>>> rather than over the raw elements. (This is also forward >>>>>>>>>>>>>>> compatible with >>>>>>>>>>>>>>> expanding the metadata that can get attached, e.g. pane infos, >>>>>>>>>>>>>>> and makes >>>>>>>>>>>>>>> the right thing the easiest/most natural.) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Aug 24, 2023 at 12:10 PM Byron Ellis < >>>>>>>>>>>>>>> byronel...@google.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Ah, that is a good point—being element-wise would make >>>>>>>>>>>>>>>> managing windows and time stamps easier for the user. >>>>>>>>>>>>>>>> Fortunately it’s a >>>>>>>>>>>>>>>> fairly easy change to make and maybe even less typing for the >>>>>>>>>>>>>>>> user. I was >>>>>>>>>>>>>>>> originally thinking side inputs and metrics would happen >>>>>>>>>>>>>>>> outside the loop, >>>>>>>>>>>>>>>> but I think you want a class and not a closure at that point >>>>>>>>>>>>>>>> for sanity. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Thu, Aug 24, 2023 at 12:02 PM Robert Bradshaw < >>>>>>>>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Ah, I see. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Yeah, I've thought about using an iterable for the whole >>>>>>>>>>>>>>>>> bundle rather than start/finish bundle callbacks, but one of >>>>>>>>>>>>>>>>> the questions >>>>>>>>>>>>>>>>> is how that would impact implicit passing of the timestamp >>>>>>>>>>>>>>>>> (and other) >>>>>>>>>>>>>>>>> metadata from input elements to output elements. 
(You can of >>>>>>>>>>>>>>>>> course attach >>>>>>>>>>>>>>>>> the metadata to any output that happens in the loop body, but >>>>>>>>>>>>>>>>> it's very >>>>>>>>>>>>>>>>> easy to implicitly to break the 1:1 relationship here (e.g. >>>>>>>>>>>>>>>>> by doing >>>>>>>>>>>>>>>>> buffering or otherwise modifying local state) and this would >>>>>>>>>>>>>>>>> be hard to >>>>>>>>>>>>>>>>> detect. (I suppose trying to output after the loop finishes >>>>>>>>>>>>>>>>> could require >>>>>>>>>>>>>>>>> something more explicit). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 6:56 PM Byron Ellis < >>>>>>>>>>>>>>>>> byronel...@google.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Oh, I also forgot to mention that I included element-wise >>>>>>>>>>>>>>>>>> collection operations like "map" that eliminate the need for >>>>>>>>>>>>>>>>>> pardo in many >>>>>>>>>>>>>>>>>> cases. the groupBy command is actually a map + groupByKey >>>>>>>>>>>>>>>>>> under the hood. >>>>>>>>>>>>>>>>>> That was to be more consistent with Swift's collection >>>>>>>>>>>>>>>>>> protocol (and is >>>>>>>>>>>>>>>>>> also why PCollection and PCollectionStream are different >>>>>>>>>>>>>>>>>> types... >>>>>>>>>>>>>>>>>> PCollection implements map and friends as pipeline >>>>>>>>>>>>>>>>>> construction operations >>>>>>>>>>>>>>>>>> whereas PCollectionStream is an actual stream) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I just happened to push some "IO primitives" that uses >>>>>>>>>>>>>>>>>> map rather than pardo in a couple of places to do a true >>>>>>>>>>>>>>>>>> wordcount using >>>>>>>>>>>>>>>>>> good ol' Shakespeare and very very primitive GCS IO. 
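[The WindowedValue<T>.withValue idea discussed above could be sketched minimally like this; the type and member names are assumed for illustration, not taken from any SDK.]

```swift
import Foundation

struct Window: Equatable { let maxTimestamp: Date }

// The only user-accessible way to get a new windowed value is to derive it
// from an existing one, so timestamp/window metadata always propagates and
// the 1:1 input:output correlation stays explicit.
struct WindowedValue<T> {
    let value: T
    let timestamp: Date
    let window: Window

    // Same metadata, new value.
    func withValue<O>(_ newValue: O) -> WindowedValue<O> {
        WindowedValue<O>(value: newValue, timestamp: timestamp, window: window)
    }
}

let wv = WindowedValue(value: "shakespeare",
                       timestamp: Date(timeIntervalSince1970: 0),
                       window: Window(maxTimestamp: Date(timeIntervalSince1970: 60)))
let counted = wv.withValue(wv.value.count)  // new value, same metadata
assert(counted.value == 11)
assert(counted.timestamp == wv.timestamp && counted.window == wv.window)
```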
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>> B >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 6:08 PM Byron Ellis < >>>>>>>>>>>>>>>>>> byronel...@google.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Indeed :-) Yeah, I went back and forth on the pardo >>>>>>>>>>>>>>>>>>> syntax quite a bit before settling on where I ended up. >>>>>>>>>>>>>>>>>>> Ultimately I >>>>>>>>>>>>>>>>>>> decided to go with something that felt more Swift-y than >>>>>>>>>>>>>>>>>>> anything else >>>>>>>>>>>>>>>>>>> which means that rather than dealing with a single element >>>>>>>>>>>>>>>>>>> like you do in >>>>>>>>>>>>>>>>>>> the other SDKs you're dealing with a stream of elements >>>>>>>>>>>>>>>>>>> (which of course >>>>>>>>>>>>>>>>>>> will often be of size 1). That's a really natural paradigm >>>>>>>>>>>>>>>>>>> in the Swift >>>>>>>>>>>>>>>>>>> world especially with the async / await structures. So when >>>>>>>>>>>>>>>>>>> you see >>>>>>>>>>>>>>>>>>> something like: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> pardo(name:"Read Files") { filenames,output,errors in >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> for try await (filename,_,_) in filenames { >>>>>>>>>>>>>>>>>>> ... >>>>>>>>>>>>>>>>>>> output.emit(data) >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> filenames is the input stream and then output and errors >>>>>>>>>>>>>>>>>>> are both output streams. In theory you can have as many >>>>>>>>>>>>>>>>>>> output streams as >>>>>>>>>>>>>>>>>>> you like though at the moment there's a compiler bug in the >>>>>>>>>>>>>>>>>>> new type pack >>>>>>>>>>>>>>>>>>> feature that limits it to "as many as I felt like >>>>>>>>>>>>>>>>>>> supporting". 
Presumably >>>>>>>>>>>>>>>>>>> this will get fixed before the official 5.9 release which >>>>>>>>>>>>>>>>>>> will probably be >>>>>>>>>>>>>>>>>>> in the October timeframe if history is any guide) >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> If you had parameterization you wanted to send that >>>>>>>>>>>>>>>>>>> would look like pardo("Parameter") { >>>>>>>>>>>>>>>>>>> param,filenames,output,error in ... } >>>>>>>>>>>>>>>>>>> where "param" would take on the value of "Parameter." All >>>>>>>>>>>>>>>>>>> of this is being >>>>>>>>>>>>>>>>>>> typechecked at compile time BTW. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> the (filename,_,_) is a tuple spreading construct like >>>>>>>>>>>>>>>>>>> you have in ES6 and other things where "_" is Swift for >>>>>>>>>>>>>>>>>>> "ignore." In this >>>>>>>>>>>>>>>>>>> case PCollectionStreams have an element signature of >>>>>>>>>>>>>>>>>>> (Of,Date,Window) so >>>>>>>>>>>>>>>>>>> you can optionally extract the timestamp and the window if >>>>>>>>>>>>>>>>>>> you want to >>>>>>>>>>>>>>>>>>> manipulate it somehow. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> That said it would also be natural to provide >>>>>>>>>>>>>>>>>>> elementwise pardos--- that would probably mean having >>>>>>>>>>>>>>>>>>> explicit type >>>>>>>>>>>>>>>>>>> signatures in the closure. I had that at one point, but it >>>>>>>>>>>>>>>>>>> felt less >>>>>>>>>>>>>>>>>>> natural the more I used it. I'm also slowly working towards >>>>>>>>>>>>>>>>>>> adding a more >>>>>>>>>>>>>>>>>>> "traditional" DoFn implementation approach where you >>>>>>>>>>>>>>>>>>> implement the DoFn as >>>>>>>>>>>>>>>>>>> an object type. In that case it would be very very easy to >>>>>>>>>>>>>>>>>>> support both by >>>>>>>>>>>>>>>>>>> having a default stream implementation call the equivalent >>>>>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>>>>> processElement. To make that performant I need to implement >>>>>>>>>>>>>>>>>>> an @DoFn macro >>>>>>>>>>>>>>>>>>> and I just haven't gotten to it yet. 
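[The stream-of-elements style reads naturally with Swift's async sequences. Here is a toy, SDK-free illustration of the `for try await (value,_,_)` pattern from the pardo example above, using AsyncStream and a plain Int standing in for a real window type.]

```swift
import Foundation

// A bundle sketched as an async stream of (value, timestamp, window) tuples;
// Int stands in for a real window type so this compiles without any SDK.
func toyBundle(_ values: [String]) -> AsyncStream<(String, Date, Int)> {
    AsyncStream { continuation in
        for v in values { continuation.yield((v, Date(), 0)) }
        continuation.finish()
    }
}

// A pardo-like body: iterate the bundle, destructure with `_` to ignore the
// metadata, and collect transformed values.
var out: [String] = []
for await (value, _, _) in toyBundle(["to", "be", "or"]) {
    out.append(value.uppercased())
}
print(out)  // ["TO", "BE", "OR"]
```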
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> It's a bit more work and I've been prioritizing >>>>>>>>>>>>>>>>>>> implementing composite and external transforms for the >>>>>>>>>>>>>>>>>>> reasons you suggest. >>>>>>>>>>>>>>>>>>> :-) I've got the basics of a composite transform (there's >>>>>>>>>>>>>>>>>>> an equivalent >>>>>>>>>>>>>>>>>>> wordcount example) and am hooking it into the pipeline >>>>>>>>>>>>>>>>>>> generation, which >>>>>>>>>>>>>>>>>>> should also give me everything I need to successfully hook >>>>>>>>>>>>>>>>>>> in external >>>>>>>>>>>>>>>>>>> transforms as well. That will give me the jump on IOs as >>>>>>>>>>>>>>>>>>> you say. I can >>>>>>>>>>>>>>>>>>> also treat the pipeline itself as a composite transform >>>>>>>>>>>>>>>>>>> which lets me get >>>>>>>>>>>>>>>>>>> rid of the Pipeline { pipeline in ... } and just instead >>>>>>>>>>>>>>>>>>> have things attach >>>>>>>>>>>>>>>>>>> themselves to the pipeline implicitly. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> That said, there are some interesting IO possibilities >>>>>>>>>>>>>>>>>>> that would be Swift native. In particularly, I've been >>>>>>>>>>>>>>>>>>> looking at the >>>>>>>>>>>>>>>>>>> native Swift binding for DuckDB (which is C++ based). >>>>>>>>>>>>>>>>>>> DuckDB is SQL based >>>>>>>>>>>>>>>>>>> but not distributed in the same was as, say, Beam SQL... >>>>>>>>>>>>>>>>>>> but it would allow >>>>>>>>>>>>>>>>>>> for SQL statements on individual files with projection >>>>>>>>>>>>>>>>>>> pushdown supported >>>>>>>>>>>>>>>>>>> for things like Parquet which could have some cool and >>>>>>>>>>>>>>>>>>> performant data lake >>>>>>>>>>>>>>>>>>> applications. I'll probably do a couple of the simpler IOs >>>>>>>>>>>>>>>>>>> as >>>>>>>>>>>>>>>>>>> well---there's a Swift AWS SDK binding that's pretty good >>>>>>>>>>>>>>>>>>> that would give >>>>>>>>>>>>>>>>>>> me S3 and there's a Cloud auth library as well that makes >>>>>>>>>>>>>>>>>>> it pretty easy to >>>>>>>>>>>>>>>>>>> work with GCS. 
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> In any case, I'm updating the branch as I find a minute >>>>>>>>>>>>>>>>>>> here and there. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>> B >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 5:02 PM Robert Bradshaw < >>>>>>>>>>>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Neat. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Nothing like writing and SDK to actually understand how >>>>>>>>>>>>>>>>>>>> the FnAPI works :). I like the use of groupBy. I have to >>>>>>>>>>>>>>>>>>>> admit I'm a bit >>>>>>>>>>>>>>>>>>>> mystified by the syntax for parDo (I don't know swift at >>>>>>>>>>>>>>>>>>>> all which is >>>>>>>>>>>>>>>>>>>> probably tripping me up). The addition of external >>>>>>>>>>>>>>>>>>>> (cross-language) >>>>>>>>>>>>>>>>>>>> transforms could let you steal everything (e.g. IOs) >>>>>>>>>>>>>>>>>>>> pretty quickly from >>>>>>>>>>>>>>>>>>>> other SDKs. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Fri, Aug 18, 2023 at 7:55 AM Byron Ellis via user < >>>>>>>>>>>>>>>>>>>> user@beam.apache.org> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> For everyone who is interested, here's the draft PR: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> https://github.com/apache/beam/pull/28062 >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I haven't had a chance to test it on my M1 machine yet >>>>>>>>>>>>>>>>>>>>> though (there's a good chance there are a few places that >>>>>>>>>>>>>>>>>>>>> need to properly >>>>>>>>>>>>>>>>>>>>> address endianness. 
Specifically timestamps in windowed >>>>>>>>>>>>>>>>>>>>> values and length >>>>>>>>>>>>>>>>>>>>> in iterable coders as those both use specifically >>>>>>>>>>>>>>>>>>>>> bigendian representations) >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Thu, Aug 17, 2023 at 8:57 PM Byron Ellis < >>>>>>>>>>>>>>>>>>>>> byronel...@google.com> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Thanks Cham, >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Definitely happy to open a draft PR so folks can >>>>>>>>>>>>>>>>>>>>>> comment---there's not as much code as it looks like >>>>>>>>>>>>>>>>>>>>>> since most of the LOC >>>>>>>>>>>>>>>>>>>>>> is just generated protobuf. As for the support, I >>>>>>>>>>>>>>>>>>>>>> definitely want to add >>>>>>>>>>>>>>>>>>>>>> external transforms and may actually add that support >>>>>>>>>>>>>>>>>>>>>> before adding the >>>>>>>>>>>>>>>>>>>>>> ability to make composites in the language itself. With >>>>>>>>>>>>>>>>>>>>>> the way the SDK is >>>>>>>>>>>>>>>>>>>>>> laid out adding composites to the pipeline graph is a >>>>>>>>>>>>>>>>>>>>>> separate operation >>>>>>>>>>>>>>>>>>>>>> than defining a composite. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Thu, Aug 17, 2023 at 4:28 PM Chamikara Jayalath < >>>>>>>>>>>>>>>>>>>>>> chamik...@google.com> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Thanks Byron. This sounds great. I wonder if there >>>>>>>>>>>>>>>>>>>>>>> is interest in Swift SDK from folks currently >>>>>>>>>>>>>>>>>>>>>>> subscribed to the >>>>>>>>>>>>>>>>>>>>>>> +user <user@beam.apache.org> list. 
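[On the endianness point above: the standard coders write multi-byte integers most-significant-byte first (for example windowed-value timestamps and iterable length prefixes), so on little-endian hosts the bytes must be swapped explicitly rather than dumping the native representation. A small self-contained illustration:]

```swift
import Foundation

// Serialize an Int64 big-endian (most-significant byte first), independent
// of host endianness.
func bigEndianBytes(_ value: Int64) -> [UInt8] {
    withUnsafeBytes(of: value.bigEndian) { Array($0) }
}

let millis: Int64 = 1_692_000_000_000  // an arbitrary epoch-millis timestamp
let bytes = bigEndianBytes(millis)
assert(bytes.count == 8)
// 1_692_000_000_000 is a bit over 2^40, so the two high bytes are zero.
assert(bytes[0] == 0 && bytes[1] == 0 && bytes[2] != 0)
```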
>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Wed, Aug 16, 2023 at 6:53 PM Byron Ellis via dev < >>>>>>>>>>>>>>>>>>>>>>> d...@beam.apache.org> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Hello everyone, >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> A couple of months ago I decided that I wanted to >>>>>>>>>>>>>>>>>>>>>>>> really understand how the Beam FnApi works and how it >>>>>>>>>>>>>>>>>>>>>>>> interacts with the >>>>>>>>>>>>>>>>>>>>>>>> Portable Runner. For me at least that usually means I >>>>>>>>>>>>>>>>>>>>>>>> need to write some >>>>>>>>>>>>>>>>>>>>>>>> code so I can see things happening in a debugger and >>>>>>>>>>>>>>>>>>>>>>>> to really prove to >>>>>>>>>>>>>>>>>>>>>>>> myself I understood what was going on I decided I >>>>>>>>>>>>>>>>>>>>>>>> couldn't use an existing >>>>>>>>>>>>>>>>>>>>>>>> SDK language to do it since there would be the >>>>>>>>>>>>>>>>>>>>>>>> temptation to read some code >>>>>>>>>>>>>>>>>>>>>>>> and convince myself that I actually understood what >>>>>>>>>>>>>>>>>>>>>>>> was going on. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> One thing led to another and it turns out that to >>>>>>>>>>>>>>>>>>>>>>>> get a minimal FnApi integration going you end up >>>>>>>>>>>>>>>>>>>>>>>> writing a fair bit of an >>>>>>>>>>>>>>>>>>>>>>>> SDK. So I decided to take things to a point where I >>>>>>>>>>>>>>>>>>>>>>>> had an SDK that could >>>>>>>>>>>>>>>>>>>>>>>> execute a word count example via a portable runner >>>>>>>>>>>>>>>>>>>>>>>> backend. I've now >>>>>>>>>>>>>>>>>>>>>>>> reached that point and would like to submit my >>>>>>>>>>>>>>>>>>>>>>>> prototype SDK to the list >>>>>>>>>>>>>>>>>>>>>>>> for feedback. 
>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> It's currently living in a branch on my fork here: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> https://github.com/byronellis/beam/tree/swift-sdk/sdks/swift >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> At the moment it runs via the most recent XCode >>>>>>>>>>>>>>>>>>>>>>>> Beta using Swift 5.9 on Intel Macs, but should also >>>>>>>>>>>>>>>>>>>>>>>> work using beta builds >>>>>>>>>>>>>>>>>>>>>>>> of 5.9 for Linux running on Intel hardware. I haven't >>>>>>>>>>>>>>>>>>>>>>>> had a chance to try >>>>>>>>>>>>>>>>>>>>>>>> it on ARM hardware and make sure all of the endian >>>>>>>>>>>>>>>>>>>>>>>> checks are complete. The >>>>>>>>>>>>>>>>>>>>>>>> "IntegrationTests.swift" file contains a word count >>>>>>>>>>>>>>>>>>>>>>>> example that reads some >>>>>>>>>>>>>>>>>>>>>>>> local files (as well as a missing file to exercise DLQ >>>>>>>>>>>>>>>>>>>>>>>> functionality) and >>>>>>>>>>>>>>>>>>>>>>>> output counts through two separate group by operations >>>>>>>>>>>>>>>>>>>>>>>> to get it past the >>>>>>>>>>>>>>>>>>>>>>>> "map reduce" size of pipeline. I've tested it against >>>>>>>>>>>>>>>>>>>>>>>> the Python Portable >>>>>>>>>>>>>>>>>>>>>>>> Runner. Since my goal was to learn FnApi there is no >>>>>>>>>>>>>>>>>>>>>>>> Direct Runner at this >>>>>>>>>>>>>>>>>>>>>>>> time. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> I've shown it to a couple of folks already and >>>>>>>>>>>>>>>>>>>>>>>> incorporated some of that feedback already (for >>>>>>>>>>>>>>>>>>>>>>>> example pardo was >>>>>>>>>>>>>>>>>>>>>>>> originally called dofn when defining pipelines). 
In >>>>>>>>>>>>>>>>>>>>>>>> general I've tried to >>>>>>>>>>>>>>>>>>>>>>>> make the API as "Swift-y" as possible, hence the heavy >>>>>>>>>>>>>>>>>>>>>>>> reliance on closures >>>>>>>>>>>>>>>>>>>>>>>> and while there aren't yet composite PTransforms >>>>>>>>>>>>>>>>>>>>>>>> there's the beginnings of >>>>>>>>>>>>>>>>>>>>>>>> what would be needed for a SwiftUI-like declarative >>>>>>>>>>>>>>>>>>>>>>>> API for creating them. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> There are of course a ton of missing bits still to >>>>>>>>>>>>>>>>>>>>>>>> be implemented, like counters, metrics, windowing, >>>>>>>>>>>>>>>>>>>>>>>> state, timers, etc. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> This should be fine and we can get the code >>>>>>>>>>>>>>>>>>>>>>> documented without these features. I think support for >>>>>>>>>>>>>>>>>>>>>>> composites and >>>>>>>>>>>>>>>>>>>>>>> adding an external transform (see, Java >>>>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/External.java>, >>>>>>>>>>>>>>>>>>>>>>> Python >>>>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/python/apache_beam/transforms/external.py#L556>, >>>>>>>>>>>>>>>>>>>>>>> Go >>>>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/go/pkg/beam/xlang.go#L155>, >>>>>>>>>>>>>>>>>>>>>>> TypeScript >>>>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/external.ts>) >>>>>>>>>>>>>>>>>>>>>>> to add support for multi-lang will bring in a lot of >>>>>>>>>>>>>>>>>>>>>>> features (for example, >>>>>>>>>>>>>>>>>>>>>>> I/O connectors) for free. 
>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Any and all feedback welcome and happy to submit a >>>>>>>>>>>>>>>>>>>>>>>> PR if folks are interested, though the "Swift Way" >>>>>>>>>>>>>>>>>>>>>>>> would be to have it in >>>>>>>>>>>>>>>>>>>>>>>> its own repo so that it can easily be used from the >>>>>>>>>>>>>>>>>>>>>>>> Swift Package Manager. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> +1 for creating a PR (may be as a draft initially). >>>>>>>>>>>>>>>>>>>>>>> Also it'll be easier to comment on a PR :) >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> - Cham >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>>>>>>>> [2] >>>>>>>>>>>>>>>>>>>>>>> [3] >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>>>>>>>>>>> B >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>