Sure, we could definitely include things as a submodule for stuff like
testing multi-language, though I think there's actually a cleaner way: just
use the Swift package manager's test facilities to access the Swift SDK
repo.

That would also be consistent with the user-side experience and let us
test things like build-time integrations with multi-language (which is
possible in Swift through compiler plugins) in the same way a pipeline
author would. We'd also potentially get backwards-compatibility testing as a
side effect.
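
To make that concrete, here's a rough sketch of what the test-side package
could look like (the repo URL, product name, and target names below are
placeholders, not decisions):

// swift-tools-version:5.9
import PackageDescription

// Hypothetical test-only package that pulls in the standalone Swift SDK via
// SPM, so multi-language and compatibility tests consume the SDK the same
// way a pipeline author would.
let package = Package(
    name: "BeamSwiftCompatibilityTests",
    dependencies: [
        // Track main for tip-of-tree testing, or pin a release tag
        // (e.g. from: "0.1.0") for backwards-compatibility coverage.
        .package(url: "https://github.com/apache/beam-swift.git", branch: "main"),
    ],
    targets: [
        .testTarget(
            name: "MultiLanguageTests",
            dependencies: [.product(name: "ApacheBeam", package: "beam-swift")]
        ),
    ]
)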

On Wed, Sep 20, 2023 at 10:20 AM Chamikara Jayalath <chamik...@google.com>
wrote:

>
>
>
> On Wed, Sep 20, 2023 at 9:54 AM Byron Ellis <byronel...@google.com> wrote:
>
>> Hi all,
>>
>> I've chatted with a couple of people offline about this and my impression
>> is that folks are generally amenable to a separate repo to match the target
>> community? I have no idea what the next steps would be though other than
>> guessing that there's probably some sort of PMC thing involved? Should I
>> write something up somewhere?
>>
>
> I think the process should be similar to other code/design reviews for
> large contributions. I don't think you need PMC involvement here.
>
>
>>
>> Best,
>> B
>>
>> On Thu, Sep 14, 2023 at 9:00 AM Byron Ellis <byronel...@google.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I've been on vacation, but mostly working on getting External Transform
>>> support going (which in turn basically requires Schema support as well). It
>>> also looks like macros landed in Swift 5.9 for Linux, so we'll be able to
>>> use those to do some compile-time automation. In particular, this lets us
>>> do something similar to what Java does with ByteBuddy for generating schema
>>> coders, though it has to happen ahead of time, so it's not quite the same.
>>> (As far as I can tell this is a reason why macros got added to the language
>>> in the first place: Apple's SwiftData library makes heavy use of the feature.)
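>>>
>>> As a rough illustration of what I mean (all of the names here are invented
>>> for the example, not the actual SDK surface), the macro would essentially
>>> write this kind of conformance for you ahead of time:
>>>
>>> protocol SchemaRepresentable {
>>>     static var fieldNames: [String] { get }
>>>     var fieldValues: [Any] { get }
>>> }
>>>
>>> struct WordCount {
>>>     var word: String
>>>     var count: Int
>>> }
>>>
>>> // Hand-written here; a @Schema-style macro would generate this at compile
>>> // time, playing roughly the role ByteBuddy plays at runtime in the Java SDK.
>>> extension WordCount: SchemaRepresentable {
>>>     static var fieldNames: [String] { ["word", "count"] }
>>>     var fieldValues: [Any] { [word, count] }
>>> }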
>>>
>>> I do have one question for the group though: should the Swift SDK
>>> distribution follow Beam community conventions or Swift community
>>> conventions? Specifically, in the Swift world the Swift SDK would live in
>>> its own repo (beam-swift, for example), which allows it to be most easily
>>> consumed and keeps the checkout size under control for users. "Releases" in
>>> the Swift world (much like Go) are just repo tags. The downside here is
>>> that there's overhead in setting up the various GitHub Actions and other
>>> CI/CD bits and bobs.
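>>>
>>> To make the user side concrete (the URL, product name, and version here are
>>> placeholders), consuming a tagged release from a standalone repo would just
>>> be the usual SPM dependency:
>>>
>>> // swift-tools-version:5.9
>>> import PackageDescription
>>>
>>> let package = Package(
>>>     name: "MyPipeline",
>>>     dependencies: [
>>>         // A "release" is just a tag on the beam-swift repo.
>>>         .package(url: "https://github.com/apache/beam-swift.git", from: "0.1.0"),
>>>     ],
>>>     targets: [
>>>         .executableTarget(
>>>             name: "MyPipeline",
>>>             dependencies: [.product(name: "ApacheBeam", package: "beam-swift")]
>>>         ),
>>>     ]
>>> )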
>>>
>>>
>
>>> The alternative would be to keep it in the beam repo itself like it is
>>> now, but we'd probably want to move Package.swift to the root since, for
>>> whatever reason, the Swift community (much to some people's annoyance) has
>>> chosen to only really support packages that live at the top of a repo. This
>>> has less overhead from a CI/CD perspective, but lots of overhead for users,
>>> as they'd be checking out the entire Beam repo to use the SDK, which
>>> happens a lot.
>>>
>>> There's a third option which is basically "do both" but honestly that
>>> just seems like the worst of both worlds as it would require constant
>>> syncing if we wanted to make it possible for Swift users to target
>>> unreleased SDKs for development and testing.
>>>
>>> Personally, I would lean towards the former option (and would volunteer
>>> to set up & document the various automations) as it is lighter for the
>>> actual users of the SDK and more consistent with the community experience
>>> they expect. The CI/CD stuff is mostly a "do it once" whereas checking out
>>> the entire repo with many updates the user doesn't care about is something
>>> they will be doing all the time. FWIW some of our dependencies also chose
>>> this route, most notably gRPC, which started with the latter approach and
>>> has moved to the former.
>>>
>>
> I believe existing SDKs benefit from living in the same repo. For example,
> it's easier to keep them consistent with any model/proto changes and it's
> easier to manage distributions/tags. Also it's easier to keep components
> consistent for multi-lang. If we add Swift to a separate repo, we'll
> probably have to add tooling/scripts to keep things consistent.
> Is it possible to create a separate repo, but also add a reference (and
> Gradle tasks) under "beam/sdks/swift" so that we can add Beam tests to make
> sure that things stay consistent?
>
> Thanks,
> Cham
>
>
>>
>>> Interested to hear any feedback on the subject since I'm guessing it
>>> probably came up with the Go SDK back in the day?
>>>
>>> Best,
>>> B
>>>
>>>
>>>
>>> On Tue, Aug 29, 2023 at 7:59 AM Byron Ellis <byronel...@google.com>
>>> wrote:
>>>
>>>> After a couple of iterations (thanks rebo!) we've also gotten the Swift
>>>> SDK working with the new Prism runner. The fact that it doesn't do fusion
>>>> caught a couple of configuration bugs (e.g. that the gRPC message receiver
>>>> buffer should be fairly large). It would seem that at the moment Prism and
>>>> the Flink runner are similarly strict when interpreting the pipeline
>>>> graph, while the Python portable runner is far more forgiving.
>>>>
>>>> Also added support for bounded vs. unbounded PCollections through the
>>>> "type" parameter when adding a pardo. Impulse is a bounded PCollection, I
>>>> believe?
>>>>
>>>> On Fri, Aug 25, 2023 at 2:04 PM Byron Ellis <byronel...@google.com>
>>>> wrote:
>>>>
>>>>> Okay, after a brief detour through "get this working in the Flink
>>>>> Portable Runner" I think I have something pretty workable.
>>>>>
>>>>> PInput and POutput can actually be structs rather than protocols,
>>>>> which simplifies things quite a bit. It also allows us to use them with
>>>>> property wrappers for a SwiftUI-like experience if we want when defining
>>>>> DoFns (which is what I was originally intending to use them for). That also
>>>>> means the function signature you use for closures would match full-fledged
>>>>> DoFn definitions for the most part, which is satisfying.
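>>>>>
>>>>> Purely as an illustration of the property-wrapper direction (none of these
>>>>> names are the actual SDK types, and a real version would need to plumb
>>>>> timestamps and windows through), the flavor would be something like:
>>>>>
>>>>> @propertyWrapper
>>>>> struct Output<Value> {
>>>>>     var wrappedValue: [Value] = []
>>>>>     init() {}
>>>>>     mutating func emit(_ value: Value) { wrappedValue.append(value) }
>>>>> }
>>>>>
>>>>> struct ExtractWords {
>>>>>     // Declares an output much the way SwiftUI declares @State.
>>>>>     @Output<String> var words: [String]
>>>>>
>>>>>     mutating func process(_ line: String) {
>>>>>         for word in line.split(separator: " ") {
>>>>>             _words.emit(String(word))
>>>>>         }
>>>>>     }
>>>>> }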
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Aug 24, 2023 at 5:55 PM Byron Ellis <byronel...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Okay, I tried a couple of different things.
>>>>>>
>>>>>> Implicitly passing the timestamp and window during iteration did not
>>>>>> go well. While physically possible, it introduces an invisible side effect
>>>>>> into loop iteration that confused me when I tried to use it, and I'm the
>>>>>> one who implemented it. Also, I'm pretty sure there'd end up being some
>>>>>> sort of race-condition nightmare continuing down that path.
>>>>>>
>>>>>> What I decided to do instead was the following:
>>>>>>
>>>>>> 1. Rename the existing "pardo" functions to "pstream" and require
>>>>>> that they always emit a window and timestamp along with their value. This
>>>>>> eliminates the side effect but lets us keep iteration within a bundle where
>>>>>> that might be convenient. For example, in my cheesy GCS implementation it
>>>>>> means that I can keep an OAuth token around for the lifetime of the bundle
>>>>>> as a local variable. It's a bit more typing for users of pstream, but the
>>>>>> expectation here is that if you're using pstream functions You Know What
>>>>>> You Are Doing, and most people won't be using them directly.
>>>>>>
>>>>>> 2. Introduce a new set of pardo functions (I didn't do all of them
>>>>>> yet, but enough to test the functionality and decide I liked it) which 
>>>>>> take
>>>>>> a function signature of (any PInput<InputType>,any POutput<OutputType>).
>>>>>> PInput takes the (InputType,Date,Window) tuple and converts it into a
>>>>>> struct with friendlier names. Not strictly necessary, but makes the code
>>>>>> nicer to read, I think. POutput introduces emit functions that optionally
>>>>>> allow you to specify a timestamp and a window. If you don't specify either
>>>>>> one, it will take the timestamp and/or window of the input.
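>>>>>>
>>>>>> To make that concrete, a minimal self-contained sketch of the shape I mean
>>>>>> (the names and the use of plain structs here are just for illustration,
>>>>>> not the actual declarations in the branch):
>>>>>>
>>>>>> import Foundation
>>>>>>
>>>>>> typealias Window = String // stand-in for the SDK's window type
>>>>>>
>>>>>> struct PInput<Of> {
>>>>>>     let value: Of
>>>>>>     let timestamp: Date
>>>>>>     let window: Window
>>>>>> }
>>>>>>
>>>>>> struct POutput<Of> {
>>>>>>     let inputTimestamp: Date
>>>>>>     let inputWindow: Window
>>>>>>     let sink: (Of, Date, Window) -> Void
>>>>>>
>>>>>>     // If no timestamp or window is given, fall back to the input element's.
>>>>>>     func emit(_ value: Of, timestamp: Date? = nil, window: Window? = nil) {
>>>>>>         sink(value, timestamp ?? inputTimestamp, window ?? inputWindow)
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> // A pardo body would then read roughly like:
>>>>>> //   pardo { (input: PInput<String>, output: POutput<Int>) in
>>>>>> //       output.emit(input.value.count) // inherits input's timestamp/window
>>>>>> //   }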
>>>>>>
>>>>>> That was pretty pleasant to use, so I think we should
>>>>>> continue down that path. If you'd like to see it in use, I reimplemented
>>>>>> map() and flatMap() in terms of this new pardo functionality.
>>>>>>
>>>>>> Code has been pushed to the branch/PR if you're interested in taking
>>>>>> a look.
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 24, 2023 at 2:15 PM Byron Ellis <byronel...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Gotcha, I think there's a fairly easy solution to link input and
>>>>>>> output streams... Let me try it out... it might even be possible to have
>>>>>>> both element-wise and stream-wise closure pardos. It's definitely possible
>>>>>>> to have that at the DoFn level (called SerializableFn in the SDK because I
>>>>>>> want to use @DoFn as a macro).
>>>>>>>
>>>>>>> On Thu, Aug 24, 2023 at 1:09 PM Robert Bradshaw <rober...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Thu, Aug 24, 2023 at 12:58 PM Chamikara Jayalath <
>>>>>>>> chamik...@google.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Aug 24, 2023 at 12:27 PM Robert Bradshaw <
>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> I would like to figure out a way to get the stream-y interface to
>>>>>>>>>> work, as I think it's more natural overall.
>>>>>>>>>>
>>>>>>>>>> One hypothesis is that if any elements are carried over loop
>>>>>>>>>> iterations, there will likely be some that are carried over beyond 
>>>>>>>>>> the loop
>>>>>>>>>> (after all the callee doesn't know when the loop is supposed to 
>>>>>>>>>> end). We
>>>>>>>>>> could reject "plain" elements that are emitted after this point, 
>>>>>>>>>> requiring
>>>>>>>>>> one to emit timestamp-windowed-values.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Are you assuming that the same stream (or overlapping sets of
>>>>>>>>> data) is pushed to multiple workers? I thought that the data streamed
>>>>>>>>> here is the data that belongs to the current bundle (hence already
>>>>>>>>> assigned to the current worker), so any output from the current bundle
>>>>>>>>> invocation would be a valid output of that bundle.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>> Yes, the content of the stream is exactly the contents of the
>>>>>>>> bundle. The question is how to do the input_element:output_element
>>>>>>>> correlation for automatically propagating metadata.
>>>>>>>>
>>>>>>>>
>>>>>>>>>> Related to this, we could enforce that the only (user-accessible)
>>>>>>>>>> way to get such a timestamped value is to start with one, e.g. a
>>>>>>>>>> WindowedValue<T>.withValue(O) produces a WindowedValue<O> with the 
>>>>>>>>>> same
>>>>>>>>>> metadata but a new value. Thus a user wanting to do anything "fancy" 
>>>>>>>>>> would
>>>>>>>>>> have to explicitly request iteration over these windowed values 
>>>>>>>>>> rather than
>>>>>>>>>> over the raw elements. (This is also forward compatible with 
>>>>>>>>>> expanding the
>>>>>>>>>> metadata that can get attached, e.g. pane infos, and makes the right 
>>>>>>>>>> thing
>>>>>>>>>> the easiest/most natural.)
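>>>>>>>>>>
>>>>>>>>>> To sketch the constraint (type and member names here are stand-ins, not
>>>>>>>>>> the SDK's actual declarations):
>>>>>>>>>>
>>>>>>>>>> import Foundation
>>>>>>>>>>
>>>>>>>>>> struct WindowedValue<T> {
>>>>>>>>>>     let value: T
>>>>>>>>>>     let timestamp: Date
>>>>>>>>>>     let window: String // stand-in for a real window type
>>>>>>>>>>
>>>>>>>>>>     // The only user-facing way to get a WindowedValue<O> is to derive it
>>>>>>>>>>     // from one you already have, so the metadata always carries over.
>>>>>>>>>>     func withValue<O>(_ newValue: O) -> WindowedValue<O> {
>>>>>>>>>>         WindowedValue<O>(value: newValue, timestamp: timestamp, window: window)
>>>>>>>>>>     }
>>>>>>>>>> }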
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 24, 2023 at 12:10 PM Byron Ellis <
>>>>>>>>>> byronel...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ah, that is a good point—being element-wise would make managing
>>>>>>>>>>> windows and timestamps easier for the user. Fortunately it's a
>>>>>>>>>>> fairly easy
>>>>>>>>>>> change to make and maybe even less typing for the user. I was 
>>>>>>>>>>> originally
>>>>>>>>>>> thinking side inputs and metrics would happen outside the loop, but 
>>>>>>>>>>> I think
>>>>>>>>>>> you want a class and not a closure at that point for sanity.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 24, 2023 at 12:02 PM Robert Bradshaw <
>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ah, I see.
>>>>>>>>>>>>
>>>>>>>>>>>> Yeah, I've thought about using an iterable for the whole bundle
>>>>>>>>>>>> rather than start/finish bundle callbacks, but one of the questions is
>>>>>>>>>>>> how that would impact implicit passing of the timestamp (and other)
>>>>>>>>>>>> metadata from input elements to output elements. (You can of course
>>>>>>>>>>>> attach the metadata to any output that happens in the loop body, but
>>>>>>>>>>>> it's very easy to implicitly break the 1:1 relationship here, e.g. by
>>>>>>>>>>>> doing buffering or otherwise modifying local state, and this would be
>>>>>>>>>>>> hard to detect. I suppose trying to output after the loop finishes
>>>>>>>>>>>> could require something more explicit.)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 23, 2023 at 6:56 PM Byron Ellis <
>>>>>>>>>>>> byronel...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Oh, I also forgot to mention that I included element-wise
>>>>>>>>>>>>> collection operations like "map" that eliminate the need for pardo in
>>>>>>>>>>>>> many cases. The groupBy command is actually a map + groupByKey under
>>>>>>>>>>>>> the hood. That was to be more consistent with Swift's collection
>>>>>>>>>>>>> protocol (and is also why PCollection and PCollectionStream are
>>>>>>>>>>>>> different types... PCollection implements map and friends as pipeline
>>>>>>>>>>>>> construction operations whereas PCollectionStream is an actual
>>>>>>>>>>>>> stream).
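>>>>>>>>>>>>>
>>>>>>>>>>>>> (Not the SDK API, just plain standard-library Swift to show the
>>>>>>>>>>>>> semantics I mean by "map + groupByKey under the hood":)
>>>>>>>>>>>>>
>>>>>>>>>>>>> let words = ["to", "be", "or", "not", "to", "be"]
>>>>>>>>>>>>> let grouped = Dictionary(grouping: words, by: { $0 }) // key extraction + group by key
>>>>>>>>>>>>> let counts = grouped.mapValues { $0.count } // e.g. ["to": 2, "be": 2, "or": 1, "not": 1]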
>>>>>>>>>>>>>
>>>>>>>>>>>>> I just happened to push some "IO primitives" that use map
>>>>>>>>>>>>> rather than pardo in a couple of places to do a true wordcount 
>>>>>>>>>>>>> using good
>>>>>>>>>>>>> ol' Shakespeare and very very primitive GCS IO.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> B
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 6:08 PM Byron Ellis <
>>>>>>>>>>>>> byronel...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Indeed :-) Yeah, I went back and forth on the pardo syntax
>>>>>>>>>>>>>> quite a bit before settling on where I ended up. Ultimately I 
>>>>>>>>>>>>>> decided to go
>>>>>>>>>>>>>> with something that felt more Swift-y than anything else which 
>>>>>>>>>>>>>> means that
>>>>>>>>>>>>>> rather than dealing with a single element like you do in the 
>>>>>>>>>>>>>> other SDKs
>>>>>>>>>>>>>> you're dealing with a stream of elements (which of course will 
>>>>>>>>>>>>>> often be of
>>>>>>>>>>>>>> size 1). That's a really natural paradigm in the Swift world 
>>>>>>>>>>>>>> especially
>>>>>>>>>>>>>> with the async / await structures. So when you see something 
>>>>>>>>>>>>>> like:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> pardo(name:"Read Files") { filenames,output,errors in
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> for try await (filename,_,_) in filenames {
>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>   output.emit(data)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> filenames is the input stream and then output and errors are
>>>>>>>>>>>>>> both output streams. In theory you can have as many output streams as
>>>>>>>>>>>>>> you like, though at the moment there's a compiler bug in the new type
>>>>>>>>>>>>>> pack feature that limits it to "as many as I felt like supporting".
>>>>>>>>>>>>>> (Presumably this will get fixed before the official 5.9 release, which
>>>>>>>>>>>>>> will probably be in the October timeframe, if history is any guide.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you had parameterization you wanted to send that would
>>>>>>>>>>>>>> look like pardo("Parameter") { param,filenames,output,error in 
>>>>>>>>>>>>>> ... } where
>>>>>>>>>>>>>> "param" would take on the value of "Parameter." All of this is 
>>>>>>>>>>>>>> being
>>>>>>>>>>>>>> typechecked at compile time BTW.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The (filename,_,_) is a tuple destructuring construct like you have
>>>>>>>>>>>>>> in ES6 and other languages, where "_" is Swift for "ignore." In this
>>>>>>>>>>>>>> case PCollectionStreams have an element signature of (Of,Date,Window)
>>>>>>>>>>>>>> so you can optionally extract the timestamp and the window if you want
>>>>>>>>>>>>>> to manipulate them somehow.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That said it would also be natural to provide elementwise
>>>>>>>>>>>>>> pardos--- that would probably mean having explicit type 
>>>>>>>>>>>>>> signatures in the
>>>>>>>>>>>>>> closure. I had that at one point, but it felt less natural the 
>>>>>>>>>>>>>> more I used
>>>>>>>>>>>>>> it. I'm also slowly working towards adding a more "traditional" 
>>>>>>>>>>>>>> DoFn
>>>>>>>>>>>>>> implementation approach where you implement the DoFn as an 
>>>>>>>>>>>>>> object type. In
>>>>>>>>>>>>>> that case it would be very very easy to support both by having a 
>>>>>>>>>>>>>> default
>>>>>>>>>>>>>> stream implementation call the equivalent of processElement. To 
>>>>>>>>>>>>>> make that
>>>>>>>>>>>>>> performant I need to implement an @DoFn macro and I just haven't 
>>>>>>>>>>>>>> gotten to
>>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's a bit more work and I've been prioritizing implementing
>>>>>>>>>>>>>> composite and external transforms for the reasons you suggest. 
>>>>>>>>>>>>>> :-) I've got
>>>>>>>>>>>>>> the basics of a composite transform (there's an equivalent 
>>>>>>>>>>>>>> wordcount
>>>>>>>>>>>>>> example) and am hooking it into the pipeline generation, which 
>>>>>>>>>>>>>> should also
>>>>>>>>>>>>>> give me everything I need to successfully hook in external 
>>>>>>>>>>>>>> transforms as
>>>>>>>>>>>>>> well. That will give me the jump on IOs as you say. I can also 
>>>>>>>>>>>>>> treat the
>>>>>>>>>>>>>> pipeline itself as a composite transform which lets me get rid 
>>>>>>>>>>>>>> of the
>>>>>>>>>>>>>> Pipeline { pipeline in ... } and just instead have things attach 
>>>>>>>>>>>>>> themselves
>>>>>>>>>>>>>> to the pipeline implicitly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That said, there are some interesting IO possibilities that
>>>>>>>>>>>>>> would be Swift native. In particular, I've been looking at the native
>>>>>>>>>>>>>> Swift binding for DuckDB (which is C++ based). DuckDB is SQL based but
>>>>>>>>>>>>>> not distributed in the same way as, say, Beam SQL... but it would allow
>>>>>>>>>>>>>> for SQL
>>>>>>>>>>>>>> statements on individual files with projection pushdown 
>>>>>>>>>>>>>> supported for
>>>>>>>>>>>>>> things like Parquet which could have some cool and performant 
>>>>>>>>>>>>>> data lake
>>>>>>>>>>>>>> applications. I'll probably do a couple of the simpler IOs as
>>>>>>>>>>>>>> well---there's a Swift AWS SDK binding that's pretty good that 
>>>>>>>>>>>>>> would give
>>>>>>>>>>>>>> me S3 and there's a Cloud auth library as well that makes it 
>>>>>>>>>>>>>> pretty easy to
>>>>>>>>>>>>>> work with GCS.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In any case, I'm updating the branch as I find a minute here
>>>>>>>>>>>>>> and there.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> B
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 5:02 PM Robert Bradshaw <
>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Neat.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nothing like writing an SDK to actually understand how the
>>>>>>>>>>>>>>> FnAPI works :). I like the use of groupBy. I have to admit I'm 
>>>>>>>>>>>>>>> a bit
>>>>>>>>>>>>>>> mystified by the syntax for parDo (I don't know swift at all 
>>>>>>>>>>>>>>> which is
>>>>>>>>>>>>>>> probably tripping me up). The addition of external 
>>>>>>>>>>>>>>> (cross-language)
>>>>>>>>>>>>>>> transforms could let you steal everything (e.g. IOs) pretty 
>>>>>>>>>>>>>>> quickly from
>>>>>>>>>>>>>>> other SDKs.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Aug 18, 2023 at 7:55 AM Byron Ellis via user <
>>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For everyone who is interested, here's the draft PR:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://github.com/apache/beam/pull/28062
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I haven't had a chance to test it on my M1 machine yet
>>>>>>>>>>>>>>>> though (there's a good chance there are a few places that need 
>>>>>>>>>>>>>>>> to properly
>>>>>>>>>>>>>>>> address endianness, specifically timestamps in windowed values and
>>>>>>>>>>>>>>>> lengths in iterable coders, as those both use big-endian
>>>>>>>>>>>>>>>> representations.)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Aug 17, 2023 at 8:57 PM Byron Ellis <
>>>>>>>>>>>>>>>> byronel...@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks Cham,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Definitely happy to open a draft PR so folks can
>>>>>>>>>>>>>>>>> comment---there's not as much code as it looks like since 
>>>>>>>>>>>>>>>>> most of the LOC
>>>>>>>>>>>>>>>>> is just generated protobuf. As for the support, I definitely 
>>>>>>>>>>>>>>>>> want to add
>>>>>>>>>>>>>>>>> external transforms and may actually add that support before 
>>>>>>>>>>>>>>>>> adding the
>>>>>>>>>>>>>>>>> ability to make composites in the language itself. With the 
>>>>>>>>>>>>>>>>> way the SDK is
>>>>>>>>>>>>>>>>> laid out, adding composites to the pipeline graph is a separate
>>>>>>>>>>>>>>>>> operation from defining a composite.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Aug 17, 2023 at 4:28 PM Chamikara Jayalath <
>>>>>>>>>>>>>>>>> chamik...@google.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks Byron. This sounds great. I wonder if there is
>>>>>>>>>>>>>>>>>> interest in the Swift SDK from folks currently subscribed to the
>>>>>>>>>>>>>>>>>> +user <user@beam.apache.org> list.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Aug 16, 2023 at 6:53 PM Byron Ellis via dev <
>>>>>>>>>>>>>>>>>> d...@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> A couple of months ago I decided that I wanted to really
>>>>>>>>>>>>>>>>>>> understand how the Beam FnApi works and how it interacts 
>>>>>>>>>>>>>>>>>>> with the Portable
>>>>>>>>>>>>>>>>>>> Runner. For me, at least, that usually means I need to write some
>>>>>>>>>>>>>>>>>>> code so I can see things happening in a debugger. To really prove to
>>>>>>>>>>>>>>>>>>> myself that I understood what was going on, I decided I couldn't use
>>>>>>>>>>>>>>>>>>> an existing SDK language to do it, since there would be the
>>>>>>>>>>>>>>>>>>> temptation to read some code and convince myself that I actually
>>>>>>>>>>>>>>>>>>> understood what was going on.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> One thing led to another and it turns out that to get a
>>>>>>>>>>>>>>>>>>> minimal FnApi integration going you end up writing a fair 
>>>>>>>>>>>>>>>>>>> bit of an SDK. So
>>>>>>>>>>>>>>>>>>> I decided to take things to a point where I had an SDK that 
>>>>>>>>>>>>>>>>>>> could execute a
>>>>>>>>>>>>>>>>>>> word count example via a portable runner backend. I've now 
>>>>>>>>>>>>>>>>>>> reached that
>>>>>>>>>>>>>>>>>>> point and would like to submit my prototype SDK to the list 
>>>>>>>>>>>>>>>>>>> for feedback.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It's currently living in a branch on my fork here:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://github.com/byronellis/beam/tree/swift-sdk/sdks/swift
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> At the moment it runs via the most recent Xcode beta
>>>>>>>>>>>>>>>>>>> using Swift 5.9 on Intel Macs, but should also work using
>>>>>>>>>>>>>>>>>>> beta builds of
>>>>>>>>>>>>>>>>>>> 5.9 for Linux running on Intel hardware. I haven't had a 
>>>>>>>>>>>>>>>>>>> chance to try it
>>>>>>>>>>>>>>>>>>> on ARM hardware and make sure all of the endian checks are 
>>>>>>>>>>>>>>>>>>> complete. The
>>>>>>>>>>>>>>>>>>> "IntegrationTests.swift" file contains a word count example 
>>>>>>>>>>>>>>>>>>> that reads some
>>>>>>>>>>>>>>>>>>> local files (as well as a missing file to exercise DLQ 
>>>>>>>>>>>>>>>>>>> functionality) and
>>>>>>>>>>>>>>>>>>> output counts through two separate group by operations to 
>>>>>>>>>>>>>>>>>>> get it past the
>>>>>>>>>>>>>>>>>>> "map reduce" size of pipeline. I've tested it against the 
>>>>>>>>>>>>>>>>>>> Python Portable
>>>>>>>>>>>>>>>>>>> Runner. Since my goal was to learn FnApi there is no Direct 
>>>>>>>>>>>>>>>>>>> Runner at this
>>>>>>>>>>>>>>>>>>> time.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I've already shown it to a couple of folks and
>>>>>>>>>>>>>>>>>>> incorporated some of that feedback (for example
>>>>>>>>>>>>>>>>>>> pardo was
>>>>>>>>>>>>>>>>>>> originally called dofn when defining pipelines). In general 
>>>>>>>>>>>>>>>>>>> I've tried to
>>>>>>>>>>>>>>>>>>> make the API as "Swift-y" as possible, hence the heavy 
>>>>>>>>>>>>>>>>>>> reliance on closures
>>>>>>>>>>>>>>>>>>> and while there aren't yet composite PTransforms there are
>>>>>>>>>>>>>>>>>>> the beginnings of
>>>>>>>>>>>>>>>>>>> what would be needed for a SwiftUI-like declarative API for 
>>>>>>>>>>>>>>>>>>> creating them.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> There are of course a ton of missing bits still to be
>>>>>>>>>>>>>>>>>>> implemented, like counters, metrics, windowing, state, 
>>>>>>>>>>>>>>>>>>> timers, etc.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This should be fine and we can get the code documented
>>>>>>>>>>>>>>>>>> without these features. I think support for composites and 
>>>>>>>>>>>>>>>>>> adding an
>>>>>>>>>>>>>>>>>> external transform (see Java
>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/External.java>,
>>>>>>>>>>>>>>>>>> Python
>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/python/apache_beam/transforms/external.py#L556>,
>>>>>>>>>>>>>>>>>> Go
>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/go/pkg/beam/xlang.go#L155>,
>>>>>>>>>>>>>>>>>> TypeScript
>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/external.ts>)
>>>>>>>>>>>>>>>>>> to add support for multi-lang will bring in a lot of 
>>>>>>>>>>>>>>>>>> features (for example,
>>>>>>>>>>>>>>>>>> I/O connectors) for free.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Any and all feedback welcome and happy to submit a PR if
>>>>>>>>>>>>>>>>>>> folks are interested, though the "Swift Way" would be to 
>>>>>>>>>>>>>>>>>>> have it in its own
>>>>>>>>>>>>>>>>>>> repo so that it can easily be used from the Swift Package 
>>>>>>>>>>>>>>>>>>> Manager.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +1 for creating a PR (maybe as a draft initially). Also
>>>>>>>>>>>>>>>>>> it'll be easier to comment on a PR :)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - Cham
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> B
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
