On Wed, Sep 20, 2023 at 10:48 AM Danny McCormick <dannymccorm...@google.com>
wrote:

> > I think the process should be similar to other code/design reviews for
> > large contributions. I don't think you need PMC involvement here.
>
> I think it does require PMC involvement to create the actual repo once we
> have public consensus. I tried the flow at
> https://infra.apache.org/version-control.html#create but it seems like
> it's PMC-only. It's unclear to me whether consensus has been achieved;
> maybe a dedicated voting thread with implied lazy consensus would help here.
>

Yeah, it seems like a PMC member needs to create the repo.


>
> > Sure, we could definitely include things as a submodule for stuff like
> > testing multi-language, though I think there's actually a cleaner way:
> > just using the Swift package manager's test facilities to access the
> > Swift SDK repo.
>
> +1 on avoiding submodules. If needed we could also use multi-repo checkout
> with GitHub Actions. I think my biggest question is what we'd actually be
> enforcing though. In general, I'd expect the normal update flow to be
>
> 1) Update Beam protos and/or multi-lang components (though the set of
> things that needs to be updated for multi-lang is unclear to me)
>

Regarding multi-lang, the protocol does not require consistent versioning
but we may need testing to make sure things work consistently/correctly
when used from a released Swift SDK. For example, Python multi-lang
wrappers look for a Java version with the same version number as the Python
SDK being used.
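
For illustration (purely hypothetical; no such Swift API exists today), a
released Swift SDK's wrapper might resolve the Java expansion service at its
own stamped version, the way the Python wrappers do:

// Hypothetical: the version would be stamped into the SDK at release time.
let sdkVersion = "2.50.0"
let expansionJar =
    "https://repo1.maven.org/maven2/org/apache/beam/" +
    "beam-sdks-java-io-expansion-service/\(sdkVersion)/" +
    "beam-sdks-java-io-expansion-service-\(sdkVersion).jar"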


> 2) Mirror those changes to the Swift SDK.
>
> The thing that is most likely to be forgotten is the 2nd step, and that is
> hard to enforce with automation, since the automation would either be on the
> first step, which doesn't have anything to enforce, or on some sort of
> schedule in the Swift repo, which is less likely to be visible. I'm a
> little worried we wouldn't notice breakages until release time.
>
> I wonder how much stuff happens outside of the proto directory that needs
> to be mirrored. Could we just create scheduled automation to exactly copy
> changes in the proto directory and version changes for multi-lang stuff to
> the Swift SDK repo?
>
> ---------------------------------------------------------------------
>
> Regardless, I'm +1 on a dedicated repo; I'd rather we take on some
> organizational weirdness than push that pain to users.
>
> Thanks,
> Danny
>
> On Wed, Sep 20, 2023 at 1:38 PM Byron Ellis via user <user@beam.apache.org>
> wrote:
>
>> Sure, we could definitely include things as a submodule for stuff like
>> testing multi-language, though I think there's actually a cleaner way:
>> just using the Swift package manager's test facilities to access the
>> Swift SDK repo.
>>
>> That would also be consistent with the user-side experience and let us
>> test things like build-time integrations with multi-language (which is
>> possible in Swift through compiler plugins) in the same way as a pipeline
>> author would. You might also get backwards-compatibility testing as a side
>> effect in that case.
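>>
>> As a rough sketch (the repo URL and product name are hypothetical, since
>> beam-swift doesn't exist yet), a test package in the main Beam repo could
>> depend on the SDK repo directly through SwiftPM:
>>
>> // Package.swift of a hypothetical multi-language test harness.
>> // swift-tools-version:5.9
>> import PackageDescription
>>
>> let package = Package(
>>     name: "BeamMultiLangTests",
>>     dependencies: [
>>         // Repo URL and product name are assumptions, not settled names.
>>         .package(url: "https://github.com/apache/beam-swift.git", branch: "main")
>>     ],
>>     targets: [
>>         .testTarget(
>>             name: "MultiLangTests",
>>             dependencies: [.product(name: "ApacheBeam", package: "beam-swift")]
>>         )
>>     ]
>> )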
>>
>>
>>
>>
>>
>>
>> On Wed, Sep 20, 2023 at 10:20 AM Chamikara Jayalath <chamik...@google.com>
>> wrote:
>>
>>>
>>>
>>>
>>> On Wed, Sep 20, 2023 at 9:54 AM Byron Ellis <byronel...@google.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I've chatted with a couple of people offline about this and my
>>>> impression is that folks are generally amenable to a separate repo to match
>>>> the target community? I have no idea what the next steps would be though
>>>> other than guessing that there's probably some sort of PMC thing involved?
>>>> Should I write something up somewhere?
>>>>
>>>
>>> I think the process should be similar to other code/design reviews for
>>> large contributions. I don't think you need PMC involvement here.
>>>
>>>
>>>>
>>>> Best,
>>>> B
>>>>
>>>> On Thu, Sep 14, 2023 at 9:00 AM Byron Ellis <byronel...@google.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I've been on vacation, but mostly working on getting External
>>>>> Transform support going (which in turn basically requires Schema
>>>>> support as well). It also looks like macros landed in Swift 5.9 for
>>>>> Linux, so we'll be able to use those to do some compile-time
>>>>> automation. In particular, this lets us do something similar to what
>>>>> Java does with ByteBuddy for generating schema coders, though it has
>>>>> to happen ahead of time so it's not quite the same. (As far as I can
>>>>> tell this is a reason why macros got added to the language in the
>>>>> first place: Apple's SwiftData library makes heavy use of the feature.)
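>>>>>
>>>>> As a sketch of the idea (all names here are hypothetical; none of this
>>>>> exists in the prototype yet), an attached macro could derive a schema
>>>>> coder for a struct ahead of time:
>>>>>
>>>>> // Hypothetical @Schema macro declaration; BeamMacros/SchemaMacro and
>>>>> // the BeamSchema protocol are illustrative names only.
>>>>> @attached(extension, conformances: BeamSchema, names: named(schemaCoder))
>>>>> public macro Schema() = #externalMacro(module: "BeamMacros", type: "SchemaMacro")
>>>>>
>>>>> // Applying it to a user type...
>>>>> @Schema
>>>>> struct Purchase {
>>>>>     var item: String
>>>>>     var amount: Double
>>>>> }
>>>>>
>>>>> // ...would expand, ahead of time, into something like:
>>>>> // extension Purchase: BeamSchema {
>>>>> //     static var schemaCoder: SchemaCoder<Purchase> { ... }
>>>>> // }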
>>>>>
>>>>> I do have one question for the group though: should the Swift SDK
>>>>> distribution take on Beam community properties or Swift community
>>>>> properties? Specifically, in the Swift world the Swift SDK would live
>>>>> in its own repo (beam-swift, for example), which allows it to be most
>>>>> easily consumed and keeps the checkout size under control for users.
>>>>> "Releases" in the Swift world (much like Go) are just repo tags. The
>>>>> downside here is that there's overhead in setting up the various
>>>>> GitHub Actions and other CI/CD bits and bobs.
>>>>>
>>>>>
>>>
>>>>> The alternative would be to keep it in the Beam repo itself like it
>>>>> is now, but we'd probably want to move Package.swift to the root,
>>>>> since for whatever reason the Swift community (much to some people's
>>>>> annoyance) has chosen to only really support packages that live at
>>>>> the top of a repo. This has less overhead from a CI/CD perspective,
>>>>> but lots of overhead for users, as they'd be checking out the entire
>>>>> Beam repo to use the SDK, which happens a lot.
>>>>>
>>>>> There's a third option which is basically "do both" but honestly that
>>>>> just seems like the worst of both worlds as it would require constant
>>>>> syncing if we wanted to make it possible for Swift users to target
>>>>> unreleased SDKs for development and testing.
>>>>>
>>>>> Personally, I would lean towards the former option (and would
>>>>> volunteer to set up & document the various automations) as it is
>>>>> lighter for the actual users of the SDK and more consistent with the
>>>>> community experience they expect. The CI/CD stuff is mostly a "do it
>>>>> once" cost, whereas checking out the entire repo with many updates the
>>>>> user doesn't care about is something they will be doing all the time.
>>>>> FWIW some of our dependencies also chose this route, most notably
>>>>> gRPC, which started with the latter approach and has moved to the
>>>>> former.
>>>>>
>>>>
>>> I believe existing SDKs benefit from living in the same repo. For
>>> example, it's easier to keep them consistent with any model/proto changes
>>> and it's easier to manage distributions/tags. Also, it's easier to keep
>>> components consistent for multi-lang. If we add Swift to a separate repo,
>>> we'll probably have to add tooling/scripts to keep things consistent.
>>> Is it possible to create a separate repo, but also add a reference (and
>>> Gradle tasks) under "beam/sdks/swift" so that we can add Beam tests to
>>> make sure that things stay consistent?
>>>
>>> Thanks,
>>> Cham
>>>
>>>
>>>>
>>>>> Interested to hear any feedback on the subject since I'm guessing it
>>>>> probably came up with the Go SDK back in the day?
>>>>>
>>>>> Best,
>>>>> B
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Aug 29, 2023 at 7:59 AM Byron Ellis <byronel...@google.com>
>>>>> wrote:
>>>>>
>>>>>> After a couple of iterations (thanks rebo!) we've also gotten the
>>>>>> Swift SDK working with the new Prism runner. The fact that it doesn't
>>>>>> do fusion caught a couple of configuration bugs (e.g. that the gRPC
>>>>>> message receiver buffer should be fairly large). It would seem that at
>>>>>> the moment Prism and the Flink runner have similar levels of
>>>>>> strictness when interpreting the pipeline graph, while the Python
>>>>>> portable runner is far more forgiving.
>>>>>>
>>>>>> Also added support for bounded vs unbounded PCollections through the
>>>>>> "type" parameter when adding a pardo. Impulse is a bounded
>>>>>> PCollection, I believe?
>>>>>>
>>>>>> On Fri, Aug 25, 2023 at 2:04 PM Byron Ellis <byronel...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Okay, after a brief detour through "get this working in the Flink
>>>>>>> Portable Runner" I think I have something pretty workable.
>>>>>>>
>>>>>>> PInput and POutput can actually be structs rather than protocols,
>>>>>>> which simplifies things quite a bit. It also allows us to use them
>>>>>>> with property wrappers for a SwiftUI-like experience if we want when
>>>>>>> defining DoFns (which is what I was originally intending to use them
>>>>>>> for). That also means the function signature you use for closures
>>>>>>> would match full-fledged DoFn definitions for the most part, which is
>>>>>>> satisfying.
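>>>>>>>
>>>>>>> Roughly, the struct shapes look like this (a sketch; field and method
>>>>>>> names are guesses rather than the branch's exact API):
>>>>>>>
>>>>>>> import Foundation
>>>>>>>
>>>>>>> struct Window {}  // stand-in for the SDK's window type
>>>>>>>
>>>>>>> // The (value, timestamp, window) tuple with friendlier names.
>>>>>>> struct PInput<Of> {
>>>>>>>     let value: Of
>>>>>>>     let timestamp: Date
>>>>>>>     let window: Window
>>>>>>> }
>>>>>>>
>>>>>>> // Outputs default to the input element's timestamp and window.
>>>>>>> struct POutput<Of> {
>>>>>>>     let defaults: (Date, Window)
>>>>>>>     let sink: (Of, Date, Window) -> Void
>>>>>>>     func emit(_ value: Of, timestamp: Date? = nil, window: Window? = nil) {
>>>>>>>         sink(value, timestamp ?? defaults.0, window ?? defaults.1)
>>>>>>>     }
>>>>>>> }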
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 24, 2023 at 5:55 PM Byron Ellis <byronel...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Okay, I tried a couple of different things.
>>>>>>>>
>>>>>>>> Implicitly passing the timestamp and window during iteration did
>>>>>>>> not go well. While physically possible, it introduces an invisible
>>>>>>>> side effect into loop iteration that confused me when I tried to
>>>>>>>> use it, and I'm the one who implemented it. Also, I'm pretty sure
>>>>>>>> there'd end up being some sort of race condition nightmare
>>>>>>>> continuing down that path.
>>>>>>>>
>>>>>>>> What I decided to do instead was the following:
>>>>>>>>
>>>>>>>> 1. Rename the existing "pardo" functions to "pstream" and require
>>>>>>>> that they always emit a window and timestamp along with their value
>>>>>>>> (see the sketch below). This eliminates the side effect but lets us
>>>>>>>> keep iteration in a bundle where that might be convenient. For
>>>>>>>> example, in my cheesy GCS implementation it means that I can keep an
>>>>>>>> OAuth token around for the lifetime of the bundle as a local
>>>>>>>> variable, which is convenient. It's a bit more typing for users of
>>>>>>>> pstream, but the expectation here is that if you're using pstream
>>>>>>>> functions You Know What You Are Doing and most people won't be using
>>>>>>>> it directly.
>>>>>>>>
>>>>>>>> 2. Introduce a new set of pardo functions (I didn't do all of them
>>>>>>>> yet, but enough to test the functionality and decide I liked it)
>>>>>>>> which take a function signature of (any PInput<InputType>, any
>>>>>>>> POutput<OutputType>). PInput takes the (InputType,Date,Window) tuple
>>>>>>>> and converts it into a struct with friendlier names. Not strictly
>>>>>>>> necessary, but it makes the code nicer to read, I think. POutput
>>>>>>>> introduces emit functions that optionally allow you to specify a
>>>>>>>> timestamp and a window. If you don't specify either one, it will
>>>>>>>> take the timestamp and/or window of the input.
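>>>>>>>>
>>>>>>>> To make the contrast concrete, here's a rough sketch of the two
>>>>>>>> forms (helpers like fetchOAuthToken and readObject are made up for
>>>>>>>> illustration):
>>>>>>>>
>>>>>>>> // Stream-wise pstream: explicit timestamp and window on every emit,
>>>>>>>> // and bundle-lifetime state (the token) as a plain local variable.
>>>>>>>> pstream(name: "Read GCS") { paths, output in
>>>>>>>>     let token = try await fetchOAuthToken()   // hypothetical helper
>>>>>>>>     for try await (path, timestamp, window) in paths {
>>>>>>>>         let data = try await readObject(path, auth: token)  // hypothetical
>>>>>>>>         output.emit(data, timestamp: timestamp, window: window)
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> // Element-wise pardo: emit() inherits the input's timestamp and
>>>>>>>> // window unless you override them.
>>>>>>>> pardo(name: "Split") { (input: any PInput<String>, output: any POutput<String>) in
>>>>>>>>     for word in input.value.split(separator: " ") {
>>>>>>>>         output.emit(String(word))
>>>>>>>>     }
>>>>>>>> }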
>>>>>>>>
>>>>>>>> Trying that out was pretty pleasant, so I think we should continue
>>>>>>>> down that path. If you'd like to see it in use, I reimplemented
>>>>>>>> map() and flatMap() in terms of this new pardo functionality.
>>>>>>>>
>>>>>>>> Code has been pushed to the branch/PR if you're interested in
>>>>>>>> taking a look.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Aug 24, 2023 at 2:15 PM Byron Ellis <byronel...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Gotcha, I think there's a fairly easy solution to link input and
>>>>>>>>> output streams... Let me try it out... it might even be possible
>>>>>>>>> to have both element- and stream-wise closure pardos. It's
>>>>>>>>> definitely possible to have that at the DoFn level (called
>>>>>>>>> SerializableFn in the SDK because I want to use @DoFn as a macro).
>>>>>>>>>
>>>>>>>>> On Thu, Aug 24, 2023 at 1:09 PM Robert Bradshaw <
>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Thu, Aug 24, 2023 at 12:58 PM Chamikara Jayalath <
>>>>>>>>>> chamik...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 24, 2023 at 12:27 PM Robert Bradshaw <
>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I would like to figure out a way to get the stream-y interface
>>>>>>>>>>>> to work, as I think it's more natural overall.
>>>>>>>>>>>>
>>>>>>>>>>>> One hypothesis is that if any elements are carried over loop
>>>>>>>>>>>> iterations, there will likely be some that are carried over
>>>>>>>>>>>> beyond the loop (after all, the callee doesn't know when the
>>>>>>>>>>>> loop is supposed to end). We could reject "plain" elements that
>>>>>>>>>>>> are emitted after this point, requiring one to emit
>>>>>>>>>>>> timestamp-windowed values.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Are you assuming that the same stream (or overlapping sets of
>>>>>>>>>>> data) is pushed to multiple workers? I thought that the set of
>>>>>>>>>>> data streamed here is the data that belongs to the current bundle
>>>>>>>>>>> (hence already assigned to the current worker), so any output
>>>>>>>>>>> from the current bundle invocation would be a valid output of
>>>>>>>>>>> that bundle.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>> Yes, the content of the stream is exactly the contents of the
>>>>>>>>>> bundle. The question is how to do the input_element:output_element
>>>>>>>>>> correlation for automatically propagating metadata.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> Related to this, we could enforce that the only
>>>>>>>>>>>> (user-accessible) way to get such a timestamped value is to
>>>>>>>>>>>> start with one, e.g. a WindowedValue<T>.withValue(O) produces a
>>>>>>>>>>>> WindowedValue<O> with the same metadata but a new value. Thus a
>>>>>>>>>>>> user wanting to do anything "fancy" would have to explicitly
>>>>>>>>>>>> request iteration over these windowed values rather than over
>>>>>>>>>>>> the raw elements. (This is also forward compatible with
>>>>>>>>>>>> expanding the metadata that can get attached, e.g. pane infos,
>>>>>>>>>>>> and makes the right thing the easiest/most natural.)
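>>>>>>>>>>>>
>>>>>>>>>>>> A minimal sketch of that constraint (type and member names are
>>>>>>>>>>>> assumed, not an actual Beam API):
>>>>>>>>>>>>
>>>>>>>>>>>> import Foundation
>>>>>>>>>>>>
>>>>>>>>>>>> struct Window {}    // minimal stand-ins for the SDK's real
>>>>>>>>>>>> struct PaneInfo {}  // window and pane metadata types
>>>>>>>>>>>>
>>>>>>>>>>>> struct WindowedValue<T> {
>>>>>>>>>>>>     let value: T
>>>>>>>>>>>>     let timestamp: Date
>>>>>>>>>>>>     let window: Window
>>>>>>>>>>>>     let pane: PaneInfo
>>>>>>>>>>>>
>>>>>>>>>>>>     // Same metadata, new value; with no public initializer,
>>>>>>>>>>>>     // this is the only way users can mint a windowed output.
>>>>>>>>>>>>     func withValue<O>(_ newValue: O) -> WindowedValue<O> {
>>>>>>>>>>>>         WindowedValue<O>(value: newValue, timestamp: timestamp,
>>>>>>>>>>>>                          window: window, pane: pane)
>>>>>>>>>>>>     }
>>>>>>>>>>>> }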
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 24, 2023 at 12:10 PM Byron Ellis <
>>>>>>>>>>>> byronel...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ah, that is a good point: being element-wise would make
>>>>>>>>>>>>> managing windows and timestamps easier for the user.
>>>>>>>>>>>>> Fortunately it's a fairly easy change to make, and maybe even
>>>>>>>>>>>>> less typing for the user. I was originally thinking side inputs
>>>>>>>>>>>>> and metrics would happen outside the loop, but I think you want
>>>>>>>>>>>>> a class and not a closure at that point, for sanity.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 24, 2023 at 12:02 PM Robert Bradshaw <
>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ah, I see.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yeah, I've thought about using an iterable for the whole
>>>>>>>>>>>>>> bundle rather than start/finish bundle callbacks, but one of
>>>>>>>>>>>>>> the questions is how that would impact implicit passing of the
>>>>>>>>>>>>>> timestamp (and other) metadata from input elements to output
>>>>>>>>>>>>>> elements. You can of course attach the metadata to any output
>>>>>>>>>>>>>> that happens in the loop body, but it's very easy to
>>>>>>>>>>>>>> implicitly break the 1:1 relationship here (e.g. by doing
>>>>>>>>>>>>>> buffering or otherwise modifying local state) and this would
>>>>>>>>>>>>>> be hard to detect. (I suppose trying to output after the loop
>>>>>>>>>>>>>> finishes could require something more explicit.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 6:56 PM Byron Ellis <
>>>>>>>>>>>>>> byronel...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Oh, I also forgot to mention that I included element-wise
>>>>>>>>>>>>>>> collection operations like "map" that eliminate the need for
>>>>>>>>>>>>>>> pardo in many cases. The groupBy command is actually a map +
>>>>>>>>>>>>>>> groupByKey under the hood. That was to be more consistent
>>>>>>>>>>>>>>> with Swift's collection protocol (and is also why PCollection
>>>>>>>>>>>>>>> and PCollectionStream are different types... PCollection
>>>>>>>>>>>>>>> implements map and friends as pipeline construction
>>>>>>>>>>>>>>> operations, whereas PCollectionStream is an actual stream).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I just happened to push some "IO primitives" that use map
>>>>>>>>>>>>>>> rather than pardo in a couple of places to do a true
>>>>>>>>>>>>>>> wordcount using good ol' Shakespeare and very, very
>>>>>>>>>>>>>>> primitive GCS IO.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> B
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 6:08 PM Byron Ellis <
>>>>>>>>>>>>>>> byronel...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Indeed :-) Yeah, I went back and forth on the pardo syntax
>>>>>>>>>>>>>>>> quite a bit before settling on where I ended up. Ultimately
>>>>>>>>>>>>>>>> I decided to go with something that felt more Swift-y than
>>>>>>>>>>>>>>>> anything else, which means that rather than dealing with a
>>>>>>>>>>>>>>>> single element like you do in the other SDKs, you're dealing
>>>>>>>>>>>>>>>> with a stream of elements (which of course will often be of
>>>>>>>>>>>>>>>> size 1). That's a really natural paradigm in the Swift
>>>>>>>>>>>>>>>> world, especially with the async / await structures. So when
>>>>>>>>>>>>>>>> you see something like:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> pardo(name: "Read Files") { filenames, output, errors in
>>>>>>>>>>>>>>>>   for try await (filename, _, _) in filenames {
>>>>>>>>>>>>>>>>     ...
>>>>>>>>>>>>>>>>     output.emit(data)
>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> filenames is the input stream, and then output and errors
>>>>>>>>>>>>>>>> are both output streams. In theory you can have as many
>>>>>>>>>>>>>>>> output streams as you like, though at the moment there's a
>>>>>>>>>>>>>>>> compiler bug in the new type pack feature that limits it to
>>>>>>>>>>>>>>>> "as many as I felt like supporting". (Presumably this will
>>>>>>>>>>>>>>>> get fixed before the official 5.9 release, which will
>>>>>>>>>>>>>>>> probably be in the October timeframe if history is any
>>>>>>>>>>>>>>>> guide.)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If you had parameterization you wanted to send, that would
>>>>>>>>>>>>>>>> look like pardo("Parameter") { param,filenames,output,error
>>>>>>>>>>>>>>>> in ... } where "param" would take on the value of
>>>>>>>>>>>>>>>> "Parameter". All of this is being typechecked at compile
>>>>>>>>>>>>>>>> time, BTW.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The (filename,_,_) is tuple destructuring like you have in
>>>>>>>>>>>>>>>> ES6 and other languages, where "_" is Swift for "ignore". In
>>>>>>>>>>>>>>>> this case PCollectionStreams have an element signature of
>>>>>>>>>>>>>>>> (Of,Date,Window), so you can optionally extract the
>>>>>>>>>>>>>>>> timestamp and the window if you want to manipulate them
>>>>>>>>>>>>>>>> somehow.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That said, it would also be natural to provide element-wise
>>>>>>>>>>>>>>>> pardos; that would probably mean having explicit type
>>>>>>>>>>>>>>>> signatures in the closure. I had that at one point, but it
>>>>>>>>>>>>>>>> felt less natural the more I used it. I'm also slowly
>>>>>>>>>>>>>>>> working towards a more "traditional" DoFn implementation
>>>>>>>>>>>>>>>> approach where you implement the DoFn as an object type. In
>>>>>>>>>>>>>>>> that case it would be very easy to support both by having a
>>>>>>>>>>>>>>>> default stream implementation call the equivalent of
>>>>>>>>>>>>>>>> processElement. To make that performant I need to implement
>>>>>>>>>>>>>>>> an @DoFn macro, and I just haven't gotten to it yet.
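>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Something like this, presumably (entirely hypothetical,
>>>>>>>>>>>>>>>> since the @DoFn macro doesn't exist yet):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> // Object-style DoFn sketch: a default stream implementation
>>>>>>>>>>>>>>>> // would drive processElement once per element.
>>>>>>>>>>>>>>>> @DoFn
>>>>>>>>>>>>>>>> final class ExtractWords: SerializableFn {
>>>>>>>>>>>>>>>>     func processElement(_ input: PInput<String>,
>>>>>>>>>>>>>>>>                         _ output: POutput<String>) throws {
>>>>>>>>>>>>>>>>         for word in input.value.split(separator: " ") {
>>>>>>>>>>>>>>>>             output.emit(String(word))
>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>> }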
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's a bit more work, and I've been prioritizing
>>>>>>>>>>>>>>>> implementing composite and external transforms for the
>>>>>>>>>>>>>>>> reasons you suggest. :-) I've got the basics of a composite
>>>>>>>>>>>>>>>> transform (there's an equivalent wordcount example) and am
>>>>>>>>>>>>>>>> hooking it into the pipeline generation, which should also
>>>>>>>>>>>>>>>> give me everything I need to successfully hook in external
>>>>>>>>>>>>>>>> transforms as well. That will give me the jump on IOs, as
>>>>>>>>>>>>>>>> you say. I can also treat the pipeline itself as a composite
>>>>>>>>>>>>>>>> transform, which lets me get rid of the Pipeline { pipeline
>>>>>>>>>>>>>>>> in ... } and instead have things attach themselves to the
>>>>>>>>>>>>>>>> pipeline implicitly.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That said, there are some interesting IO possibilities that
>>>>>>>>>>>>>>>> would be Swift native. In particular, I've been looking at
>>>>>>>>>>>>>>>> the native Swift binding for DuckDB (which is C++ based).
>>>>>>>>>>>>>>>> DuckDB is SQL based but not distributed in the same way as,
>>>>>>>>>>>>>>>> say, Beam SQL... but it would allow for SQL statements on
>>>>>>>>>>>>>>>> individual files with projection pushdown supported for
>>>>>>>>>>>>>>>> things like Parquet, which could have some cool and
>>>>>>>>>>>>>>>> performant data lake applications. I'll probably do a couple
>>>>>>>>>>>>>>>> of the simpler IOs as well: there's a Swift AWS SDK binding
>>>>>>>>>>>>>>>> that's pretty good that would give me S3, and there's a
>>>>>>>>>>>>>>>> Cloud auth library as well that makes it pretty easy to work
>>>>>>>>>>>>>>>> with GCS.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In any case, I'm updating the branch as I find a minute
>>>>>>>>>>>>>>>> here and there.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> B
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Aug 23, 2023 at 5:02 PM Robert Bradshaw <
>>>>>>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Neat.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Nothing like writing an SDK to actually understand how the
>>>>>>>>>>>>>>>>> FnAPI works :). I like the use of groupBy. I have to admit
>>>>>>>>>>>>>>>>> I'm a bit mystified by the syntax for parDo (I don't know
>>>>>>>>>>>>>>>>> Swift at all, which is probably tripping me up). The
>>>>>>>>>>>>>>>>> addition of external (cross-language) transforms could let
>>>>>>>>>>>>>>>>> you steal everything (e.g. IOs) pretty quickly from other
>>>>>>>>>>>>>>>>> SDKs.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Aug 18, 2023 at 7:55 AM Byron Ellis via user <
>>>>>>>>>>>>>>>>> user@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For everyone who is interested, here's the draft PR:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://github.com/apache/beam/pull/28062
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I haven't had a chance to test it on my M1 machine yet,
>>>>>>>>>>>>>>>>>> though. (There's a good chance there are a few places that
>>>>>>>>>>>>>>>>>> need to properly address endianness; specifically,
>>>>>>>>>>>>>>>>>> timestamps in windowed values and lengths in iterable
>>>>>>>>>>>>>>>>>> coders, as those both use big-endian representations.)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Aug 17, 2023 at 8:57 PM Byron Ellis <
>>>>>>>>>>>>>>>>>> byronel...@google.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks Cham,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Definitely happy to open a draft PR so folks can
>>>>>>>>>>>>>>>>>>> comment; there's not as much code as it looks like, since
>>>>>>>>>>>>>>>>>>> most of the LOC is just generated protobuf. As for the
>>>>>>>>>>>>>>>>>>> support, I definitely want to add external transforms and
>>>>>>>>>>>>>>>>>>> may actually add that support before adding the ability
>>>>>>>>>>>>>>>>>>> to make composites in the language itself. With the way
>>>>>>>>>>>>>>>>>>> the SDK is laid out, adding composites to the pipeline
>>>>>>>>>>>>>>>>>>> graph is a separate operation from defining a composite.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Aug 17, 2023 at 4:28 PM Chamikara Jayalath <
>>>>>>>>>>>>>>>>>>> chamik...@google.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks Byron. This sounds great. I wonder if there is
>>>>>>>>>>>>>>>>>>>> interest in a Swift SDK from folks currently subscribed
>>>>>>>>>>>>>>>>>>>> to the +user <user@beam.apache.org> list.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Aug 16, 2023 at 6:53 PM Byron Ellis via dev <
>>>>>>>>>>>>>>>>>>>> d...@beam.apache.org> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hello everyone,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> A couple of months ago I decided that I wanted to
>>>>>>>>>>>>>>>>>>>>> really understand how the Beam FnApi works and how it
>>>>>>>>>>>>>>>>>>>>> interacts with the Portable Runner. For me, at least,
>>>>>>>>>>>>>>>>>>>>> that usually means I need to write some code so I can
>>>>>>>>>>>>>>>>>>>>> see things happening in a debugger. To really prove to
>>>>>>>>>>>>>>>>>>>>> myself that I understood what was going on, I decided I
>>>>>>>>>>>>>>>>>>>>> couldn't use an existing SDK language to do it, since
>>>>>>>>>>>>>>>>>>>>> there would be the temptation to read some code and
>>>>>>>>>>>>>>>>>>>>> convince myself that I actually understood what was
>>>>>>>>>>>>>>>>>>>>> going on.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> One thing led to another, and it turns out that to get
>>>>>>>>>>>>>>>>>>>>> a minimal FnApi integration going you end up writing a
>>>>>>>>>>>>>>>>>>>>> fair bit of an SDK. So I decided to take things to a
>>>>>>>>>>>>>>>>>>>>> point where I had an SDK that could execute a word
>>>>>>>>>>>>>>>>>>>>> count example via a portable runner backend. I've now
>>>>>>>>>>>>>>>>>>>>> reached that point and would like to submit my
>>>>>>>>>>>>>>>>>>>>> prototype SDK to the list for feedback.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> It's currently living in a branch on my fork here:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> https://github.com/byronellis/beam/tree/swift-sdk/sdks/swift
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> At the moment it runs via the most recent Xcode beta
>>>>>>>>>>>>>>>>>>>>> using Swift 5.9 on Intel Macs, but should also work
>>>>>>>>>>>>>>>>>>>>> using beta builds of 5.9 for Linux running on Intel
>>>>>>>>>>>>>>>>>>>>> hardware. I haven't had a chance to try it on ARM
>>>>>>>>>>>>>>>>>>>>> hardware and make sure all of the endian checks are
>>>>>>>>>>>>>>>>>>>>> complete. The "IntegrationTests.swift" file contains a
>>>>>>>>>>>>>>>>>>>>> word count example that reads some local files (as well
>>>>>>>>>>>>>>>>>>>>> as a missing file, to exercise DLQ functionality) and
>>>>>>>>>>>>>>>>>>>>> outputs counts through two separate group-by operations
>>>>>>>>>>>>>>>>>>>>> to get it past the "map reduce" size of pipeline. I've
>>>>>>>>>>>>>>>>>>>>> tested it against the Python Portable Runner. Since my
>>>>>>>>>>>>>>>>>>>>> goal was to learn FnApi, there is no Direct Runner at
>>>>>>>>>>>>>>>>>>>>> this time.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I've shown it to a couple of folks already and have
>>>>>>>>>>>>>>>>>>>>> incorporated some of that feedback (for example, pardo
>>>>>>>>>>>>>>>>>>>>> was originally called dofn when defining pipelines). In
>>>>>>>>>>>>>>>>>>>>> general I've tried to make the API as "Swift-y" as
>>>>>>>>>>>>>>>>>>>>> possible, hence the heavy reliance on closures. While
>>>>>>>>>>>>>>>>>>>>> there aren't yet composite PTransforms, there are the
>>>>>>>>>>>>>>>>>>>>> beginnings of what would be needed for a SwiftUI-like
>>>>>>>>>>>>>>>>>>>>> declarative API for creating them.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> There are of course a ton of missing bits still to be
>>>>>>>>>>>>>>>>>>>>> implemented, like counters, metrics, windowing, state, 
>>>>>>>>>>>>>>>>>>>>> timers, etc.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This should be fine, and we can get the code documented
>>>>>>>>>>>>>>>>>>>> without these features. I think supporting composites
>>>>>>>>>>>>>>>>>>>> and adding an external transform (see Java
>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/External.java>,
>>>>>>>>>>>>>>>>>>>> Python
>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/python/apache_beam/transforms/external.py#L556>,
>>>>>>>>>>>>>>>>>>>> Go
>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/c7b7921185686da573f76ce7320817c32375c7d0/sdks/go/pkg/beam/xlang.go#L155>,
>>>>>>>>>>>>>>>>>>>> TypeScript
>>>>>>>>>>>>>>>>>>>> <https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/transforms/external.ts>)
>>>>>>>>>>>>>>>>>>>> to support multi-lang will bring in a lot of features
>>>>>>>>>>>>>>>>>>>> (for example, I/O connectors) for free.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Any and all feedback is welcome, and I'm happy to
>>>>>>>>>>>>>>>>>>>>> submit a PR if folks are interested, though the "Swift
>>>>>>>>>>>>>>>>>>>>> way" would be to have it in its own repo so that it can
>>>>>>>>>>>>>>>>>>>>> easily be used from the Swift Package Manager.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> +1 for creating a PR (maybe as a draft initially).
>>>>>>>>>>>>>>>>>>>> Also, it'll be easier to comment on a PR :)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> - Cham
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>> B
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
