Thanks Cham, I wasn't up to speed on where xlang stood with respect to those transforms.
On Wed, Jul 13, 2022 at 9:32 PM Chamikara Jayalath <chamik...@google.com> wrote: > +1 and this is exactly what I suggested as well. Python Dataframe, > RunInference, Python Map are available via x-lang for Java already [1] and > all three need/use simple UDFs to customize operation. There is some logic > that needs to be added to use Python transforms from the Go SDK. As you > suggested there are many Java x-lang transforms that can use simple UDF > support as well. Either language combination should work to implement a > first proof of concept for WASI support while also addressing an existing > limitation. > > Thanks, > Cham > > [1] > https://github.com/apache/beam/tree/master/sdks/java/extensions/python/src/main/java/org/apache/beam/sdk/extensions/python/transforms > > On Wed, Jul 13, 2022 at 8:26 PM Kenneth Knowles <k...@apache.org> wrote: > >> I agree with Luke. Little helper UDFs that go along with IOs >> are actually a major feature gap for xlang - like timestamp extractors that >> have to parse particular data formats. This could be a very useful place to >> try out the design options. I think we can simplify the problem by >> insisting that they are pure functions that do not access state or side >> inputs. >> >> On Wed, Jul 13, 2022 at 7:52 PM Luke Cwik via dev <dev@beam.apache.org> >> wrote: >> >>> I think an easier target would be to support things like >>> DynamicDestinations for Java IO connectors that are exposed as XLang for >>> Go/Python. >>> >>> This is because Go/Python have good >>> transpiling support to WebAssembly and we have already exposed several Java IO >>> XLang connectors, so it's about plumbing one more thing through for >>> these IO connectors.
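To make the "little helper UDF" idea from this part of the thread concrete, here is a minimal sketch of a pure timestamp extractor of the kind that could be compiled to Wasm and handed to an IO connector. The function name, the JSON record layout, and the `event_time` field are all hypothetical and chosen only for illustration; nothing here is an existing Beam API.

```python
import json
from datetime import datetime, timezone


def extract_timestamp_millis(record: bytes) -> int:
    """Pure function: parses a record and returns its event time in epoch millis.

    Assumes (hypothetically) that each record is UTF-8 JSON with an
    `event_time` field formatted like 2022-07-13T19:52:00Z.
    It touches no state, side inputs, clock, or I/O.
    """
    payload = json.loads(record.decode("utf-8"))
    event_time = datetime.strptime(payload["event_time"], "%Y-%m-%dT%H:%M:%SZ")
    # The trailing "Z" means UTC, so attach the UTC timezone explicitly.
    event_time = event_time.replace(tzinfo=timezone.utc)
    return int(event_time.timestamp() * 1000)
```

Because the function depends only on its input bytes, the host only has to pass one buffer in and read one integer back, which is about the simplest interface a host/guest boundary can have.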
>>> >>> What interface should we expect for UDFs / UDAFs and should they be >>> purpose oriented or should we do something like we did for portability >>> where we have a graph of transforms that we feed arbitrary data in/out >>> from. The latter would have the benefit of allowing the runner to embed the >>> language execution directly within the runner and would pay the Wasm >>> communication tax instead of the gRPC communication tax. If we do the >>> former we still have the same issues where we have to be able to have a >>> type system to pass information between the host system and the transpiled >>> WebAssembly code that wraps the users UDF/UDAF and what if the UDF wants >>> access to side inputs or user state ... >>> >>> On Wed, Jul 13, 2022 at 4:09 PM Chamikara Jayalath <chamik...@google.com> >>> wrote: >>> >>>> >>>> >>>> On Wed, Jul 13, 2022 at 9:31 AM Luke Cwik <lc...@google.com> wrote: >>>> >>>>> First we'll want to choose whether we want to target Wasm, WASI or >>>>> Wagi. >>>>> >>>> >>>> These terms are defined here >>>> <https://www.fermyon.com/blog/wasm-wasi-wagi?gclid=CjwKCAjw2rmWBhB4EiwAiJ0mtVhiTuMZmy4bJSlk4nJj1deNX3KueomLgkG8JMyGeiHJ3FJRPpVn7BoCs58QAvD_BwE> >>>> if anybody is confused as I am :) >>>> >>>> >>>>> WASI adds a lot of simple things like access to a clock, random number >>>>> generator, ... that would expand the scope of what transpiled code can do. >>>>> It is debatable whether we'll want the power to run the transpiled code as >>>>> a microservice. Using UDFs for XLang and UDFs and UDAFs for SQL as our >>>>> expected use cases seem to make WASI the best choice. The issue is in the >>>>> details as there is a hodgepodge of what language runtimes support and >>>>> what >>>>> are the limits of transpiling from a language to WebAssembly. >>>>> >>>> >>>> Agree that WASI seems like a good target since it gives access to >>>> additional system resources/tooling. 
>>>>
>>>>>
>>>>> Assuming WASI then it breaks down to these two aspects:
>>>>> 1) Does the host language have a runtime?
>>>>> Java: https://github.com/wasmerio/wasmer-java
>>>>> Python: https://github.com/wasmerio/wasmer-python
>>>>> Go: https://github.com/wasmerio/wasmer-go
>>>>>
>>>>> 2) How good is compilation from source language to WebAssembly
>>>>> <https://github.com/appcypher/awesome-wasm-langs>?
>>>>> Java (very limited):
>>>>> Issues with garbage collection and the need to transpile/replace much
>>>>> of the VM's capabilities plus the large standard library that everyone
>>>>> uses causes a lot of challenges.
>>>>> JWebAssembly can do simple things like basic classes, strings, method
>>>>> calls. Should be able to compile trivial lambdas to Wasm. There are other
>>>>> choices but to my knowledge all are very limited.
>>>>>
>>>>
>>>> That's unfortunate. But hopefully Java support will be implemented soon?
>>>>
>>>>>
>>>>> Python <https://pythondev.readthedocs.io/wasm.html> (quite good):
>>>>>
>>>>> Features                   CPython Emscripten browser   CPython Emscripten node   Pyodide
>>>>> subprocess (fork, exec)    no                           no                        no
>>>>> threads                    no                           YES                       WIP
>>>>> file system                no (only MEMFS)              YES (Node raw FS)         YES (IDB, Node, …)
>>>>> shared extension modules   WIP                          WIP                       YES
>>>>> PyPI packages              no                           no                        YES
>>>>> sockets                    ?                            ?                         ?
>>>>> urllib, asyncio            no                           no                        WebAPI fetch / WebSocket
>>>>> signals                    no                           WIP                       YES
>>>>>
>>>>> Go (excellent): Native support in go compiler
>>>>>
>>>>
>>>> Great. Could executing Go UDFs in Python x-lang transforms (for
>>>> example, Dataframe, RunInference, Python Map) be a good first target?
>>>> >>>> Thanks, >>>> Cham >>>> >>>> >>>>> >>>>> On Tue, Jul 12, 2022 at 5:51 PM Chamikara Jayalath via dev < >>>>> dev@beam.apache.org> wrote: >>>>> >>>>>> >>>>>> >>>>>> On Wed, Jun 29, 2022 at 9:31 AM Luke Cwik <lc...@google.com> wrote: >>>>>> >>>>>>> I have had interest in integrating Wasm within Beam as well as I >>>>>>> have had a lot of interest in improving language portability. >>>>>>> >>>>>>> Wasm has a lot of benefits over using docker containers to provide a >>>>>>> place for code to execute. From experience implementing working on the >>>>>>> Beam's portability layer and internal Flume knowledge: >>>>>>> * encoding and decoding data is expensive, anything which ensures >>>>>>> that in-memory representations for data being transferred from the host >>>>>>> to >>>>>>> the guest and back without transcoding/re-interpreting will be a big >>>>>>> win. >>>>>>> * reducing the amount of times we need to pass data between guest >>>>>>> and host and back is important >>>>>>> * fusing transforms reduces the number of data passing points >>>>>>> * batching (row or columnar) data reduces the amount of times we >>>>>>> need to pass data at each data passing point >>>>>>> * there are enough complicated use cases (state & timers, large >>>>>>> iterables, side inputs) where handling the trivial map/flatmap usecase >>>>>>> will >>>>>>> provide little value since it will prevent fusion >>>>>>> >>>>>>> I have been meaning to work on a prototype where we replace the >>>>>>> current gRPC + docker path with one in which we use Wasm to execute a >>>>>>> fused >>>>>>> graph re-using large parts of the existing code base written to support >>>>>>> portability. >>>>>>> >>>>>> >>>>>> This sounds very interesting. Probably using Wasm to implement proper >>>>>> UDF support for x-lang (for example, executing Python timestamp/watermark >>>>>> functions provided through the Kafka Python x-lang wrapper on the Java >>>>>> Kafka transform) will be a good first target ? 
My main question for this >>>>>> at >>>>>> this point is whether Wasm has adequate support for existing SDKs that >>>>>> use >>>>>> x-lang to implement this in a useful way. >>>>>> >>>>>> Thanks, >>>>>> Cham >>>>>> >>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Jun 17, 2022 at 2:19 PM Brian Hulette <bhule...@google.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Re: Arrow - it's long been my dream to use Arrow for interchange in >>>>>>>> Beam [1]. I'm trying to move us in that direction with >>>>>>>> https://s.apache.org/batched-dofns (arrow is discussed briefly in >>>>>>>> the Future Work section). This gives the Python SDK a concept of >>>>>>>> batches of >>>>>>>> logical elements. My goal is Beam schemas + batches of logical >>>>>>>> elements -> >>>>>>>> Arrow RecordBatches. >>>>>>>> >>>>>>>> The Batched DoFn infrastructure is stable as of the 2.40.0 release >>>>>>>> cut and I'm currently working on adding what I'm calling a >>>>>>>> "BatchConverter" >>>>>>>> [2] for Beam Rows -> Arrow RecordBatch. Once that's done it could be >>>>>>>> interesting to experiment with a "WasmDoFn" that uses Arrow for >>>>>>>> interchange. >>>>>>>> >>>>>>>> Brian >>>>>>>> >>>>>>>> [1] >>>>>>>> https://docs.google.com/presentation/d/1D9vigwYTCuAuz_CO8nex3GK3h873acmQJE5Ui8TFsDY/edit#slide=id.g608e662464_0_160 >>>>>>>> [2] >>>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/batch.py >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Jun 16, 2022 at 10:55 AM Sean Jensen-Grey < >>>>>>>> jenseng...@google.com> wrote: >>>>>>>> >>>>>>>>> Interesting. >>>>>>>>> >>>>>>>>> Robert, I was just served an ad for Redpanda when I searched for >>>>>>>>> "golang wasm" :) >>>>>>>>> >>>>>>>>> The storage and execution grid systems are all embracing wasm in >>>>>>>>> some way. 
>>>>>>>>> >>>>>>>>> https://redpanda.com/ >>>>>>>>> https://www.fluvio.io/ >>>>>>>>> https://temporal.io/ (Cadence fork by the Cadence folks, I met >>>>>>>>> Maxim the lead at Temporal at the 2020 Wasm Summit) >>>>>>>>> https://github.com/pachyderm/pachyderm no mention of wasm, yet. >>>>>>>>> >>>>>>>>> Keep the Wasm+Beam demos coming. >>>>>>>>> >>>>>>>>> Sean >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Jun 16, 2022 at 4:23 AM Steven van Rossum < >>>>>>>>> sjvanros...@google.com> wrote: >>>>>>>>> >>>>>>>>>> I caught up with all the replies through the web interface, but I >>>>>>>>>> didn't have my list subscription set up correctly so my reply (TL;DR >>>>>>>>>> sample >>>>>>>>>> code available at https://github.com/sjvanrossum/beam-wasm) >>>>>>>>>> didn't come through until a bit later yesterday I think. >>>>>>>>>> >>>>>>>>>> Sean, I agree with your suggestion of Arrow as the interchange >>>>>>>>>> format for Wasm transforms and it's something I thought about >>>>>>>>>> exploring >>>>>>>>>> when I was adding serialization/deserialization of complex (meaning >>>>>>>>>> anything that's not an integer or float in the context of Wasm) data >>>>>>>>>> types >>>>>>>>>> in the demo. It's an unfortunate bit of overhead which could very >>>>>>>>>> well be >>>>>>>>>> solved with Arrow and shared memory between Wasm modules. >>>>>>>>>> I've seen Wasm transforms pop up in a few other places, notably >>>>>>>>>> in streaming data platforms like Fluvio and Redpanda and they seem >>>>>>>>>> to incur >>>>>>>>>> the same overhead when moving data into and out of the guest context >>>>>>>>>> so >>>>>>>>>> maybe it's negligible, but I haven't done any serious benchmark yet >>>>>>>>>> to >>>>>>>>>> validate that. 
>>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> >>>>>>>>>> Steve >>>>>>>>>> >>>>>>>>>> On Thu, Jun 16, 2022 at 3:04 AM Robert Burke <rob...@frantil.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Obligatory mention that WASM is basically an architecture that >>>>>>>>>>> any well meaning compiler can target, eg the Go compiler >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> https://www.bradcypert.com/an-introduction-to-targeting-web-assembly-with-golang/ >>>>>>>>>>> >>>>>>>>>>> (Among many articles for the last few years) >>>>>>>>>>> >>>>>>>>>>> Robert Burke >>>>>>>>>>> Beam Go Busybody >>>>>>>>>>> >>>>>>>>>>> On Wed, Jun 15, 2022, 2:04 PM Sean Jensen-Grey < >>>>>>>>>>> jenseng...@google.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Heh, my stage fright was so strong, I didn't realize that the >>>>>>>>>>>> talk was recorded. :) >>>>>>>>>>>> >>>>>>>>>>>> Steven, I'd love to chat about Wasm in Beam. This email is a >>>>>>>>>>>> bit rough. >>>>>>>>>>>> >>>>>>>>>>>> I haven't explored Wasm in Beam much since that talk. I think >>>>>>>>>>>> the most compelling use is in the portability of logic between data >>>>>>>>>>>> processing systems. Esp in the use of probabilistic data >>>>>>>>>>>> structures like >>>>>>>>>>>> Bloom Filters, Count-Min-Sketch, HyperLogLog, where it is nice to >>>>>>>>>>>> persist the data structure and use it on a different system. Like >>>>>>>>>>>> generating a bloom filter in Beam and using it inside of a BQ >>>>>>>>>>>> query w/o >>>>>>>>>>>> having to reimplement and test across many platforms. >>>>>>>>>>>> >>>>>>>>>>>> I have used Wasm in BQ, as BQ UDFs are driven by V8. Anywhere >>>>>>>>>>>> V8 exists, Wasm support exists for free unless the embedder goes >>>>>>>>>>>> out of >>>>>>>>>>>> their way to disable it. So it is supported in Deno/Node as well. >>>>>>>>>>>> In >>>>>>>>>>>> Python, Wasm support via Wasmtime >>>>>>>>>>>> <https://github.com/bytecodealliance/wasmtime> is really >>>>>>>>>>>> good. 
There are *many* options for execution environments, one of >>>>>>>>>>>> the >>>>>>>>>>>> downsides of passing through JS one is in string and number >>>>>>>>>>>> support(float/int64) issues, afaik. I could be wrong, maybe JS has >>>>>>>>>>>> fixed >>>>>>>>>>>> all this by now. >>>>>>>>>>>> >>>>>>>>>>>> The qualities in order of importance (for me) are >>>>>>>>>>>> >>>>>>>>>>>> 1. Portability, run the same code everywhere >>>>>>>>>>>> 2. Security, memory safety for the caller. Running Wasm >>>>>>>>>>>> inside of Python should never crash your Python interpreter. >>>>>>>>>>>> The capability >>>>>>>>>>>> model ensures that the Wasm module can only do what you allow >>>>>>>>>>>> it to >>>>>>>>>>>> 3. Performance (portable), compile once and run everywhere >>>>>>>>>>>> within some margin of native. Python makes this look good :) >>>>>>>>>>>> >>>>>>>>>>>> I think something worth exploring is moving opaque-ish Arrow >>>>>>>>>>>> objects around via Beam, so that Beam is now mostly in the control >>>>>>>>>>>> plane >>>>>>>>>>>> and computation happens in Wasm, this should reduce the >>>>>>>>>>>> serialization >>>>>>>>>>>> overhead and also get Python out of the datapath. >>>>>>>>>>>> >>>>>>>>>>>> I see someone exploring Wasm+Arrow here, >>>>>>>>>>>> https://github.com/domoritz/arrow-wasm >>>>>>>>>>>> >>>>>>>>>>>> Another possibly interesting avenue to explore is compiling >>>>>>>>>>>> command line programs to Wasi (WebAssembly System Interface), the >>>>>>>>>>>> POSIX >>>>>>>>>>>> like shim, so that they can be run inprocess without the >>>>>>>>>>>> fork/exec/pipe >>>>>>>>>>>> overhead of running a subprocess. A neat demo might be running >>>>>>>>>>>> something >>>>>>>>>>>> like Jq <https://stedolan.github.io/jq/> inside of a Beam job. 
>>>>>>>>>>>> >>>>>>>>>>>> Not to make Wasm sound like a Python only technology, it can be >>>>>>>>>>>> used via Java/JVM via >>>>>>>>>>>> >>>>>>>>>>>> - https://www.graalvm.org/22.1/reference-manual/wasm/ >>>>>>>>>>>> - https://github.com/kawamuray/wasmtime-java >>>>>>>>>>>> >>>>>>>>>>>> Sean >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Wed, Jun 15, 2022 at 9:35 AM Pablo Estrada < >>>>>>>>>>>> pabl...@google.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> adding Steven in case he didn't get the replies : ) >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Jun 15, 2022 at 9:29 AM Daniel Collins < >>>>>>>>>>>>> dpcoll...@google.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> If we ever do anything with the JS runtime, this would seem >>>>>>>>>>>>>> to be the best place to run WASM. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 8:13 PM Brian Hulette < >>>>>>>>>>>>>> bhule...@google.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> FYI: @Sean Jensen-Grey <jenseng...@google.com> gave a talk >>>>>>>>>>>>>>> back in 2020 where he had integrated Rust with the Python SDK. >>>>>>>>>>>>>>> I thought he >>>>>>>>>>>>>>> used WebAssembly for that, but it looks like he used some other >>>>>>>>>>>>>>> approaches, >>>>>>>>>>>>>>> and his talk mentioned WebAssembly as future work. Not sure if >>>>>>>>>>>>>>> that was >>>>>>>>>>>>>>> ever explored. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> https://www.youtube.com/watch?v=fZK_Tiu7q1o >>>>>>>>>>>>>>> https://github.com/seanjensengrey/beam-rust-python-java >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Brian >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 5:05 PM Ahmet Altay < >>>>>>>>>>>>>>> al...@google.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Adding @Lukasz Cwik <lc...@google.com> - he was interested >>>>>>>>>>>>>>>> in the WebAssembly topic. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 3:09 PM Pablo Estrada < >>>>>>>>>>>>>>>> pabl...@google.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Would you open a pull request for it? 
Or at least share a >>>>>>>>>>>>>>>>> branch? : ) >>>>>>>>>>>>>>>>> Even if we don't want to merge it, it would be great to >>>>>>>>>>>>>>>>> have a PR as a way to showcase the work, its usefulness, and >>>>>>>>>>>>>>>>> receive >>>>>>>>>>>>>>>>> comments on this thread once we can see something more >>>>>>>>>>>>>>>>> specific. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Tue, Jun 14, 2022 at 3:05 PM Steven van Rossum < >>>>>>>>>>>>>>>>> sjvanros...@google.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi folks, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I had some spare time yesterday and thought it'd be fun >>>>>>>>>>>>>>>>>> to implement a transform which runs WebAssembly modules as a >>>>>>>>>>>>>>>>>> lightweight >>>>>>>>>>>>>>>>>> way to implement cross language transforms for languages >>>>>>>>>>>>>>>>>> which don't (yet) >>>>>>>>>>>>>>>>>> have a SDK implementation. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I've got a small proof of concept running in the Python >>>>>>>>>>>>>>>>>> SDK as a DoFn with Wasmer as the WebAssembly runtime and >>>>>>>>>>>>>>>>>> simple support for >>>>>>>>>>>>>>>>>> marshalling between the host and guest environment with the >>>>>>>>>>>>>>>>>> RowCoder. The >>>>>>>>>>>>>>>>>> module I've constructed is mostly useless, but demonstrates >>>>>>>>>>>>>>>>>> the host >>>>>>>>>>>>>>>>>> copying the encoded element into the guest's memory, the >>>>>>>>>>>>>>>>>> guest copying >>>>>>>>>>>>>>>>>> those bytes elsewhere in its linear memory buffer, the guest >>>>>>>>>>>>>>>>>> calling back >>>>>>>>>>>>>>>>>> to the host with the offset and size and the host copying >>>>>>>>>>>>>>>>>> and decoding from >>>>>>>>>>>>>>>>>> the guest's memory. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Any thoughts/interest? I'm not sure where I was going >>>>>>>>>>>>>>>>>> with this, since it was mostly just a "wouldn't it be cool >>>>>>>>>>>>>>>>>> if..." on a >>>>>>>>>>>>>>>>>> Monday afternoon, but I can see a few use cases for this. 
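The host/guest round trip described above can be sketched without a real WebAssembly runtime. In the sketch below a `bytearray` stands in for the guest's linear memory and a plain Python class stands in for the module instance; `FakeGuest`, `alloc`, `process`, and the toy `encode`/`decode` pair are all hypothetical names for illustration. It does not use Wasmer or Beam's RowCoder, and is not the code from the proof of concept.

```python
import struct


class FakeGuest:
    """Stand-in for a Wasm module instance; `memory` plays the role of linear memory."""

    def __init__(self, size=1024):
        self.memory = bytearray(size)
        self._next_free = 0

    def alloc(self, n):
        # Bump allocator: hand out the next n bytes of linear memory.
        offset = self._next_free
        if offset + n > len(self.memory):
            raise MemoryError("guest linear memory exhausted")
        self._next_free += n
        return offset

    def process(self, in_offset, size, host_callback):
        # The guest copies the bytes elsewhere in its own memory, then calls
        # back to the host with the offset and size of the result.
        out_offset = self.alloc(size)
        self.memory[out_offset:out_offset + size] = self.memory[in_offset:in_offset + size]
        host_callback(out_offset, size)


def encode(element):
    # Toy encoding (not Beam's RowCoder): big-endian int64 id, then a UTF-8 name.
    ident, name = element
    return struct.pack(">q", ident) + name.encode("utf-8")


def decode(data):
    (ident,) = struct.unpack(">q", data[:8])
    return ident, data[8:].decode("utf-8")


def run_element(guest, element):
    encoded = encode(element)
    in_offset = guest.alloc(len(encoded))
    # Host copies the encoded element into the guest's memory.
    guest.memory[in_offset:in_offset + len(encoded)] = encoded
    results = []

    def on_result(offset, size):
        # Host copies the result out of the guest's memory and decodes it.
        results.append(decode(bytes(guest.memory[offset:offset + size])))

    guest.process(in_offset, len(encoded), on_result)
    return results[0]
```

Each element here costs one encode, two copies across the boundary, and one decode, which is the per-element overhead that later replies in the thread suggest reducing with batching or a shared Arrow representation.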
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Steve >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Steven van Rossum | Strategic Cloud Engineer | >>>>>>>>>>>>>>>>>> sjvanros...@google.com