On Wed, Jun 29, 2022 at 9:31 AM Luke Cwik <lc...@google.com> wrote:

> I have had interest in integrating Wasm within Beam, as well as a lot of
> interest in improving language portability.
>
> Wasm has a lot of benefits over using docker containers to provide a place
> for code to execute. From experience working on Beam's portability layer
> and internal Flume knowledge:
> * encoding and decoding data is expensive; anything which ensures that
> in-memory representations of data can be transferred from the host to the
> guest and back without transcoding/re-interpreting will be a big win.
> * reducing the number of times we need to pass data between guest and host
> and back is important
> * fusing transforms reduces the number of data passing points
> * batching (row or columnar) data reduces the number of times we need to
> pass data at each data passing point
> * there are enough complicated use cases (state & timers, large iterables,
> side inputs) that handling only the trivial map/flatmap use case will
> provide little value, since it will prevent fusion
>
> I have been meaning to work on a prototype where we replace the current
> gRPC + docker path with one in which we use Wasm to execute a fused graph,
> re-using large parts of the existing code base written to support
> portability.
>
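As a rough illustration of the batching point in the list above, the sketch below counts how often data crosses the host/guest boundary with and without batching. It is a pure-Python simulation, not Beam or Wasm code: `FakeGuest`, `run_batched`, and the fixed-width int64 encoding are all hypothetical names and choices made for this example.

```python
import struct
from typing import Iterable, List


class FakeGuest:
    """Hypothetical stand-in for a Wasm guest; counts boundary crossings."""

    def __init__(self) -> None:
        self.calls = 0

    def process(self, payload: bytes) -> bytes:
        # One call models one host -> guest -> host round trip.
        self.calls += 1
        count = len(payload) // 8
        values = struct.unpack(f"<{count}q", payload)
        # The "user function" here just doubles every int64 in the batch.
        return struct.pack(f"<{count}q", *(v * 2 for v in values))


def run_batched(guest: FakeGuest, elements: Iterable[int], batch_size: int) -> List[int]:
    """Send elements to the guest in batches of at most batch_size."""
    out: List[int] = []
    batch: List[int] = []

    def flush() -> None:
        if batch:
            reply = guest.process(struct.pack(f"<{len(batch)}q", *batch))
            out.extend(struct.unpack(f"<{len(batch)}q", reply))
            batch.clear()

    for element in elements:
        batch.append(element)
        if len(batch) == batch_size:
            flush()
    flush()  # send any trailing partial batch
    return out


per_element, batched = FakeGuest(), FakeGuest()
same = run_batched(per_element, range(10), 1) == run_batched(batched, range(10), 4)
print(same, per_element.calls, batched.calls)  # True 10 3
```

The results are identical, but the batched run crosses the boundary 3 times instead of 10. A real implementation would also need to bound batch size by memory and latency, which this sketch ignores.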
This sounds very interesting. Using Wasm to implement proper UDF support for
x-lang (for example, executing Python timestamp/watermark functions provided
through the Kafka Python x-lang wrapper on the Java Kafka transform) would
probably be a good first target?

My main question at this point is whether Wasm has adequate support for the
existing SDKs that use x-lang to implement this in a useful way.

Thanks,
Cham

>
>
> On Fri, Jun 17, 2022 at 2:19 PM Brian Hulette <bhule...@google.com> wrote:
>
>> Re: Arrow - it's long been my dream to use Arrow for interchange in Beam
>> [1]. I'm trying to move us in that direction with
>> https://s.apache.org/batched-dofns (arrow is discussed briefly in the
>> Future Work section). This gives the Python SDK a concept of batches of
>> logical elements. My goal is Beam schemas + batches of logical elements ->
>> Arrow RecordBatches.
>>
>> The Batched DoFn infrastructure is stable as of the 2.40.0 release cut
>> and I'm currently working on adding what I'm calling a "BatchConverter" [2]
>> for Beam Rows -> Arrow RecordBatch. Once that's done it could be
>> interesting to experiment with a "WasmDoFn" that uses Arrow for interchange.
>>
>> Brian
>>
>> [1]
>> https://docs.google.com/presentation/d/1D9vigwYTCuAuz_CO8nex3GK3h873acmQJE5Ui8TFsDY/edit#slide=id.g608e662464_0_160
>> [2]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/batch.py
>>
>>
>> On Thu, Jun 16, 2022 at 10:55 AM Sean Jensen-Grey <jenseng...@google.com>
>> wrote:
>>
>>> Interesting.
>>>
>>> Robert, I was just served an ad for Redpanda when I searched for "golang
>>> wasm" :)
>>>
>>> The storage and execution grid systems are all embracing wasm in some
>>> way.
>>>
>>> https://redpanda.com/
>>> https://www.fluvio.io/
>>> https://temporal.io/ (Cadence fork by the Cadence folks, I met Maxim
>>> the lead at Temporal at the 2020 Wasm Summit)
>>> https://github.com/pachyderm/pachyderm no mention of wasm, yet.
>>>
>>> Keep the Wasm+Beam demos coming.
>>>
>>> Sean
>>>
>>>
>>>
>>> On Thu, Jun 16, 2022 at 4:23 AM Steven van Rossum <sjvanros...@google.com>
>>> wrote:
>>>
>>>> I caught up with all the replies through the web interface, but I
>>>> didn't have my list subscription set up correctly, so my reply (TL;DR:
>>>> sample code available at https://github.com/sjvanrossum/beam-wasm) didn't
>>>> come through until a bit later yesterday, I think.
>>>>
>>>> Sean, I agree with your suggestion of Arrow as the interchange format
>>>> for Wasm transforms and it's something I thought about exploring when I was
>>>> adding serialization/deserialization of complex (meaning anything that's
>>>> not an integer or float in the context of Wasm) data types in the demo.
>>>> It's an unfortunate bit of overhead which could very well be solved with
>>>> Arrow and shared memory between Wasm modules.
>>>> I've seen Wasm transforms pop up in a few other places, notably in
>>>> streaming data platforms like Fluvio and Redpanda, and they seem to incur
>>>> the same overhead when moving data into and out of the guest context, so
>>>> maybe it's negligible, but I haven't done any serious benchmarking yet to
>>>> validate that.
>>>>
>>>> Regards,
>>>>
>>>> Steve
>>>>
>>>> On Thu, Jun 16, 2022 at 3:04 AM Robert Burke <rob...@frantil.com>
>>>> wrote:
>>>>
>>>>> Obligatory mention that WASM is basically an architecture that any
>>>>> well-meaning compiler can target, e.g. the Go compiler:
>>>>>
>>>>>
>>>>> https://www.bradcypert.com/an-introduction-to-targeting-web-assembly-with-golang/
>>>>>
>>>>> (Among many articles for the last few years)
>>>>>
>>>>> Robert Burke
>>>>> Beam Go Busybody
>>>>>
>>>>> On Wed, Jun 15, 2022, 2:04 PM Sean Jensen-Grey <jenseng...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Heh, my stage fright was so strong, I didn't realize that the talk
>>>>>> was recorded. :)
>>>>>>
>>>>>> Steven, I'd love to chat about Wasm in Beam. This email is a bit
>>>>>> rough.
>>>>>>
>>>>>> I haven't explored Wasm in Beam much since that talk. I think the
>>>>>> most compelling use is in the portability of logic between data
>>>>>> processing systems, especially in the use of probabilistic data
>>>>>> structures like Bloom Filters, Count-Min-Sketch, and HyperLogLog, where
>>>>>> it is nice to persist the data structure and use it on a different
>>>>>> system. Like generating a Bloom filter in Beam and using it inside of a
>>>>>> BQ query without having to reimplement and test across many platforms.
>>>>>>
>>>>>> I have used Wasm in BQ, as BQ UDFs are driven by V8. Anywhere V8
>>>>>> exists, Wasm support exists for free unless the embedder goes out of
>>>>>> their way to disable it. So it is supported in Deno/Node as well. In
>>>>>> Python, Wasm support via Wasmtime
>>>>>> <https://github.com/bytecodealliance/wasmtime> is really good. There
>>>>>> are *many* options for execution environments; one of the downsides of
>>>>>> passing through a JS one is string and number support (float/int64)
>>>>>> issues, afaik. I could be wrong, maybe JS has fixed all this by now.
>>>>>>
>>>>>> The qualities in order of importance (for me) are
>>>>>>
>>>>>> 1. Portability, run the same code everywhere
>>>>>> 2. Security, memory safety for the caller. Running Wasm inside of
>>>>>> Python should never crash your Python interpreter. The capability
>>>>>> model ensures that the Wasm module can only do what you allow it to.
>>>>>> 3. Performance (portable), compile once and run everywhere within
>>>>>> some margin of native. Python makes this look good :)
>>>>>>
>>>>>> I think something worth exploring is moving opaque-ish Arrow objects
>>>>>> around via Beam, so that Beam is now mostly in the control plane and
>>>>>> computation happens in Wasm. This should reduce the serialization
>>>>>> overhead and also get Python out of the datapath.
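To make the Bloom filter hand-off mentioned earlier in this message concrete, here is a minimal sketch of persisting a filter as bytes on one system and querying it on another. It is plain Python, with no Wasm involved. In the scenario described, the `_positions` hashing logic is the part that would be compiled to Wasm once so both systems share it. The class, the SHA-256-based hashing, and the 8-byte header layout are assumptions made for this example, not an existing Beam or BigQuery format.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Tiny Bloom filter whose state round-trips through a byte string."""

    def __init__(self, num_bits: int, num_hashes: int, bits: bytes = b"") -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(bits) if bits else bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(seed.to_bytes(4, "little") + item).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    def to_bytes(self) -> bytes:
        # Assumed layout: 4-byte num_bits, 4-byte num_hashes, then the bit array.
        header = self.num_bits.to_bytes(4, "little") + self.num_hashes.to_bytes(4, "little")
        return header + bytes(self.bits)

    @classmethod
    def from_bytes(cls, blob: bytes) -> "BloomFilter":
        num_bits = int.from_bytes(blob[:4], "little")
        num_hashes = int.from_bytes(blob[4:8], "little")
        return cls(num_bits, num_hashes, blob[8:])


# "Producer" side builds and persists the filter.
producer = BloomFilter(num_bits=1024, num_hashes=3)
producer.add(b"beam")
blob = producer.to_bytes()

# "Consumer" side restores it and queries without rebuilding.
consumer = BloomFilter.from_bytes(blob)
print(b"beam" in consumer)  # True: Bloom filters have no false negatives
```

The hand-off only works if both sides agree on the hash function and the byte layout, which is exactly the reimplement-and-test burden a shared Wasm module would remove.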
>>>>>>
>>>>>> I see someone exploring Wasm+Arrow here,
>>>>>> https://github.com/domoritz/arrow-wasm
>>>>>>
>>>>>> Another possibly interesting avenue to explore is compiling command
>>>>>> line programs to WASI (WebAssembly System Interface), the POSIX-like
>>>>>> shim, so that they can be run in-process without the fork/exec/pipe
>>>>>> overhead of running a subprocess. A neat demo might be running something
>>>>>> like jq <https://stedolan.github.io/jq/> inside of a Beam job.
>>>>>>
>>>>>> Not to make Wasm sound like a Python-only technology, it can be used
>>>>>> from Java/JVM via
>>>>>>
>>>>>> - https://www.graalvm.org/22.1/reference-manual/wasm/
>>>>>> - https://github.com/kawamuray/wasmtime-java
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 15, 2022 at 9:35 AM Pablo Estrada <pabl...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> adding Steven in case he didn't get the replies : )
>>>>>>>
>>>>>>> On Wed, Jun 15, 2022 at 9:29 AM Daniel Collins <dpcoll...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> If we ever do anything with the JS runtime, this would seem to be
>>>>>>>> the best place to run WASM.
>>>>>>>>
>>>>>>>> On Tue, Jun 14, 2022 at 8:13 PM Brian Hulette <bhule...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> FYI: @Sean Jensen-Grey <jenseng...@google.com> gave a talk back
>>>>>>>>> in 2020 where he had integrated Rust with the Python SDK. I thought
>>>>>>>>> he used WebAssembly for that, but it looks like he used some other
>>>>>>>>> approaches, and his talk mentioned WebAssembly as future work. Not
>>>>>>>>> sure if that was ever explored.
>>>>>>>>>
>>>>>>>>> https://www.youtube.com/watch?v=fZK_Tiu7q1o
>>>>>>>>> https://github.com/seanjensengrey/beam-rust-python-java
>>>>>>>>>
>>>>>>>>> Brian
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jun 14, 2022 at 5:05 PM Ahmet Altay <al...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Adding @Lukasz Cwik <lc...@google.com> - he was interested in
>>>>>>>>>> the WebAssembly topic.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 14, 2022 at 3:09 PM Pablo Estrada <pabl...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Would you open a pull request for it? Or at least share a
>>>>>>>>>>> branch? : )
>>>>>>>>>>> Even if we don't want to merge it, it would be great to have a
>>>>>>>>>>> PR as a way to showcase the work, its usefulness, and receive
>>>>>>>>>>> comments on this thread once we can see something more specific.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 14, 2022 at 3:05 PM Steven van Rossum <
>>>>>>>>>>> sjvanros...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>
>>>>>>>>>>>> I had some spare time yesterday and thought it'd be fun to
>>>>>>>>>>>> implement a transform which runs WebAssembly modules as a
>>>>>>>>>>>> lightweight way to implement cross language transforms for
>>>>>>>>>>>> languages which don't (yet) have an SDK implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> I've got a small proof of concept running in the Python SDK as
>>>>>>>>>>>> a DoFn with Wasmer as the WebAssembly runtime and simple support
>>>>>>>>>>>> for marshalling between the host and guest environment with the
>>>>>>>>>>>> RowCoder.
>>>>>>>>>>>> The module I've constructed is mostly useless, but demonstrates
>>>>>>>>>>>> the host copying the encoded element into the guest's memory, the
>>>>>>>>>>>> guest copying those bytes elsewhere in its linear memory buffer,
>>>>>>>>>>>> the guest calling back to the host with the offset and size, and
>>>>>>>>>>>> the host copying and decoding from the guest's memory.
>>>>>>>>>>>>
>>>>>>>>>>>> Any thoughts/interest? I'm not sure where I was going with
>>>>>>>>>>>> this, since it was mostly just a "wouldn't it be cool if..." on a
>>>>>>>>>>>> Monday afternoon, but I can see a few use cases for this.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Steve
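The host/guest exchange described in the original post (copy in, copy within linear memory, call back with offset and size, copy out) can be sketched without a Wasm runtime. The simulation below uses a `bytearray` in place of the guest's linear memory and a plain Python callable in place of the imported host function. `FakeWasmInstance`, `alloc`, `process`, and `round_trip` are hypothetical names, not the API of Wasmer or of the linked demo, and raw bytes stand in for RowCoder output.

```python
from typing import Callable, List


class FakeWasmInstance:
    """Hypothetical stand-in for a Wasm module instance with linear memory."""

    def __init__(self, emit: Callable[[int, int], None], memory_size: int = 1024) -> None:
        self.memory = bytearray(memory_size)  # plays the role of linear memory
        self._emit = emit                     # host function "imported" by the guest
        self._next_free = 0

    def alloc(self, size: int) -> int:
        """Bump allocator: reserve size bytes and return their offset."""
        offset = self._next_free
        if offset + size > len(self.memory):
            raise MemoryError("guest linear memory exhausted")
        self._next_free += size
        return offset

    def process(self, offset: int, size: int) -> None:
        """Copy the input elsewhere in guest memory, then call back to the host."""
        out = self.alloc(size)
        self.memory[out:out + size] = self.memory[offset:offset + size]
        self._emit(out, size)


def round_trip(encoded: bytes) -> bytes:
    """Host side: copy bytes into the guest, run it, copy the result back out."""
    results: List[bytes] = []

    def emit(offset: int, size: int) -> None:
        # Callback invoked by the guest with the location of its output.
        results.append(bytes(guest.memory[offset:offset + size]))

    guest = FakeWasmInstance(emit)
    src = guest.alloc(len(encoded))
    guest.memory[src:src + len(encoded)] = encoded  # host -> guest copy
    guest.process(src, len(encoded))
    return results[0]


print(round_trip(b"encoded-row"))  # b'encoded-row'
```

Even in this toy form the element is copied three times (into the guest, within the guest, and back out), which is the overhead the Arrow and shared-memory suggestions elsewhere in the thread aim to reduce.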