Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

Weston Pace Tue, 26 Apr 2022 08:04:57 -0700

In addition to the memory copy, it looks like WASM is going to
bounds-check all loads/stores.  It does, at least, have some vectorized
load/store operations, which can help amortize the cost.  It appears
you aren't going to get the same performance as native today using
WASM, but I'm guessing that is an active area of research and
investment.
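
To make the two costs concrete, here is a minimal Python sketch.  It
does not use any real WASM runtime: LinearMemory, load_i32, load_i32x4
and sum_i32 are hypothetical names made up for illustration, and the
bytearray only stands in for a sandbox's private linear memory.  It
shows the host copying data in, a bounds check on every load, and a
4-lane load that pays for one check per four values.

```python
import struct


class LinearMemory:
    """Toy stand-in for a sandbox's private linear memory (an assumption,
    not a real WASM engine)."""

    def __init__(self, size: int) -> None:
        self._buf = bytearray(size)

    def _check(self, offset: int, length: int) -> None:
        # Every access is validated against the region the sandbox owns.
        if offset < 0 or length < 0 or offset + length > len(self._buf):
            raise IndexError("out-of-bounds access (a real engine would trap)")

    def write(self, offset: int, data: bytes) -> None:
        # Host -> sandbox: this is the memory copy discussed in the thread.
        self._check(offset, len(data))
        self._buf[offset:offset + len(data)] = data

    def load_i32(self, offset: int) -> int:
        # Scalar load: one bounds check per 4-byte value.
        self._check(offset, 4)
        return struct.unpack_from("<i", self._buf, offset)[0]

    def load_i32x4(self, offset: int) -> tuple:
        # Vector-style load: one bounds check covers four values.
        self._check(offset, 16)
        return struct.unpack_from("<4i", self._buf, offset)


def sum_i32(mem: LinearMemory, offset: int, count: int) -> int:
    """Sum `count` little-endian int32 values starting at `offset`."""
    total = 0
    i = 0
    while i + 4 <= count:  # bulk of the work uses the 4-lane load
        total += sum(mem.load_i32x4(offset + 4 * i))
        i += 4
    while i < count:  # remainder falls back to scalar loads
        total += mem.load_i32(offset + 4 * i)
        i += 1
    return total


# The host owns this buffer; the sandbox can only see the copy.
host_values = struct.pack("<6i", 1, 2, 3, 4, 5, 6)
mem = LinearMemory(64)
mem.write(0, host_values)
result = sum_i32(mem, 0, 6)
```

A read past the end of the region (for example mem.load_i32(62) on the
64-byte memory above) raises instead of touching host memory, which is
the property the sandbox is buying with the extra checks.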


On Tue, Apr 26, 2022 at 5:00 AM Jorge Cardoso Leitão
<[email protected]> wrote:
>
> I need to correct myself here - it is currently not possible to pass memory
> at zero cost between the engine and WASM interpreter. This is related to
> your point about safety - WASM provides memory safety guarantees because it
> controls the memory region that it can read from and write to. Therefore,
> currently passing data from and into WASM requires a memcopy.
>
> There is a proposal [1] to improve the situation, but currently would incur
> a cost in the query engine, since we would need to memcopy the regions
> around.
>
> I forgot that on my poc I passed the parquet file from js to WASM and
> de-serialized it to arrow directly in wasm - so memory was already being
> allocated from within WASM sandbox, not JS. Sorry for the confusion.
>
> [1] https://github.com/WebAssembly/design/issues/1439
>
> Best,
> Jorge
>
>
>
> On Tue, Apr 26, 2022 at 3:43 PM Antoine Pitrou <[email protected]> wrote:
>
> >
> > Le 26/04/2022 à 16:30, Gavin Ray a écrit :
> > > Antoine, sandboxing comes into play from two places:
> > >
> > > 1) The WASM specification itself, which puts bounds on the types of
> > > behaviors possible
> > > 2) The implementation of the WASM bytecode interpreter chosen, like Jorge
> > > mentioned in the comment above
> > >
> > > The wasmtime docs have a pretty solid section covering the sandboxing
> > > guarantees of WASM, and then the interpreter-specific behavior/abilities
> > > of wasmtime FWIW:
> > > https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
> >
> > This doesn't really answer my question, does it?
> >
> >
> > >
> > > On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou <[email protected]> wrote:
> > >
> > >>
> > >> Le 26/04/2022 à 16:18, Jorge Cardoso Leitão a écrit :
> > >>>> Would WASM be able to interact in-process with non-WASM buffers safely?
> > >>>
> > >>> AFAIK yes. My understanding from playing with it in JS is that a
> > >>> WASM-backed udf execution would be something like:
> > >>>
> > >>> 1. compile the C++/Rust/etc UDF to WASM (a binary format)
> > >>> 2. provide a small WASM-compiled middleware of the C data interface
> > >>>    that consumes (binary, C data interface pointers)
> > >>> 3. ship a WASM interpreter as part of the query engine
> > >>> 4. pass binary and C data interface pointers from the query engine
> > >>>    program to the interpreter with WASM-compiled middleware
> > >>
> > >> Ok, but the key word in my question was "safely". What mechanisms are in
> > >> place such that the WASM user function will not access Arrow buffers out
> > >> of bounds? Nothing really stands out in
> > >> https://webassembly.github.io/spec/core/index.html, but it's the first
> > >> time I try to have a look at the WebAssembly spec.
> > >>
> > >> Regards
> > >>
> > >> Antoine.
> > >>
> > >>
> > >>>
> > >>> Step 2 is necessary to read the buffers from FFI and output the result
> > >>> back from the interpreter once the UDF is done, similar to what we do in
> > >>> datafusion to run Python from Rust. In the case of datafusion the
> > >>> "binary" is a Python function, which has security implications since the
> > >>> Python interpreter allows everything by default.
> > >>>
> > >>> Best,
> > >>> Jorge
> > >>>
> > >>>
> > >>>
> > >>> On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou <[email protected]> wrote:
> > >>>
> > >>>>
> > >>>> Le 25/04/2022 à 23:04, David Li a écrit :
> > >>>>> The WebAssembly documentation has a rundown of the techniques used:
> > >>>>> https://webassembly.org/docs/security/
> > >>>>>
> > >>>>> I think usually you would run WASM in-process, though we could indeed
> > >>>>> also put it in a subprocess to further isolate things.
> > >>>>
> > >>>> Would WASM be able to interact in-process with non-WASM buffers safely?
> > >>>> It's not obvious from reading the page above.
> > >>>>
> > >>>>
> > >>>>>
> > >>>>> It would be interesting to define the Flight "harness" protocol.
> > >>>>> Handling heterogeneous arguments may require some evolution in Flight
> > >>>>> (e.g. if the function is non scalar and arguments are of different
> > >>>>> length - we'd need something like the ColumnBag proposal, so this
> > >>>>> might be a good reason to revive that).
> > >>>>>
> > >>>>> On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
> > >>>>>> Le 25/04/2022 à 22:19, Wes McKinney a écrit :
> > >>>>>>> I was going to reply to this e-mail thread on user@ but thought I
> > >>>>>>> would start a new thread on dev@.
> > >>>>>>>
> > >>>>>>> Executing user-defined functions in memory, especially untrusted
> > >>>>>>> functions, in general is unsafe. For "trusted" functions, having an
> > >>>>>>> in-memory API for writing them in user languages is very useful. I
> > >>>>>>> remember tinkering with adding UDFs in Impala with LLVM IR, which
> > >>>>>>> would allow UDFs to have performance consistent with built-ins
> > >>>>>>> (because built-in functions are all inlined into code-generated
> > >>>>>>> expressions), but segfaults would bring down the server, so only
> > >>>>>>> admins could be trusted to add new UDFs.
> > >>>>>>>
> > >>>>>>> However, I wonder if we should eventually define an "external UDF"
> > >>>>>>> protocol and an example UDF "harness", using Flight to do RPC across
> > >>>>>>> the process boundaries. So the idea is that an external local UDF
> > >>>>>>> Flight execution service is spun up, and then data is sent to the UDF
> > >>>>>>> in a DoExchange call.
> > >>>>>>>
> > >>>>>>> As Jacques pointed out in an interview [1], a compelling solution to
> > >>>>>>> the UDF sandboxing problem is WASM. This allows "untrusted" WASM
> > >>>>>>> functions to be run safely in-process.
> > >>>>>>
> > >>>>>> How does the sandboxing work in this case? Is it simply executing in a
> > >>>>>> separate process with restricted capabilities, or are other mechanisms
> > >>>>>> put in place?
> > >>>>
> > >>>
> > >>
> > >
> >
