Re: Designing standards for "sandboxed" Arrow user-defined functions [was Re: User defined "Arrow Compute Function"]

Antoine Pitrou Tue, 26 Apr 2022 07:22:23 -0700


On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:

Would WASM be able to interact in-process with non-WASM buffers safely?


AFAIK yes. My understanding from playing with it in JS is that a
WASM-backed UDF execution would be something like:

1. compile the C++/Rust/etc. UDF to WASM (a binary format)
2. provide a small WASM-compiled middleware of the C data interface that
consumes (binary, C data interface pointers)
3. ship a WASM interpreter as part of the query engine
4. pass the binary and C data interface pointers from the query engine program
to the interpreter with the WASM-compiled middleware
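
For illustration only, the data flow in the four steps above can be mimicked
with a toy model in plain Python. This is not a real WASM runtime and does not
use the Arrow C data interface: ToySandbox, copy_in, copy_out, load_i32,
store_i32 and add_one are all hypothetical names. The assumption is that the
"UDF" only touches data through the sandbox's own private memory, which the
host fills and drains around the call.

```python
import struct


class ToySandbox:
    """Hypothetical stand-in for a WASM instance with one private memory."""

    def __init__(self, size):
        self.memory = bytearray(size)

    def _check(self, offset, length):
        # Every access is validated against the sandbox's own memory size.
        if offset < 0 or length < 0 or offset + length > len(self.memory):
            raise IndexError("out-of-bounds access")

    def copy_in(self, offset, data):
        # Host side: copy an input buffer into the sandbox (steps 2 and 4).
        self._check(offset, len(data))
        self.memory[offset:offset + len(data)] = data

    def copy_out(self, offset, length):
        # Host side: copy the result back out once the UDF is done.
        self._check(offset, length)
        return bytes(self.memory[offset:offset + length])

    def load_i32(self, offset):
        self._check(offset, 4)
        return struct.unpack_from("<i", self.memory, offset)[0]

    def store_i32(self, offset, value):
        self._check(offset, 4)
        struct.pack_into("<i", self.memory, offset, value)


def add_one(sandbox, offset, count):
    # The "UDF": it is handed an offset and a count, never a host pointer.
    for i in range(count):
        addr = offset + 4 * i
        sandbox.store_i32(addr, sandbox.load_i32(addr) + 1)


host_buffer = struct.pack("<3i", 1, 2, 3)   # data owned by the query engine
sandbox = ToySandbox(64)
sandbox.copy_in(0, host_buffer)
add_one(sandbox, 0, 3)
result = list(struct.unpack("<3i", sandbox.copy_out(0, 12)))
```

In this toy model the host buffer is copied rather than shared, so the sketch
says nothing about whether a real runtime could expose host memory without a
copy.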

Ok, but the key word in my question was "safely". What mechanisms are in
place such that the WASM user function will not access Arrow buffers out
of bounds? Nothing really stands out in
https://webassembly.github.io/spec/core/index.html, but it's the first
time I try to have a look at the WebAssembly spec.


Regards

Antoine.


Step 2 is necessary to read the buffers from FFI and output the result back
from the interpreter once the UDF is done, similar to what we do in
datafusion to run Python from Rust. In the case of datafusion the "binary"
is a Python function, which has security implications since the Python
interpreter allows everything by default.

Best,
Jorge



On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou <[email protected]> wrote:


On 25/04/2022 at 23:04, David Li wrote:

The WebAssembly documentation has a rundown of the techniques used:

https://webassembly.org/docs/security/


I think usually you would run WASM in-process, though we could indeed
also put it in a subprocess to further isolate things.

Would WASM be able to interact in-process with non-WASM buffers safely?
It's not obvious from reading the page above.


It would be interesting to define the Flight "harness" protocol.

Handling heterogeneous arguments may require some evolution in Flight (e.g.
if the function is non-scalar and arguments are of different lengths - we'd
need something like the ColumnBag proposal, so this might be a good reason
to revive that).


On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:

On 25/04/2022 at 22:19, Wes McKinney wrote:

I was going to reply to this e-mail thread on user@ but thought I
would start a new thread on dev@.

Executing user-defined functions in memory, especially untrusted
functions, in general is unsafe. For "trusted" functions, having an
in-memory API for writing them in user languages is very useful. I
remember tinkering with adding UDFs in Impala with LLVM IR, which
would allow UDFs to have performance consistent with built-ins
(because built-in functions are all inlined into code-generated
expressions), but segfaults would bring down the server, so only
admins could be trusted to add new UDFs.

However, I wonder if we should eventually define an "external UDF"
protocol and an example UDF "harness", using Flight to do RPC across
the process boundaries. So the idea is that an external local UDF
Flight execution service is spun up, and then data is sent to the UDF
in a DoExchange call.
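
As a rough sketch of the out-of-process idea only, the snippet below runs a
UDF in a separate Python process and streams batches to it. It deliberately
does not use Arrow Flight or DoExchange: JSON lines over stdin/stdout are a
stand-in transport, and WORKER, udf and run_external_udf are hypothetical
names. The point it illustrates is that the function body runs outside the
engine process, so a crash in it surfaces as an error instead of taking the
engine down.

```python
import json
import subprocess
import sys

# Child-side "harness": one JSON batch per input line, one JSON reply per line.
WORKER = r"""
import json, sys

def udf(values):
    return [v * 2 for v in values]

for line in sys.stdin:
    print(json.dumps(udf(json.loads(line))), flush=True)
"""


def run_external_udf(batches):
    """Send each batch to the worker process and collect its replies."""
    results = []
    with subprocess.Popen(
        [sys.executable, "-c", WORKER],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    ) as proc:
        for batch in batches:
            proc.stdin.write(json.dumps(batch) + "\n")
            proc.stdin.flush()
            reply = proc.stdout.readline()
            if not reply:
                # The worker died; the calling process is still alive.
                raise RuntimeError("UDF worker exited unexpectedly")
            results.append(json.loads(reply))
    return results


doubled = run_external_udf([[1, 2, 3], [10]])
```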

As Jacques pointed out in an interview [1], a compelling solution to
the UDF sandboxing problem is WASM. This allows "untrusted" WASM
functions to be run safely in-process.


How does the sandboxing work in this case? Is it simply executing in a
separate process with restricted capabilities, or are other mechanisms
put in place?
