Also, this may sound counter-intuitive, but LLVM IR is actually
architecture-specific because it is tied to various parameters of the
architecture such as type widths and alignments.
On 26/04/2022 at 19:51, Sasha Krassovsky wrote:
I think I can help answer these:
1) LLVM IR is an intermediate representation for compilers; WASM is an open standard for sandboxed computation. They fulfill different but complementary roles. If the query engine were handed LLVM IR, it would still have to JIT the IR to wasm in order to maintain the sandboxing guarantees. It would also tie the query engine to LLVM, whereas there may be other wasm generators out there.
2) The idea would be for the user to use some external tool or compiler that generates wasm, and to pass the wasm to the query engine. This would mean that you could write a UDF in any language of your choosing. It seems like it wouldn't be much work to use your existing NumPy + Numba pipeline as well; you would just have to add a step to generate wasm from your LLVM IR before passing it to the engine.
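For concreteness, a rough sketch of what the engine side could look like, using the wasmtime Python bindings. The file name "udf.wasm", the exported function "add_one", and calling it directly are all hypothetical; a real engine would register the function against its own UDF API, and the exports API differs a bit between wasmtime-py versions:

    # Hypothetical sketch: load a user-supplied WASM UDF with the wasmtime
    # Python bindings and call an exported function inside the sandbox.
    from wasmtime import Store, Module, Instance

    store = Store()
    module = Module.from_file(store.engine, "udf.wasm")  # user-provided binary
    instance = Instance(store, module, [])               # no imports in this toy case

    add_one = instance.exports(store)["add_one"]         # exported scalar UDF
    print(add_one(store, 41))                            # -> 42, executed in the sandbox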
Sasha
On 26 Apr 2022, at 10:39, Li Jin <[email protected]> wrote:
This is a very interesting topic and one that we care a lot about when
using/thinking about Arrow compute.
I come from Python data analytics where most of our users use Pandas/Numpy.
This is also my first time learning about WASM, and my previous understanding of "Python UDF in the Arrow C++ compute engine" was more along the lines of:

UDF written against the NumPy API -> use Numba to compile the UDF into LLVM IR -> execute the LLVM IR within the Arrow C++ engine on Arrow arrays

which, to my understanding, is similar to the Impala UDFs with LLVM IR that Wes mentioned.
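As a concrete illustration of that existing pipeline (the UDF itself is made up; only the Numba calls are real), a NumPy-style function can be compiled with Numba and its LLVM IR inspected afterwards:

    # Illustration of the existing pipeline: NumPy-style UDF -> Numba -> LLVM IR.
    import numpy as np
    import numba

    @numba.njit
    def my_udf(x):
        # element-wise UDF written against the NumPy API
        return np.sqrt(x) + 1.0

    arr = np.arange(4, dtype=np.float64)
    my_udf(arr)  # the first call triggers compilation

    # LLVM IR for each compiled signature, keyed by argument types
    for sig, ir in my_udf.inspect_llvm().items():
        print(sig, len(ir), "characters of LLVM IR")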
I wonder how WASM would potentially change things. A couple of questions:
(1) What is the advantage of using WASM instead of something like LLVM IR?
(2) Do we envision using something like the NumPy API as the language for writing these UDFs, or something completely different (another DSL)?
Li
On Tue, Apr 26, 2022 at 11:04 AM Weston Pace <[email protected]> wrote:
In addition to the memory copy it looks like WASM is going to bounds
check all loads/stores. It does, at least, have some vectorized
load/store operations so that can help amortize the cost. It appears
you aren't going to get the same performance as native today using
WASM but I'm guessing that is an active area of research and
investment.
On Tue, Apr 26, 2022 at 5:00 AM Jorge Cardoso Leitão <[email protected]> wrote:
I need to correct myself here - it is currently not possible to pass memory at zero cost between the engine and the WASM interpreter. This is related to your point about safety - WASM provides memory safety guarantees because it controls the memory region that it can read from and write to. Therefore, passing data into and out of WASM currently requires a memory copy.

There is a proposal [1] to improve the situation, but it would currently still incur a cost in the query engine, since we would need to copy the memory regions around.
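A rough sketch of the copy being described, again with the wasmtime Python bindings. It assumes a module that exports its linear memory as "memory" and an allocator as "alloc"; both names, the file "udf.wasm", and the availability of Memory.write in the installed wasmtime version are assumptions:

    # Sketch of the copy into the WASM linear memory that is currently required.
    from wasmtime import Store, Module, Instance

    store = Store()
    module = Module.from_file(store.engine, "udf.wasm")
    instance = Instance(store, module, [])
    exports = instance.exports(store)

    memory = exports["memory"]                # the module's linear memory
    alloc = exports["alloc"]                  # exported allocator

    host_buffer = b"\x01\x02\x03\x04"         # e.g. bytes of an Arrow buffer
    offset = alloc(store, len(host_buffer))   # reserve space inside the sandbox
    memory.write(store, host_buffer, offset)  # the memcpy that cannot be avoided today

    # The WASM UDF is then invoked with (offset, length) into its own memory.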
I forgot that in my PoC I passed the Parquet file from JS to WASM and deserialized it to Arrow directly in WASM - so memory was already being allocated from within the WASM sandbox, not JS. Sorry for the confusion.
[1] https://github.com/WebAssembly/design/issues/1439
Best,
Jorge
On Tue, Apr 26, 2022 at 3:43 PM Antoine Pitrou <[email protected]> wrote:
On 26/04/2022 at 16:30, Gavin Ray wrote:
Antoine, sandboxing comes into play from two places:
1) The WASM specification itself, which puts bounds on the types of behaviors possible
2) The implementation of the WASM bytecode interpreter chosen, like Jorge mentioned in the comment above

The wasmtime docs have a pretty solid section covering the sandboxing guarantees of WASM, and then the interpreter-specific behavior/abilities of wasmtime, FWIW:
https://docs.wasmtime.dev/security-sandboxing.html#webassembly-core
This doesn't really answer my question, does it?
On Tue, Apr 26, 2022 at 10:22 AM Antoine Pitrou <[email protected]> wrote:
On 26/04/2022 at 16:18, Jorge Cardoso Leitão wrote:
Would WASM be able to interact in-process with non-WASM buffers safely?

AFAIK yes. My understanding from playing with it in JS is that a WASM-backed UDF execution would be something like the following (a rough sketch of the engine side of step 2 appears after the list):
1. compile the C++/Rust/etc. UDF to WASM (a binary format)
2. provide a small WASM-compiled middleware of the C data interface that consumes (binary, C data interface pointers)
3. ship a WASM interpreter as part of the query engine
4. pass the binary and the C data interface pointers from the query engine program to the interpreter with the WASM-compiled middleware
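A rough sketch of the engine side of step 2 - exporting an Arrow array through the C data interface so a (hypothetical) WASM-compiled middleware could consume the pointers. pyarrow.cffi and Array._export_to_c are existing pyarrow facilities; everything downstream of the export is hand-waved:

    # Engine side of step 2: export an Arrow array through the C data interface.
    import pyarrow as pa
    from pyarrow.cffi import ffi

    arr = pa.array([1, 2, 3], type=pa.int64())

    c_schema = ffi.new("struct ArrowSchema*")
    c_array = ffi.new("struct ArrowArray*")
    schema_ptr = int(ffi.cast("uintptr_t", c_schema))
    array_ptr = int(ffi.cast("uintptr_t", c_array))

    arr._export_to_c(array_ptr, schema_ptr)  # fill the C data interface structs

    # A real engine would now hand (array_ptr, schema_ptr) to the middleware;
    # given the sandboxing discussed elsewhere in the thread, the buffers they
    # point to would still have to be copied into the WASM linear memory today.
    print(hex(array_ptr), hex(schema_ptr))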
Ok, but the key word in my question was "safely". What mechanisms are in place such that the WASM user function will not access Arrow buffers out of bounds? Nothing really stands out in https://webassembly.github.io/spec/core/index.html, but it's the first time I try to have a look at the WebAssembly spec.
Regards
Antoine.
Step 2 is necessary to read the buffers from FFI and output the result back from the interpreter once the UDF is done, similar to what we do in DataFusion to run Python from Rust. In the case of DataFusion the "binary" is a Python function, which has security implications since the Python interpreter allows everything by default.
Best,
Jorge
On Tue, Apr 26, 2022 at 2:56 PM Antoine Pitrou <[email protected]> wrote:
On 25/04/2022 at 23:04, David Li wrote:
The WebAssembly documentation has a rundown of the techniques used: https://webassembly.org/docs/security/

I think usually you would run WASM in-process, though we could indeed also put it in a subprocess to further isolate things.
Would WASM be able to interact in-process with non-WASM buffers safely? It's not obvious from reading the page above.
It would be interesting to define the Flight "harness" protocol. Handling heterogeneous arguments may require some evolution in Flight (e.g. if the function is non-scalar and the arguments are of different lengths, we'd need something like the ColumnBag proposal, so this might be a good reason to revive that).
On Mon, Apr 25, 2022, at 16:35, Antoine Pitrou wrote:
On 25/04/2022 at 22:19, Wes McKinney wrote:
I was going to reply to this e-mail thread on user@ but thought I would start a new thread on dev@.

Executing user-defined functions in memory, especially untrusted functions, is in general unsafe. For "trusted" functions, having an in-memory API for writing them in user languages is very useful. I remember tinkering with adding UDFs to Impala with LLVM IR, which would allow UDFs to have performance consistent with built-ins (because built-in functions are all inlined into code-generated expressions), but segfaults would bring down the server, so only admins could be trusted to add new UDFs.
However, I wonder if we should eventually define an "external UDF" protocol and an example UDF "harness", using Flight to do RPC across the process boundaries. So the idea is that an external local UDF Flight execution service is spun up, and then data is sent to the UDF in a DoExchange call.
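A minimal sketch of what such a harness might look like with pyarrow's Flight bindings - the UDF, the port, and the lack of any real protocol (descriptors, metadata, error handling) are placeholders rather than a proposal:

    # Sketch of an out-of-process UDF harness: a local Flight service that
    # applies a user function to each batch received via DoExchange and
    # streams the results back.
    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.flight as flight

    def user_udf(batch: pa.RecordBatch) -> pa.RecordBatch:
        # placeholder scalar UDF: add 1 to the first column
        return pa.record_batch([pc.add(batch.column(0), 1)], names=["x_plus_one"])

    class UdfHarness(flight.FlightServerBase):
        def do_exchange(self, context, descriptor, reader, writer):
            started = False
            for chunk in reader:
                result = user_udf(chunk.data)
                if not started:
                    writer.begin(result.schema)  # declare the output schema once
                    started = True
                writer.write_batch(result)

    if __name__ == "__main__":
        UdfHarness("grpc://127.0.0.1:8815").serve()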
As Jacques pointed out in an interview [1], a compelling solution to the UDF sandboxing problem is WASM. This allows "untrusted" WASM functions to be run safely in-process.
How does the sandboxing work in this case? Is it simply executing in a separate process with restricted capabilities, or are other mechanisms put in place?