Re: [Format][Union] polymorphic vectors vs ADT style vectors

Felipe Oliveira Carvalho Tue, 02 Apr 2024 15:33:13 -0700

Algebraic Data Types (Sums and Products) are very abstract. This means
they don't fully specify a concrete/physical layout [1]: different
physical layouts can match the same algebraic definition. As an
in-memory data format specification, Arrow doesn't and shouldn't
rigidly specify concretization rules. It's up to the programmer or a
semantic layer above to find the best way to **concretize** abstract
types.

"Sum Types" and "Tagged Unions" are often used interchangeably, but
sum types are more powerful than what tagged unions can mean in C and
much more powerful than Arrow's tagged unions.

type 'a list = Nil [of void] | Cons of ('a * 'a list)

can be made concrete in C by using a struct containing a tag, an 'a,
and a void* pointer (the pointer is necessary to make the recursion
work). Arrow offers dedicated List types instead of defining it via a
more general abstraction. Additionally, Arrow unions can't handle
unbounded recursion — you would have to find out, a-priori, the size
of the biggest list, to allocate the array with enough Cons levels.

> We have an `op` union vector with three child vectors `put`, `delete`, 
> `erase`. `delete` and `erase` have the same type but represent different 
> things.

I'm going to assume delete and erase are receiving some kind of
identifier and write down the abstract sum type:

type id = fixed_size_binary(16)  // 16-byte UUID
type 'a op = Put of (id * 'a) | Delete of id | Erase of id

So the question becomes how to better concretize this abstract
algebraic sum type in Arrow.

If you use Arrow's tagged unions directly, you will get a very
suboptimal in-memory layout: unnecessary nesting and data
dependencies. Think of the | operator in types as a very expensive
operator that you should avoid by performing algebraic transformations
until you eliminate | completely or end with something irreducible
like `int8 | string`.

> You are thinking of a product type which is better represented by a 
> StructVector with nullable child vectors.

What Weston did was intuitively apply transformations to the original
sum of products type into an equivalent and "flatter" product type.
People familiar with the Arrow internals do this intuitively.

Similarly to what Arrow Tagged Unions do internally [2], let's create
an 8-bit tag for op:

type op_tag = int8  (* for PutTag=0, DeleteTag=1, EraseTag=2 *)

and rewrite op to an storage-friendly supertype [3] that is as close
as possible to the original op.

type 'a op
 = (0 * id * 'a) | (1 * id) | (2 * id)
<= op_tag * ((id * 'a) | id | id) (* this type can contain more values
than original op (malformed ones) *)
 = op_tag * ((id * 'a) | id) (* since we're dealing with unions
abstractly here id|id <=> id *)
<= op_tag * ((id * 'a) option | id option) (* turn sum into product of
optionals *) [4]
<= op_tag * id * ('a option | empty option) (* unnest type present in
all branches *)
 = op_tag * id * ('a option | empty)
 = op_tag * id * 'a option

Now, to guarantee type safety, you define all your original sum type
constructors in terms of the constructors of this supertype:

let Put id value = (PutTag, id, Some value)
let Delete id = (DeleteTag, id, None)
let Erase id = (EraseTag, id, None)

and rewrite all the destructurings in terms of the new type:

let get_id = function Put (id, _) | Delete id | Erase id -> id
let get_payload = function Put (_, payload) -> Some payload | None

becomes much less indirect

let get_id = function (_, id, _) -> id
let get_payload = function (_, _, payload) -> payload

[1] a programming language compiler/runtime has a lot of freedom with them
[2] Arrow tagged unions only support 127 branches because 8-bit signed
integers are used for the tag
[3] in general, to store values of a type T all you need is a storage
system that can store values of a super-type of T. To guarantee you
don't read values that don't belong to T, you have to constrain/check
the values when you store them. No storage type system can ever be as
rich as the arbitrary checks you can impose before you store the data.
[4] turning `a|b|c` into `a option * b option * c option` is
advantageous because every type `a` can very cheaply become `a option`
by having a validity bitmap in buffers [0]. These continuous bitmaps
allow skipping nulls very quickly with the use of bit-counting
instructions. To skip all the values of a certain type-tag in an union
array, the full 8 bit tag has to be checked value by value.

On Tue, Apr 2, 2024 at 10:29 AM Finn Völkel <f...@juxt.pro> wrote:
>
> @weston I think my mentioning of ADT was a mistake. I am just thinking of
> sum types (https://en.wikipedia.org/wiki/Tagged_union) which I should have
> just called differently.
> You are thinking of a product type which is better represented by a
> StructVector with nullable child vectors.
>
> @antoine Thanks for the clarification.
>
> On Tue, 2 Apr 2024 at 14:47, Weston Pace <weston.p...@gmail.com> wrote:
>
> > Wouldn't support for ADT require expressing more than 1 type id per
> > record?  In other words, if `put` has type id 1, `delete` has type id 2,
> > and `erase` has type id 3 then there is no way to express something is (for
> > example) both type id 1 and type id 3 because you can only have one type id
> > per record.
> >
> > If that understanding is correct then it seems you can always encode world
> > 2 into world 1 by exhaustively listing out the combinations.  In other
> > words, `put` is the LSB, `delete` is bit 2, and `erase` is bit 3 and you
> > have:
> >
> > 7 - put/delete/erase
> > 6 - delete/erase
> > 5 - erase/put
> > 4 - erase
> > 3 - put/delete
> > 2 - delete
> > 1 - put
> >
> > On Tue, Apr 2, 2024 at 4:36 AM Finn Völkel <f...@juxt.pro> wrote:
> >
> > > I also meant Algebraic Data Type not Abstract Data Type (too many
> > > acronymns).
> > >
> > > On Tue, 2 Apr 2024 at 13:28, Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > >
> > > > Thanks. The Arrow spec does support multiple union members with the
> > same
> > > > type, but not all implementations do. The C++ implementation should
> > > > support it, though to my surprise we do not seem to have any tests for
> > > it.
> > > >
> > > > If the Java implementation doesn't, then you can probably open an issue
> > > > for it (and even submit a PR if you would like to tackle it).
> > > >
> > > > I've also opened https://github.com/apache/arrow/issues/40947 to
> > create
> > > > integration tests for this.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > Le 02/04/2024 à 13:19, Finn Völkel a écrit :
> > > > >> Can you explain what ADT means ?
> > > > >
> > > > > Sorry about that. ADT stands for Abstract Data Type. What do I mean
> > by
> > > an
> > > > > ADT style vector?
> > > > >
> > > > > Let's take an example from the project I am on. We have an `op` union
> > > > > vector with three child vectors `put`, `delete`, `erase`. `delete`
> > and
> > > > > `erase` have the same type but represent different things.
> > > > >
> > > > > On Tue, 2 Apr 2024 at 13:16, Steve Kim <chairm...@gmail.com> wrote:
> > > > >
> > > > >> Thank you for asking this question. I have the same question.
> > > > >>
> > > > >> I noted a similar problem in the c++/python implementation:
> > > > >>
> > https://github.com/apache/arrow/issues/19157#issuecomment-1528037394
> > > > >>
> > > > >> On Tue, Apr 2, 2024, 04:30 Finn Völkel <f...@juxt.pro> wrote:
> > > > >>
> > > > >>> Hi,
> > > > >>>
> > > > >>> my question primarily concerns the union layout described at
> > > > >>> https://arrow.apache.org/docs/format/Columnar.html#union-layout
> > > > >>>
> > > > >>> There are two ways to use unions:
> > > > >>>
> > > > >>>     - polymorphic vectors (world 1)
> > > > >>>     - ADT style vectors (world 2)
> > > > >>>
> > > > >>> In world 1 you have a vector that stores different types. In the
> > ADT
> > > > >> world
> > > > >>> you could have multiple child vectors with the same type but
> > > different
> > > > >> type
> > > > >>> ids in the union type vector. The difference is apparent if you
> > want
> > > to
> > > > >> use
> > > > >>> two BigIntVectors as children which doesn't exist in world 1.
> > World 1
> > > > is
> > > > >> a
> > > > >>> subset of world 2.
> > > > >>>
> > > > >>> The spec (to my understanding) doesn’t explicitly forbid world 2,
> > but
> > > > the
> > > > >>> implementation we have been using (Java) has been making the
> > > assumption
> > > > >> of
> > > > >>> being in world 1 (a union only having ONE child of each type). We
> > > > >> sometimes
> > > > >>> use union in the ADT style which has led to problems down the road.
> > > > >>>
> > > > >>> Could someone clarify what the specification allows and what it
> > > doesn’t
> > > > >>> allow? Could we tighten the specification after that clarification?
> > > > >>>
> > > > >>> Best, Finn
> > > > >>>
> > > > >>
> > > > >
> > > >
> > >
> >

Re: [Format][Union] polymorphic vectors vs ADT style vectors

Reply via email to