Re: [DISCUSSION] New Flags for Arrow C Interface Schema

Antoine Pitrou Tue, 14 May 2024 09:16:08 -0700

I think these flags should be advisory and consumers should be free toignore them. However, some consumers apparently would benefit from themto more faithfully represent the producer's intention.

For example, in Arrow C++, we could perhaps have a ImportDatum functionwhose actual return type would depend on which flags are set (though I'mnot sure what the default behavior, in the absence of any flags, should be).


Regards

Antoine.


Le 25/04/2024 à 06:54, Weston Pace a écrit :

What should be done if a system doesn't have a record batch concept?  For
example, if I remember correctly, Velox works this way and only has a "row
vector" (struct array) but no equivalent to record batch.  Should these
systems reject a record batch or should they just accept it as a struct
array?

What about ArrowArrayStream?  Must it always return "record batch" or can
it return single columns? Should the stream be homogenous or is it valid if
some arrays are single columns and some are record batches?

If a scalar comes across on its own then what is the length of the scalar?
I think this might be a reason to prefer something like REE for scalars
since the length would be encoded along with the scalar.


On Wed, Apr 24, 2024 at 6:21 PM Keith Kraus <[email protected]>
wrote:

I believe several array implementations (e.g., numpy, R) are able to

broadcast/recycle a length-1 array. Run-end-encoding is also an option that
would make that broadcast explicit without expanding the scalar.

Some libraries behave this way, i.e. Polars, but others like Pandas and
cuDF only broadcast up dimensions. I.E. scalars can be broadcast across
columns or dataframes, columns can be broadcast across dataframes, but
length 1 columns do not broadcast across columns where trying to add say a
length 5 and length 1 column isn't valid but adding a length 5 column and a
scalar is. Additionally, it differentiates between operations that are
guaranteed to return a scalar, i.e. something like a reduction of `sum()`
versus operations that can return a length 1 column depending on the data,
i.e. `unique()`.

For UDFs: UDFs are a system-specific interface. Presumably, that

interface can encode whether an Arrow array is meant to represent a column
or scalar (or record batch or ...). Again, because Arrow doesn't define
scalars (for now...) or UDFs, the UDF interface needs to layer its own
semantics on top of Arrow.


In other words, I don't think the C Data Interface was meant to be

something where you can expect to _only_ pass the ArrowDeviceArray around
and have it encode all the semantics for a particular system, right? The
UDF example is something where the engine would pass an ArrowDeviceArray
plus additional context.

There's a growing trend in execution engines supporting UDFs of Arrow in
and Arrow out, DuckDB, PySpark, DataFusion, etc. Many of them have
different options of passing in RecordBatches vs Arrays where they
currently rely on the Arrow library containers in order to differentiate
them.

Additionally, libcudf has some generic functions that currently use Arrow
C++ containers (
https://docs.rapids.ai/api/cudf/stable/libcudf_docs/api_docs/interop_arrow/
)
for differentiating between RecordBatches, Arrays, and Scalars which could
be moved to using the C Data Interfaces, Polars has similar (
https://docs.pola.rs/py-polars/html/reference/api/polars.from_arrow.html)
that currently uses PyArrow containers, and you could imagine other
DataFrame libraries having similar.

Ultimately, there's a desire to be able to move Arrow data between
different libraries, applications, frameworks, etc. and given Arrow
implementations like C++, Rust, and Go have containers for RecordBatches,
Arrays, and Scalars respectively, things have been built around and
differentiated around the concepts. Maybe trying to differentiate this
information at runtime isn't the correct path, but I believe there's a
demonstrated desire for being able to differentiate things in a library
agnostic way.

On Tue, Apr 23, 2024 at 8:37 PM David Li <[email protected]> wrote:

For scalars: Arrow doesn't define scalars. They're an implementation
concept. (They may be a *useful* one, but if we want to define them more
generally, that's a separate discussion.)

For UDFs: UDFs are a system-specific interface. Presumably, that

interface

can encode whether an Arrow array is meant to represent a column or

scalar

(or record batch or ...). Again, because Arrow doesn't define scalars

(for

now...) or UDFs, the UDF interface needs to layer its own semantics on

top

of Arrow.

In other words, I don't think the C Data Interface was meant to be
something where you can expect to _only_ pass the ArrowDeviceArray around
and have it encode all the semantics for a particular system, right? The
UDF example is something where the engine would pass an ArrowDeviceArray
plus additional context.

since we can't determine which a given ArrowArray is on its own. In the
libcudf situation, it came up with what happens if you pass a

non-struct

column to the from_arrow_device method which returns a cudf::table?

Should

it error, or should it create a table with a single column?


Presumably it should just error? I can see this being ambiguous if there
were an API that dynamically returned either a table or a column based on
the input shape (where before it would be less ambiguous since you'd
explicitly pass pa.RecordBatch or pa.Array, and now it would be ambiguous
since you only pass ArrowDeviceArray). But it doesn't sound like that's

the

case?

On Tue, Apr 23, 2024, at 11:15, Weston Pace wrote:

I tend to agree with Dewey.  Using run-end-encoding to represent a

scalar

is clever and would keep the c data interface more compact.  Also, a

struct

array is a superset of a record batch (assuming the metadata is kept in

the

schema).  Consumers should always be able to deserialize into a struct
array and then downcast to a record batch if that is what they want to

do

(raising an error if there happen to be nulls).

Depending on the function in question, it could be valid to pass a

struct

column vs a record batch with different results.


Are there any concrete examples where this is the case?  The closest
example I can think of is something like the `drop_nulls` function,

which,

given a record batch, would choose to drop rows where any column is

null

and, given an array, only drops rows where the top-level struct is

null.

However, it might be clearer to just give the two functions different

names

anyways.

On Mon, Apr 22, 2024 at 1:01 PM Dewey Dunnington
<[email protected]> wrote:

Thank you for the background!

I still wonder if these distinctions are the responsibility of the
ArrowSchema to communicate (although perhaps links to the specific
discussions would help highlight use-cases that I am not envisioning).
I think these distinctions are definitely important in the contexts
you mentioned; however, I am not sure that the FFI layer is going to
be helpful.

In the libcudf situation, it came up with what happens if you pass a

non-struct

column to the from_arrow_device method which returns a cudf::table?

Should

it error, or should it create a table with a single column?


I suppose that I would have expected two functions (one to create a
table and one to create a column). As a consumer I can't envision a
situation where I would want to import an ArrowDeviceArray but where I
would want some piece of run-time information to decide what the
return type of the function would be? (With apologies if I am missing
a piece of the discussion).

If A and B have different lengths, this is invalid


I believe several array implementations (e.g., numpy, R) are able to
broadcast/recycle a length-1 array. Run-end-encoding is also an option
that would make that broadcast explicit without expanding the scalar.

Depending on the function in question, it could be valid to pass a

struct column vs a record batch with different results.

If this is an important distinction for an FFI signature of a UDF,
there would probably be a struct definition for the UDF where there
would be an opportunity to make this distinction (and perhaps others
that are relevant) without loading this concept onto the existing
structs.

If no flags are set, then the behavior shouldn't change
from what it is now. If the ARROW_FLAG_RECORD_BATCH flag is set,

then

it

should error unless calling ImportRecordBatch.


I am not sure I would have expected that (since a struct array has an
unambiguous interpretation as a record batch and as a user I've very
explicitly decided that I want one, since I'm using that function).

In the other direction, I am not sure a producer would be able to set
these flags without breaking backwards compatibility with earlier
producers that did not set them (since earlier threads have suggested
that it is good practice to error when an unsupported flag is
encountered).

On Sun, Apr 21, 2024 at 6:16 PM Matt Topol <[email protected]>

wrote:


First, I forgot a flag in my examples. There should also be an
ARROW_FLAG_SCALAR too!

The motivation for this distinction came up from discussions during

adding

support for ArrowDeviceArray to libcudf in order to better indicate

the

difference between a cudf::table and a cudf::column which are

handled

quite

differently. This also relates to the fact that we currently need

external

context like the explicit ImportArray() and ImportRecordBatch()

functions

since we can't determine which a given ArrowArray is on its own. In

the

libcudf situation, it came up with what happens if you pass a

non-struct

column to the from_arrow_device method which returns a cudf::table?

Should

it error, or should it create a table with a single column?

The other motivation for this distinction is with UDFs in an engine

that

uses the C data interface. When dealing with queries and engines, it
becomes important to be able to distinguish between a record batch,

column and a scalar. For example, take the expression A + B:

If A and B have different lengths, this is invalid..... unless one

of

them

is a Scalar. This is because Scalars are broadcastable, columns are

not.


Depending on the function in question, it could be valid to pass a

struct

column vs a record batch with different results. It also resolves

some

ambiguity for UDFs and processing. For instance, given a single

ArrowArray

of length 1, which is a struct: Is that a Struct Column? A Record

Batch?

or

is it a scalar? There's no way to know what the producer's intention

was

or

the context without these flags or having to side-channel the

information

somehow.

It seems like it may cause some ambiguous

situations...should C++'s ImportArray() error, for example, if the
schema has a ARROW_FLAG_RECORD_BATCH flag?

I would argue yes. If no flags are set, then the behavior shouldn't

change

from what it is now. If the ARROW_FLAG_RECORD_BATCH flag is set,

then

it

should error unless calling ImportRecordBatch. It allows the

producer

to

provide context as to the source and intention of the structure of

the

data.


--Matt

On Fri, Apr 19, 2024 at 8:24 PM Dewey Dunnington
<[email protected]> wrote:

Thanks for bringing this up!

Could you share the motivation where this distinction is important

in

the context of transfer across the C data interface? The "struct

==

record batch" concept has always made sense to me because in R, a
data.frame can have a column that is also a data.frame and there

is

no

distinction between the two. It seems like it may cause some

ambiguous

situations...should C++'s ImportArray() error, for example, if the
schema has a ARROW_FLAG_RECORD_BATCH flag?

Cheers,

-dewey

On Fri, Apr 19, 2024 at 6:34 PM Matt Topol <

[email protected]>

wrote:


Hey everyone,

With some of the other developments surrounding libraries

adopting

the

Arrow C Data interfaces, there's been a consistent question

about

handling

tables (record batch) vs columns vs scalars.

Right now, a Record Batch is sent through the C interface as a

struct

column whose children are the individual columns of the batch

and

Scalar

would be sent through as just an array of length 1. Applications

would

have

to create their own contextual way of indicating whether the

Array

being

passed should be interpreted as just a single array/column or

should

be

treated as a full table/record batch.

Rather than introducing new members or otherwise complicating

the

structs,

I wanted to gauge how people felt about introducing new flags

for

the

ArrowSchema object.

Right now, we only have 3 defined flags:

ARROW_FLAG_DICTIONARY_ORDERED
ARROW_FLAG_NULLABLE
ARROW_FLAG_MAP_KEYS_SORTED

The flags member of the struct is an int64, so we have another

bits to

play with! If no one has any strong objections, I wanted to

propose

adding

at least 2 new flags:

ARROW_FLAG_RECORD_BATCH
ARROW_FLAG_SINGLE_COLUMN

If neither flag is set, then it is contextual as to whether it

should be

expected that the corresponding data is a table or a single

column.

If

ARROW_FLAG_RECORD_BATCH is set, then the corresponding data MUST

be a

struct array and should be interpreted as a record batch by any

consumers

(erroring otherwise). If ARROW_FLAG_SINGLE_COLUMN is set, then

the

corresponding ArrowArray should be interpreted and utilized as a

single

array/column regardless of its type.

This provides a standardized way for producers of Arrow data to

indicate

in

the schema to consumers how the data they produced should be

used

(as a

table or column) rather than forcing everyone to come up with

their

own

contextualized way of handling things (extra arguments,

differently

named

functions for RecordBatch / Array, etc.).

If there's no objections to this, I'll take a pass at

implementing

these

flags in C++ and Go to put up a PR and make a Vote thread. I

just

wanted

to

see what others on the mailing list thought before I go ahead

and

put

effort into this.

Thanks everyone! Take care!

--Matt

Re: [DISCUSSION] New Flags for Arrow C Interface Schema

Reply via email to