Thank you for the background! I still wonder whether these distinctions are the responsibility of the ArrowSchema to communicate (although perhaps links to the specific discussions would help highlight use cases that I am not envisioning). I think these distinctions are definitely important in the contexts you mentioned; however, I am not sure that the FFI layer is the right place to carry them.
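For anyone who wants the reference handy, this is the part of the C Data Interface we are talking about, abridged from the spec. The proposal would add new bits alongside the three existing flag values of the flags member:

#include <stdint.h>

#define ARROW_FLAG_DICTIONARY_ORDERED 1
#define ARROW_FLAG_NULLABLE 2
#define ARROW_FLAG_MAP_KEYS_SORTED 4

struct ArrowSchema {
  /* Array type description */
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;            /* bitmask; the proposal adds new bits here */
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;

  /* Release callback and producer-specific data */
  void (*release)(struct ArrowSchema*);
  void* private_data;
};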
> In the libcudf situation, it came up with what happens if you pass a
> non-struct column to the from_arrow_device method which returns a
> cudf::table? Should it error, or should it create a table with a single
> column?

I suppose that I would have expected two functions (one to create a table
and one to create a column); a rough sketch of what I mean is at the bottom
of this message. As a consumer, I can't envision a situation where I would
want to import an ArrowDeviceArray but would want some piece of run-time
information to decide what the return type of the function should be.
(With apologies if I am missing a piece of the discussion.)

> If A and B have different lengths, this is invalid

I believe several array implementations (e.g., numpy, R) are able to
broadcast/recycle a length-1 array. Run-end encoding is also an option that
would make that broadcast explicit without expanding the scalar.

> Depending on the function in question, it could be valid to pass a struct
> column vs a record batch with different results.

If this is an important distinction for the FFI signature of a UDF, there
would probably be a struct definition for the UDF where there would be an
opportunity to make this distinction (and perhaps others that are relevant)
without loading this concept onto the existing structs.

> If no flags are set, then the behavior shouldn't change
> from what it is now. If the ARROW_FLAG_RECORD_BATCH flag is set, then it
> should error unless calling ImportRecordBatch.

I am not sure I would have expected that: a struct array has an unambiguous
interpretation as a record batch, and as a user I have already explicitly
decided that I want one by calling that function.

In the other direction, I am not sure a producer would be able to set these
flags without breaking compatibility with existing consumers that do not
recognize them (since earlier threads have suggested that it is good
practice to error when an unsupported flag is encountered).

On Sun, Apr 21, 2024 at 6:16 PM Matt Topol <zotthewiz...@gmail.com> wrote:
>
> First, I forgot a flag in my examples. There should also be an
> ARROW_FLAG_SCALAR too!
>
> The motivation for this distinction came up from discussions during adding
> support for ArrowDeviceArray to libcudf in order to better indicate the
> difference between a cudf::table and a cudf::column which are handled quite
> differently. This also relates to the fact that we currently need external
> context like the explicit ImportArray() and ImportRecordBatch() functions
> since we can't determine which a given ArrowArray is on its own. In the
> libcudf situation, it came up with what happens if you pass a non-struct
> column to the from_arrow_device method which returns a cudf::table? Should
> it error, or should it create a table with a single column?
>
> The other motivation for this distinction is with UDFs in an engine that
> uses the C data interface. When dealing with queries and engines, it
> becomes important to be able to distinguish between a record batch, a
> column and a scalar. For example, take the expression A + B:
>
> If A and B have different lengths, this is invalid... unless one of them
> is a Scalar. This is because Scalars are broadcastable, columns are not.
>
> Depending on the function in question, it could be valid to pass a struct
> column vs a record batch with different results. It also resolves some
> ambiguity for UDFs and processing. For instance, given a single ArrowArray
> of length 1, which is a struct: Is that a Struct Column? A Record Batch? or
> is it a scalar?
> There's no way to know what the producer's intention was or the context
> without these flags or having to side-channel the information somehow.
>
> > It seems like it may cause some ambiguous
> > situations...should C++'s ImportArray() error, for example, if the
> > schema has a ARROW_FLAG_RECORD_BATCH flag?
>
> I would argue yes. If no flags are set, then the behavior shouldn't change
> from what it is now. If the ARROW_FLAG_RECORD_BATCH flag is set, then it
> should error unless calling ImportRecordBatch. It allows the producer to
> provide context as to the source and intention of the structure of the data.
>
> --Matt
>
> On Fri, Apr 19, 2024 at 8:24 PM Dewey Dunnington
> <de...@voltrondata.com.invalid> wrote:
> >
> > Thanks for bringing this up!
> >
> > Could you share the motivation where this distinction is important in
> > the context of transfer across the C data interface? The "struct ==
> > record batch" concept has always made sense to me because in R, a
> > data.frame can have a column that is also a data.frame and there is no
> > distinction between the two. It seems like it may cause some ambiguous
> > situations...should C++'s ImportArray() error, for example, if the
> > schema has a ARROW_FLAG_RECORD_BATCH flag?
> >
> > Cheers,
> >
> > -dewey
> >
> > On Fri, Apr 19, 2024 at 6:34 PM Matt Topol <zotthewiz...@gmail.com> wrote:
> > >
> > > Hey everyone,
> > >
> > > With some of the other developments surrounding libraries adopting the
> > > Arrow C Data interfaces, there's been a consistent question about
> > > handling tables (record batch) vs columns vs scalars.
> > >
> > > Right now, a Record Batch is sent through the C interface as a struct
> > > column whose children are the individual columns of the batch and a
> > > Scalar would be sent through as just an array of length 1. Applications
> > > would have to create their own contextual way of indicating whether the
> > > Array being passed should be interpreted as just a single array/column
> > > or should be treated as a full table/record batch.
> > >
> > > Rather than introducing new members or otherwise complicating the
> > > structs, I wanted to gauge how people felt about introducing new flags
> > > for the ArrowSchema object.
> > >
> > > Right now, we only have 3 defined flags:
> > >
> > > ARROW_FLAG_DICTIONARY_ORDERED
> > > ARROW_FLAG_NULLABLE
> > > ARROW_FLAG_MAP_KEYS_SORTED
> > >
> > > The flags member of the struct is an int64, so we have another 61 bits
> > > to play with! If no one has any strong objections, I wanted to propose
> > > adding at least 2 new flags:
> > >
> > > ARROW_FLAG_RECORD_BATCH
> > > ARROW_FLAG_SINGLE_COLUMN
> > >
> > > If neither flag is set, then it is contextual as to whether it should be
> > > expected that the corresponding data is a table or a single column. If
> > > ARROW_FLAG_RECORD_BATCH is set, then the corresponding data MUST be a
> > > struct array and should be interpreted as a record batch by any
> > > consumers (erroring otherwise). If ARROW_FLAG_SINGLE_COLUMN is set, then
> > > the corresponding ArrowArray should be interpreted and utilized as a
> > > single array/column regardless of its type.
> > >
> > > This provides a standardized way for producers of Arrow data to
> > > indicate in the schema to consumers how the data they produced should
> > > be used (as a table or column) rather than forcing everyone to come up
> > > with their own contextualized way of handling things (extra arguments,
> > > differently named functions for RecordBatch / Array, etc.).
> > >
> > > If there's no objections to this, I'll take a pass at implementing
> > > these flags in C++ and Go to put up a PR and make a Vote thread. I just
> > > wanted to see what others on the mailing list thought before I go ahead
> > > and put effort into this.
> > >
> > > Thanks everyone! Take care!
> > >
> > > --Matt
> >
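Here is the rough sketch I mentioned above for the "two functions" idea. The names and signatures are hypothetical (this is not the actual libcudf API, and it assumes the libcudf headers are available); the point is only that the caller chooses the return type by choosing the function, so no run-time flag on the schema is needed to disambiguate a table from a column:

// Hypothetical declarations only; not actual libcudf API.
#include <memory>

#include <cudf/column/column.hpp>  // assumes the libcudf headers are available
#include <cudf/table/table.hpp>

struct ArrowSchema;       // Arrow C data interface handle (defined elsewhere)
struct ArrowDeviceArray;  // Arrow C device data interface handle

// The caller asks for a table: a non-struct input is simply an error here.
std::unique_ptr<cudf::table> from_arrow_device_to_table(
    const ArrowSchema* schema, const ArrowDeviceArray* input);

// The caller asks for a single column: any type (struct included) is fine.
std::unique_ptr<cudf::column> from_arrow_device_to_column(
    const ArrowSchema* schema, const ArrowDeviceArray* input);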