First of all, thank you for driving this proposal! I don't think there's
anything particularly bad or wrong about mentioning the C data interface in
the title of the document...my initial comment was mostly a reaction to the
fact that most of the content of the proposal is describing this schema,
which I think is a reflection that the schema itself is an extremely
valuable (and orthogonal) thing to agree on!

> Or in other words, we can just say, "this is the canonical schema to
> represent statistics about an Arrow dataset as Arrow data", without
> defining anything about how or where to use it.

+1!

> I think it's still useful to have context/examples of why we are
> motivated to define this (and note that these are examples only), which may
> use C Data Interface or something as an example

+1! I think it's OK to use a C Data Interface example as the only example
(as a reflection that it's the use case that makes the most sense):

A file reader or table-like object providing an API using the Arrow C
Stream interface (e.g., `MyObjectExportAsArrayStream(MyObject* object,
ArrowArrayStream* out)`) may wish to also provide a parallel function
(e.g., `MyObjectGetStatistics(MyObject* object, ArrowArrayStream* out)`)
allowing the caller to inspect the properties of the object to perform the
appropriate query planning or memory allocation.
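
A rough sketch of what that pair might look like from C, to make the shape
of the API concrete. The three structs are the ones defined by the Arrow C
data and C stream interfaces; everything else (`MyObject`, the `Stats*`
callbacks, `CountStatisticsBatches()`) is hypothetical and only for
illustration. The stub producer yields zero batches and does not export
the statistics schema, which a real implementation would take from the
proposal:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Structs from the Arrow C data interface and C stream interface. */
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};

/* Hypothetical object; stands in for a file reader or table-like thing. */
typedef struct MyObject { int64_t n_rows; } MyObject;

/* Stub callbacks: a valid stream that yields no statistics batches. */
static int StatsGetSchema(struct ArrowArrayStream* s, struct ArrowSchema* out) {
  (void)s; (void)out;
  return ENOSYS; /* a real producer would export the agreed schema here */
}
static int StatsGetNext(struct ArrowArrayStream* s, struct ArrowArray* out) {
  (void)s;
  out->release = NULL; /* a released array signals end of stream */
  return 0;
}
static const char* StatsGetLastError(struct ArrowArrayStream* s) {
  (void)s;
  return "schema export is not implemented in this sketch";
}
static void StatsRelease(struct ArrowArrayStream* s) {
  s->release = NULL; /* mark the stream as released */
}

/* Parallel to MyObjectExportAsArrayStream(): same calling convention. */
int MyObjectGetStatistics(MyObject* object, struct ArrowArrayStream* out) {
  if (object == NULL || out == NULL) return EINVAL;
  out->get_schema = StatsGetSchema;
  out->get_next = StatsGetNext;
  out->get_last_error = StatsGetLastError;
  out->release = StatsRelease;
  out->private_data = object;
  return 0;
}

/* Consumer side: drain the statistics stream and count its batches. */
int CountStatisticsBatches(MyObject* object, int64_t* n_batches) {
  struct ArrowArrayStream stream;
  int status = MyObjectGetStatistics(object, &stream);
  if (status != 0) return status;
  *n_batches = 0;
  for (;;) {
    struct ArrowArray batch;
    status = stream.get_next(&stream, &batch);
    if (status != 0 || batch.release == NULL) break;
    ++*n_batches;
    batch.release(&batch); /* the consumer owns each batch it receives */
  }
  stream.release(&stream);
  assert(stream.release == NULL);
  return status;
}
```

The point is only that a caller who already knows how to consume
`MyObjectExportAsArrayStream()` needs no new machinery to consume the
statistics.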

I think it's fair not to mention any other Arrow-like transport mechanism
since the benefits of transporting the statistics as an Arrow array are
less clear right now.

Cheers,

-dewey

On Wed, Dec 11, 2024 at 7:48 PM David Li <lidav...@apache.org> wrote:

> I think the feedback is more along the lines of: we can just standardize a
> representation of statistics, without referencing where it's used (C Data
> Interface or otherwise). So people are free to use it wherever they want,
> whether C Data Interface or IPC or somewhere else. At the same time, we are
> not saying that we are going to (for example) embed this into IPC files as
> the official way to have Parquet-like statistics. It is simply an agreed
> upon schema for interoperability, and the details of how it is passed
> around are up to the application. (At least for now.)
>
> Or in other words, we can just say, "this is the canonical schema to
> represent statistics about an Arrow dataset as Arrow data", without
> defining anything about how or where to use it.
>
> I think it's still useful to have context/examples of why we are motivated
> to define this (and note that these are examples only), which may use C
> Data Interface or something as an example, but others may disagree.
>
> On Thu, Dec 12, 2024, at 10:27, Sutou Kouhei wrote:
> > Hi,
> >
> > I want to discuss the Arrow array representation of
> > statistics and the contexts in which it can be used.
> >
> > Background:
> >
> > We discussed how to pass statistics through the C data
> > interface:
> >
> > * [DISCUSS] Statistics through the C data interface
> >   https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx
> > * [VOTE] Statistics through the C data interface
> >   https://lists.apache.org/thread/rsw3wsyj68dksc98s5rpdp6dn8hfk0yd
> > * GH-38837: [Format] Add the specification to pass
> >   statistics through the Arrow C data interface
> >   https://github.com/apache/arrow/pull/43553
> >
> > The latest proposal is that we standardize a schema for an
> > Arrow array that represents statistics. See the above PR for
> > details.
> >
> > I think that the proposed approach is the best approach for
> > the C data interface. But I'm not sure whether the approach
> > is the best approach for other contexts such as IPC format,
> > Flight, ADBC and so on. So the latest proposal limits its
> > target to only the C data interface.
> >
> > But there are comments asking whether we can standardize
> > this approach for all contexts, including the C data
> > interface. I want to discuss this in this thread.
> >
> > Here are related comments so far:
> >
> > * https://github.com/apache/arrow/pull/43553/files#r1871749972
> > * https://github.com/apache/arrow/pull/43553/files#r1704373291
> > * https://github.com/apache/arrow/pull/43553/files#r1871757604
> >
> >
> > Could you share your opinions?
> >
> >
> > If we can remove the limitation to the C data interface,
> > I'll open a new PR for it.
> >
> >
> > Thanks,
> > --
> > kou
>
