Re: [DISCUSS] Arrow array representation of statistics

Felipe Oliveira Carvalho Tue, 17 Dec 2024 04:37:35 -0800

> I think it's fair not to mention any other Arrow-like transport mechanism
> since the benefits of transporting the statistics as an Arrow array are
> less clear right now.


Once we (or applications) start working with more advanced statistics,
such as compressed histograms and sketch data structures, the benefits of
using a columnar format (Arrow itself!) to represent the data will become
clearer.

--
Felipe

On Fri, Dec 13, 2024 at 11:52 PM Dewey Dunnington <
[email protected]> wrote:

> First of all, thank you for driving this proposal! I don't think there's
> anything particularly bad or wrong about mentioning the C data interface in
> the title of the document...my initial comment was mostly a reaction to the
> fact that most of the content of the proposal is describing this schema,
> which I think is a reflection that the schema itself is an extremely
> valuable (and orthogonal) thing to agree on!
>
> > Or in other words, we can just say, "this is the canonical schema to
> > represent statistics about an Arrow dataset as Arrow data", without
> > defining anything about how or where to use it.
>
> +1!
>
> > I think it's still useful to have context/examples of why we are
> > motivated to define this (and note that these are examples only), which
> > may use C Data Interface or something as an example
>
> +1! I think it's OK to use a C Data interface example as the only example
> (as a reflection that it's the thing that makes the most sense):
>
> A file reader or table-like object providing an API using the Arrow C
> Stream interface (e.g., `MyObjectExportAsArrayStream(MyObject* object,
> ArrowArrayStream* out)`) may wish to also provide a parallel function
> (e.g., `MyObjectGetStatistics(MyObject* object, ArrowArrayStream* out)`)
> allowing the caller to inspect the properties of the object to perform the
> appropriate query planning or memory allocation.
>
> I think it's fair not to mention any other Arrow-like transport mechanism
> since the benefits of transporting the statistics as an Arrow array are
> less clear right now.
>
> Cheers,
>
> -dewey
>
> On Wed, Dec 11, 2024 at 7:48 PM David Li <[email protected]> wrote:
>
> > I think the feedback is more along the lines of: we can just standardize
> > a representation of statistics, without referencing where it's used (C
> > Data Interface or otherwise). So people are free to use it wherever they
> > want, whether C Data Interface or IPC or somewhere else. At the same
> > time, we are not saying that we are going to (for example) embed this
> > into IPC files as the official way to have Parquet-like statistics. It is
> > simply an agreed upon schema for interoperability, and the details of how
> > it is passed around are up to the application. (At least for now.)
> >
> > Or in other words, we can just say, "this is the canonical schema to
> > represent statistics about an Arrow dataset as Arrow data", without
> > defining anything about how or where to use it.
> >
> > I think it's still useful to have context/examples of why we are
> > motivated to define this (and note that these are examples only), which
> > may use C Data Interface or something as an example, but others may
> > disagree.
> >
> > On Thu, Dec 12, 2024, at 10:27, Sutou Kouhei wrote:
> > > Hi,
> > >
> > > I want to discuss the Arrow array representation of
> > > statistics and the contexts in which it can be used.
> > >
> > > Background:
> > >
> > > We discussed how to pass statistics through the C data
> > > interface:
> > >
> > > * [DISCUSS] Statistics through the C data interface
> > >   https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx
> > > * [VOTE] Statistics through the C data interface
> > >   https://lists.apache.org/thread/rsw3wsyj68dksc98s5rpdp6dn8hfk0yd
> > > * GH-38837: [Format] Add the specification to pass
> > >   statistics through the Arrow C data interface
> > >   https://github.com/apache/arrow/pull/43553
> > >
> > > The latest proposal is that we standardize the schema for
> > > an Arrow array that represents statistics. See the above PR
> > > for details.
> > >
> > > I think that the proposed approach is the best approach for
> > > the C data interface. But I'm not sure whether the approach
> > > is the best approach for other contexts such as IPC format,
> > > Flight, ADBC and so on. So the latest proposal limits its
> > > target to only the C data interface.
> > >
> > > But there are comments asking whether we can standardize
> > > this approach for all contexts, including the C data
> > > interface. I want to discuss this in this thread.
> > >
> > > Here are related comments so far:
> > >
> > > * https://github.com/apache/arrow/pull/43553/files#r1871749972
> > > * https://github.com/apache/arrow/pull/43553/files#r1704373291
> > > * https://github.com/apache/arrow/pull/43553/files#r1871757604
> > >
> > >
> > > Could you share your opinions?
> > >
> > >
> > > If we can remove the limitation to the C data interface
> > > only, I'll open a new PR for it.
> > >
> > >
> > > Thanks,
> > > --
> > > kou
> >
>
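
For reference, the parallel-function pattern from Dewey's message above
could be sketched in C roughly as follows. This is only an illustration
under stated assumptions: `MyObject` and its `row_count` field are
hypothetical, `CountStatisticsBatches` is a helper made up for this sketch,
and the stub exports an empty struct schema with no batches rather than the
statistics schema from the PR. The three struct definitions follow the
Arrow C data interface and C stream interface specifications.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* ABI structs as defined by the Arrow C data / C stream interfaces. */
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};

/* Hypothetical object; a real one would wrap a file reader or table. */
typedef struct MyObject { int64_t row_count; } MyObject;

static void StubSchemaRelease(struct ArrowSchema* s) { s->release = NULL; }

static int StubGetSchema(struct ArrowArrayStream* stream, struct ArrowSchema* out) {
  (void)stream;
  memset(out, 0, sizeof(*out));
  out->format = "+s"; /* assumption: empty struct stands in for the real schema */
  out->name = "";
  out->release = StubSchemaRelease;
  return 0;
}

static int StubGetNext(struct ArrowArrayStream* stream, struct ArrowArray* out) {
  (void)stream;
  memset(out, 0, sizeof(*out));
  out->release = NULL; /* a released array marks the end of the stream */
  return 0;
}

static const char* StubGetLastError(struct ArrowArrayStream* stream) {
  (void)stream;
  return NULL;
}

static void StubStreamRelease(struct ArrowArrayStream* s) { s->release = NULL; }

/* Parallel to MyObjectExportAsArrayStream(): exports statistics, not data. */
int MyObjectGetStatistics(MyObject* object, struct ArrowArrayStream* out) {
  assert(object != NULL && out != NULL);
  out->get_schema = StubGetSchema;
  out->get_next = StubGetNext;
  out->get_last_error = StubGetLastError;
  out->release = StubStreamRelease;
  out->private_data = object;
  return 0;
}

/* Hypothetical caller: drains the statistics stream and counts its batches. */
int CountStatisticsBatches(MyObject* object, int64_t* n_batches) {
  struct ArrowArrayStream stream;
  struct ArrowSchema schema;
  int rc = MyObjectGetStatistics(object, &stream);
  if (rc != 0) return rc;
  rc = stream.get_schema(&stream, &schema);
  if (rc != 0) { stream.release(&stream); return rc; }
  *n_batches = 0;
  for (;;) {
    struct ArrowArray batch;
    rc = stream.get_next(&stream, &batch);
    if (rc != 0 || batch.release == NULL) break; /* error or end of stream */
    ++*n_batches;
    batch.release(&batch);
  }
  schema.release(&schema);
  stream.release(&stream);
  return rc;
}
```

A query planner would replace the counting loop with code that reads the
statistics out of each batch before deciding how to scan or how much memory
to allocate. Nothing here depends on the C stream interface beyond the
transport, which matches the point made in the thread that the schema is
separate from how it is passed around.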
