Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?

Andrew Lamb Tue, 11 Jun 2024 02:59:16 -0700

> Interesting. A separated but related discussion[10] will use
a record batch or a map array for statistics and it includes
multiple statistic items. But arrow-rs (DataFusion) uses an
Arrow array per statistic item.


The reason for using an array per item (rather than a RecordBatch) is so
the statistics can be extracted only when needed -- as in if only mins are
needed, there is no reason to also extract maxes and null counts. I don't
know how important this is in practice

> It seems that datafunsion::common::Statistics[2] (this is a
higher level statistics object, right?) doesn't use Arrow
arrays.

That is correct

> When extracted Parquet statistics are converted to
datafunsion::common::Statistics from Arrow arrays in
DataFusion? (It's WIP?)

It is currently done in [1], though I expect this code will improve over
time

Andrew

[1]:
https://github.com/apache/datafusion/blob/47026a2a3dd41a5c87e44ade58d91a89feba147b/datafusion/core/src/datasource/file_format/parquet.rs#L458-L536



On Tue, Jun 11, 2024 at 3:49 AM Sutou Kouhei <[email protected]> wrote:

> Hi,
>
> Thanks for sharing arrow-rs related information!
>
> > 2. Code to extract parquet statistics as Arrow arrays[3] (this is a WIP
> but
> > I plan to propose upstreaming[4] to arrow-rs when complete)
>
> Interesting. A separated but related discussion[10] will use
> a record batch or a map array for statistics and it includes
> multiple statistic items. But arrow-rs (DataFusion) uses an
> Arrow array per statistic item.
>
> It seems that datafunsion::common::Statistics[2] (this is a
> higher level statistics object, right?) doesn't use Arrow
> arrays. When extracted Parquet statistics are converted to
> datafunsion::common::Statistics from Arrow arrays in
> DataFusion? (It's WIP?)
>
>
> [10] https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx
>
>
> Thanks,
> --
> kou
>
> In <CAFhtnRwgixMc_vFmogNZ7V=46mjg53md+dsftcw_c5qvyhs...@mail.gmail.com>
>   "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Mon, 10
> Jun 2024 11:26:23 -0400,
>   Andrew Lamb <[email protected]> wrote:
>
> >> (Does arrow-rs compute statistics from in-memory Arrow array?)
> >
> > Not really, though there are kernels[1] to do so for some types
> >
> > We have two related concepts in the Rust ecosystem:
> > 1. Full on statistics in DataFusion [2] (though no great way at the
> moment)
> > 2. Code to extract parquet statistics as Arrow arrays[3] (this is a WIP
> but
> > I plan to propose upstreaming[4] to arrow-rs when complete)
> >
> > I think that  code to extract statistics from Parquet files as arrow
> arrays
> > is a very important feature (and lets query engines do row group and data
> > page prunng).
> >
> > The value of a  higher level Statistics object is a little less clear to
> me
> > -- query engines end up with all sorts of complicated calculations on
> such
> > objects (like predicate selectivity, NDV estimation, boundary analysis,
> > etc) that finding what level makes sense in arrow might be hard.
> >
> > Andrew
> >
> > [1]: https://docs.rs/arrow/latest/arrow/compute/fn.min.html
> > [2]:
> >
> https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html
> > [3]:
> >
> https://github.com/apache/datafusion/blob/e094f94d2a3f23128ce782a20982dbf7ac1ebed2/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L579
> > [4]: https://github.com/apache/arrow-rs/issues/4328
> >
> > On Sun, Jun 9, 2024 at 3:40 AM Sutou Kouhei <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> Thanks for your comment.
> >>
> >> You may misunderstand my motivation.
> >>
> >> This proposal doesn't change the Apache Arrow columnar
> >> format. For example, this proposal doesn't save statistics
> >> read from Apache Parquet file to Apache Arrow IPC file. This
> >> proposal just attaches statistics read from Apache Parquet
> >> file to in-memory arrow::Array C++ objects. It's just for
> >> easy to use in-memory arrow::Array C++ objects.
> >>
> >> This proposal doesn't compute statistics from in-memory
> >> arrow::Array C++ objects. (We may want to do it later but
> >> this proposal doesn't propose it.)
> >>
> >> (Does arrow-rs compute statistics from in-memory Arrow
> >> array?)
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In <CAOYPqDBM0ocns5=t6anzg-bqwmgkervhw_5ru4qomewqtaq...@mail.gmail.com>
> >>   "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Thu,
> 6
> >> Jun 2024 08:13:11 +0200,
> >>   Jorge Cardoso Leitão <[email protected]> wrote:
> >>
> >> > Hi
> >> >
> >> > This is c++ specific, but imo the question applies more broadly.
> >> >
> >> > I understood that the rationale for stats in compressed+encoded
> formats
> >> > like parquet is that computing those stats has a high cost (io +
> >> decompress
> >> > + decode + aggregate). This motivates the materialization of
> aggregates.
> >> >
> >> > In arrow the data is already in an in-memory format (e.g. IPC+mmap,
> or in
> >> > the heap) and the cost is thus smaller (aggregate).
> >> >
> >> > It could be useful to quantify how much is being saved vs how much
> >> > complexity is being added to the format + implementations.
> >> >
> >> > Best,
> >> > Jorge
> >> >
> >> >
> >> > On Thu, Jun 6, 2024, 07:55 Micah Kornfield <[email protected]>
> >> wrote:
> >> >
> >> >> Generally I think this is a good idea that has been proposed before
> but
> >> I
> >> >> don't think we could ever make progress on design.
> >> >>
> >> >> On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <[email protected]>
> wrote:
> >> >>
> >> >> > Hi,
> >> >> >
> >> >> > Related GitHub issue:
> >> >> > https://github.com/apache/arrow/issues/41909
> >> >> >
> >> >> > How about adding arrow::ArrayStatistics?
> >> >> >
> >> >> > Motivation:
> >> >> >
> >> >> > An Apache Arrow format data doesn't have statistics. (We can
> >> >> > add statistics as metadata but there isn't any standard way
> >> >> > for it.)
> >> >> >
> >> >> > But a source of an Apache Arrow format data such as Apache
> >> >> > Parquet format data may have statistics. We can get the
> >> >> > source statistics via source reader such as
> >> >> > parquet::ColumnChunkMetaData::statistics() but can't get
> >> >> > them from read Apache Arrow format data. If we want to use
> >> >> > the source statistics, we need to keep the source reader.
> >> >> >
> >> >> > Proposal:
> >> >> >
> >> >> > How about adding arrow::ArrayStatistics or something and
> >> >> > attaching source statistics to read arrow::Array? If source
> >> >> > statistics are attached to read arrow::Array, we don't need
> >> >> > to keep a source reader to get source statistics.
> >> >> >
> >> >> > What do you think about this idea?
> >> >> >
> >> >> >
> >> >> > NOTE: I haven't thought about the arrow::ArrayStatistics
> >> >> > details yet. We'll be able to use parquet::Statistics and
> >> >> > its family as a reference.
> >> >> >
> >> https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h
> >> >> >
> >> >> >
> >> >> > Thanks,
> >> >> > --
> >> >> > kou
> >> >> >
> >> >>
> >>
>

Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?

Reply via email to