> Interesting. A separate but related discussion[10] will use
> a record batch or a map array for statistics and it includes
> multiple statistic items. But arrow-rs (DataFusion) uses an
> Arrow array per statistic item.
The reason for using an array per item (rather than a RecordBatch) is so
that each statistic can be extracted only when needed: if only mins are
needed, there is no reason to also extract maxes and null counts. I don't
know how important this is in practice.

> It seems that datafusion::common::Statistics[2] (this is a
> higher level statistics object, right?) doesn't use Arrow
> arrays.

That is correct.

> When are extracted Parquet statistics converted to
> datafusion::common::Statistics from Arrow arrays in
> DataFusion? (It's WIP?)

It is currently done in [1], though I expect this code will improve over
time.

Andrew

[1]: https://github.com/apache/datafusion/blob/47026a2a3dd41a5c87e44ade58d91a89feba147b/datafusion/core/src/datasource/file_format/parquet.rs#L458-L536

On Tue, Jun 11, 2024 at 3:49 AM Sutou Kouhei <k...@clear-code.com> wrote:

> Hi,
>
> Thanks for sharing arrow-rs related information!
>
> > 2. Code to extract parquet statistics as Arrow arrays[3] (this is a WIP but
> > I plan to propose upstreaming[4] to arrow-rs when complete)
>
> Interesting. A separate but related discussion[10] will use
> a record batch or a map array for statistics and it includes
> multiple statistic items. But arrow-rs (DataFusion) uses an
> Arrow array per statistic item.
>
> It seems that datafusion::common::Statistics[2] (this is a
> higher level statistics object, right?) doesn't use Arrow
> arrays. When are extracted Parquet statistics converted to
> datafusion::common::Statistics from Arrow arrays in
> DataFusion? (It's WIP?)
>
>
> [10] https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx
>
>
> Thanks,
> --
> kou
>
> In <CAFhtnRwgixMc_vFmogNZ7V=46mjg53md+dsftcw_c5qvyhs...@mail.gmail.com>
>   "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Mon, 10 Jun 2024 11:26:23 -0400,
>   Andrew Lamb <al...@influxdata.com> wrote:
>
> >> (Does arrow-rs compute statistics from in-memory Arrow array?)
> >
> > Not really, though there are kernels[1] to do so for some types
> >
> > We have two related concepts in the Rust ecosystem:
> > 1. Full on statistics in DataFusion [2] (though no great way at the moment)
> > 2. Code to extract parquet statistics as Arrow arrays[3] (this is a WIP but
> > I plan to propose upstreaming[4] to arrow-rs when complete)
> >
> > I think that code to extract statistics from Parquet files as arrow arrays
> > is a very important feature (and lets query engines do row group and data
> > page pruning).
> >
> > The value of a higher level Statistics object is a little less clear to me
> > -- query engines end up with all sorts of complicated calculations on such
> > objects (like predicate selectivity, NDV estimation, boundary analysis,
> > etc) that finding what level makes sense in arrow might be hard.
> >
> > Andrew
> >
> > [1]: https://docs.rs/arrow/latest/arrow/compute/fn.min.html
> > [2]: https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html
> > [3]: https://github.com/apache/datafusion/blob/e094f94d2a3f23128ce782a20982dbf7ac1ebed2/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L579
> > [4]: https://github.com/apache/arrow-rs/issues/4328
> >
> > On Sun, Jun 9, 2024 at 3:40 AM Sutou Kouhei <k...@clear-code.com> wrote:
> >
> >> Hi,
> >>
> >> Thanks for your comment.
> >>
> >> You may misunderstand my motivation.
> >>
> >> This proposal doesn't change the Apache Arrow columnar
> >> format. For example, this proposal doesn't save statistics
> >> read from Apache Parquet file to Apache Arrow IPC file. This
> >> proposal just attaches statistics read from Apache Parquet
> >> file to in-memory arrow::Array C++ objects. It's just to
> >> make in-memory arrow::Array C++ objects easier to use.
> >>
> >> This proposal doesn't compute statistics from in-memory
> >> arrow::Array C++ objects. (We may want to do it later but
> >> this proposal doesn't propose it.)
> >>
> >> (Does arrow-rs compute statistics from in-memory Arrow
> >> array?)
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In <CAOYPqDBM0ocns5=t6anzg-bqwmgkervhw_5ru4qomewqtaq...@mail.gmail.com>
> >>   "Re: [DISCUSS][C++] How about adding arrow::ArrayStatistics?" on Thu, 6 Jun 2024 08:13:11 +0200,
> >>   Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> >>
> >> > Hi
> >> >
> >> > This is c++ specific, but imo the question applies more broadly.
> >> >
> >> > I understood that the rationale for stats in compressed+encoded formats
> >> > like parquet is that computing those stats has a high cost (io + decompress
> >> > + decode + aggregate). This motivates the materialization of aggregates.
> >> >
> >> > In arrow the data is already in an in-memory format (e.g. IPC+mmap, or in
> >> > the heap) and the cost is thus smaller (aggregate).
> >> >
> >> > It could be useful to quantify how much is being saved vs how much
> >> > complexity is being added to the format + implementations.
> >> >
> >> > Best,
> >> > Jorge
> >> >
> >> >
> >> > On Thu, Jun 6, 2024, 07:55 Micah Kornfield <emkornfi...@gmail.com> wrote:
> >> >
> >> >> Generally I think this is a good idea that has been proposed before but I
> >> >> don't think we could ever make progress on design.
> >> >>
> >> >> On Sun, Jun 2, 2024 at 7:17 PM Sutou Kouhei <k...@clear-code.com> wrote:
> >> >>
> >> >> > Hi,
> >> >> >
> >> >> > Related GitHub issue:
> >> >> > https://github.com/apache/arrow/issues/41909
> >> >> >
> >> >> > How about adding arrow::ArrayStatistics?
> >> >> >
> >> >> > Motivation:
> >> >> >
> >> >> > Apache Arrow format data doesn't have statistics. (We can
> >> >> > add statistics as metadata but there isn't any standard way
> >> >> > for it.)
> >> >> >
> >> >> > But a source of Apache Arrow format data such as Apache
> >> >> > Parquet format data may have statistics. We can get the
> >> >> > source statistics via source reader such as
> >> >> > parquet::ColumnChunkMetaData::statistics() but can't get
> >> >> > them from read Apache Arrow format data. If we want to use
> >> >> > the source statistics, we need to keep the source reader.
> >> >> >
> >> >> > Proposal:
> >> >> >
> >> >> > How about adding arrow::ArrayStatistics or something and
> >> >> > attaching source statistics to read arrow::Array? If source
> >> >> > statistics are attached to read arrow::Array, we don't need
> >> >> > to keep a source reader to get source statistics.
> >> >> >
> >> >> > What do you think about this idea?
> >> >> >
> >> >> >
> >> >> > NOTE: I haven't thought about the arrow::ArrayStatistics
> >> >> > details yet. We'll be able to use parquet::Statistics and
> >> >> > its family as a reference.
> >> >> > https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h
> >> >> >
> >> >> >
> >> >> > Thanks,
> >> >> > --
> >> >> > kou
> >> >> >
> >> >>
> >> >
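
For illustration only: the point at the top of this message (one array per
statistic item, so that each item is extracted only when needed) can be
sketched roughly as below. This is a minimal sketch using only the Rust
standard library, not the arrow-rs or DataFusion API. RowGroupStats,
StatisticsExtractor, its methods, and prune_gt are hypothetical names made up
for this example, and plain Vec<Option<_>> values stand in for Arrow arrays.

```rust
/// Hypothetical stand-in for the statistics stored in Parquet metadata
/// for one column in one row group. `None` means the statistic is missing.
#[derive(Debug, Clone, PartialEq)]
pub struct RowGroupStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
    pub null_count: Option<u64>,
}

/// Hypothetical extractor with one method per statistic item. Each method
/// builds a single "array" (here a Vec) with one entry per row group, so a
/// caller that only needs maxes never materializes mins or null counts.
pub struct StatisticsExtractor<'a> {
    row_groups: &'a [RowGroupStats],
}

impl<'a> StatisticsExtractor<'a> {
    pub fn new(row_groups: &'a [RowGroupStats]) -> Self {
        Self { row_groups }
    }

    /// One entry per row group: the minimum value, if known.
    pub fn mins(&self) -> Vec<Option<i64>> {
        self.row_groups.iter().map(|rg| rg.min).collect()
    }

    /// One entry per row group: the maximum value, if known.
    pub fn maxes(&self) -> Vec<Option<i64>> {
        self.row_groups.iter().map(|rg| rg.max).collect()
    }

    /// One entry per row group: the null count, if known.
    pub fn null_counts(&self) -> Vec<Option<u64>> {
        self.row_groups.iter().map(|rg| rg.null_count).collect()
    }
}

/// Row group pruning for the predicate `col > threshold`.
/// Returns one flag per row group: `true` means the row group may contain
/// matching rows and must be read, `false` means it can be skipped.
/// Only the maxes are extracted. A missing max keeps the row group,
/// because nothing can be proven about it.
pub fn prune_gt(extractor: &StatisticsExtractor<'_>, threshold: i64) -> Vec<bool> {
    extractor
        .maxes()
        .into_iter()
        .map(|max| match max {
            Some(m) => m > threshold,
            None => true,
        })
        .collect()
}
```

With this shape, a predicate such as `col > 15` touches only the maxes. A
RecordBatch-style result would instead build every statistic column up front,
whether or not the caller uses it.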