Hi,

Related GitHub issue:
https://github.com/apache/arrow/issues/41909

How about adding arrow::ArrayStatistics?

Motivation:

Data in the Apache Arrow format doesn't carry statistics. (We
can add statistics as metadata but there isn't any standard
way to do it.)

But the source of Apache Arrow format data, such as Apache
Parquet format data, may have statistics. We can get the
source statistics via the source reader, for example with
parquet::ColumnChunkMetaData::statistics(), but we can't get
them from the Apache Arrow format data that was read. If we
want to use the source statistics, we need to keep the
source reader.

Proposal:

How about adding arrow::ArrayStatistics or something similar
and attaching the source statistics to the arrow::Array that
was read? If the source statistics are attached to the
arrow::Array, we don't need to keep a source reader around
to get them.
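
A rough illustration of the idea, in plain C++ with no Arrow
dependency. Everything in it is an assumption made up for this
sketch (the sketch namespace, the field names, the Int64Array
stand-in and MayContainGreaterThan()); it is not an existing
Arrow API and not a proposed design. It only shows how a
consumer could use statistics that travel with the array
instead of asking the source reader for them:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical sketch only: none of these names exist in Arrow.
namespace sketch {

// Statistics copied from the source (e.g. a Parquet column chunk).
// Every field is optional because a source may not provide it.
struct ArrayStatistics {
  using Value = std::variant<bool, int64_t, uint64_t, double, std::string>;

  std::optional<int64_t> null_count;
  std::optional<int64_t> distinct_count;
  std::optional<Value> min;
  std::optional<Value> max;
};

// Stand-in for an array that was read from a source and carries
// the source statistics with it.
struct Int64Array {
  std::vector<int64_t> values;
  std::optional<ArrayStatistics> statistics;
};

// Example consumer: returns false only when the attached max proves
// that no value in the array can be greater than threshold.
// Returns true ("must scan") when statistics are missing or unusable.
inline bool MayContainGreaterThan(const Int64Array& array,
                                  int64_t threshold) {
  if (!array.statistics || !array.statistics->max) {
    return true;  // No statistics attached: we can't skip anything.
  }
  const int64_t* max = std::get_if<int64_t>(&*array.statistics->max);
  if (max == nullptr) {
    return true;  // Max has an unexpected type: be conservative.
  }
  return *max > threshold;
}

}  // namespace sketch
```

With something like this, a filter such as "value > 10" could
skip an array whose attached max is 5 without touching the
values and without keeping the source reader alive.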

What do you think about this idea?


NOTE: I haven't thought about the arrow::ArrayStatistics
details yet. We'll be able to use parquet::Statistics and
its family as a reference.
https://github.com/apache/arrow/blob/main/cpp/src/parquet/statistics.h


Thanks,
-- 
kou
