Hey Denys,
I have gone through the proposal, and it seems that adding more stats to the
partition stats spec is requested by both the Trino and Hive communities.
I feel that keeping column-level stats per partition will definitely help the
query planner make better decisions.
I saw that document didn't get en
Hi All,
After reviewing Iceberg's proposals for stats, checking the code, and reading
the comments, I've created a DRAFT proposal for partition-level column
stats. It would be great to continue the discussion on this topic and share ideas.
https://docs.google.com/document/d/11Rp-irqb4L4Qpdxr6l8
Thanks All for the reactions.
I wanted to emphasize that Hive's StatsObject was shared as an example, with the
suggestion to adapt it for Iceberg as `PartitionColumnStats` (i.e. use column
ids, drop name/type, etc.).
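
To make the suggestion concrete, here is a minimal sketch of what an adapted object could look like. It is purely illustrative: the field set, the accessor names, and the use of `Long` bounds are assumptions made for this example, not an agreed spec and not an existing Iceberg or Hive API. The only point it shows is keying stats by column (field) id rather than by name/type.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: not part of Iceberg, Hive, or the Puffin spec.
public class PartitionColumnStats {
    private final int fieldId;      // Iceberg column (field) id, replaces colName/colType
    private final long valueCount;  // total values in the partition, nulls included
    private final long nullCount;   // number of null values
    private final Long lowerBound;  // assumed numeric column; null if unknown
    private final Long upperBound;  // assumed numeric column; null if unknown

    public PartitionColumnStats(int fieldId, long valueCount, long nullCount,
                                Long lowerBound, Long upperBound) {
        this.fieldId = fieldId;
        this.valueCount = valueCount;
        this.nullCount = nullCount;
        this.lowerBound = lowerBound;
        this.upperBound = upperBound;
    }

    public int fieldId() { return fieldId; }
    public long valueCount() { return valueCount; }
    public long nullCount() { return nullCount; }
    public Long lowerBound() { return lowerBound; }
    public Long upperBound() { return upperBound; }

    public static void main(String[] args) {
        // Stats for one partition, looked up by field id instead of column name.
        Map<Integer, PartitionColumnStats> statsByFieldId = new HashMap<>();
        statsByFieldId.put(3, new PartitionColumnStats(3, 1000L, 12L, 1L, 500L));

        PartitionColumnStats stats = statsByFieldId.get(3);
        System.out.println("field " + stats.fieldId() + ": nulls=" + stats.nullCount());
    }
}
```

Keying by field id means a column rename would not invalidate the stored stats, which is the main reason to drop name/type from the Hive object.
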
As Ryan mentioned, column upper/lower bounds, counts, null value and
Thanks Denys for starting this discussion!
Thanks Ryan, I agree it would be better to have engine-agnostic data
structures in the Blobs we maintain in the Iceberg project, at least for
the "standard blob types".
Note, however, that the Puffin format is intentionally open-ended. An application
can put a
Thanks for proposing this.
My main concern is that this doesn't seem to be aimed at standardizing this
metadata, but rather a way to pass existing Hive structures in a different
way. I commented on the PR, but I'll carry it over here for this discussion.
Iceberg already supports tracking column l
Hi Gabor,
Thanks for your feedback!
> In that use case however, we'd lose the stats we got previously from HMS
For Iceberg tables, Hive computes the same stats object that was previously
persisted to HMS and stores it in a Puffin file. So there shouldn't be any changes for
Impala other than changing the
Hi Denys,
Thanks for raising this! I think extending the Puffin spec with additional
column stats would make sense.
I saw the PR for the Puffin spec at some point late last year and I also
had it in my plans to revive it in a way. My motivation is that Impala
currently uses a lot of stats from H
One option is to standardize the schema of Hive's ColStatistics object and use
it in Iceberg:
class ColStatistics {
  static class Range {
    Number minValue;
    Number maxValue;
  }
  String colName;
  String colType;
  long countDistinct;
  long numNulls;
  double avgColLen;
  long numTrues;
  lo
Sorry, here is the valid doc PR link:
https://github.com/apache/iceberg-docs/pull/269