About proposal for including Partition stats for Iceberg tables

Piyush Vinay Hurpade Thu, 07 Jan 2021 23:15:34 -0800

Hi Team,
We need some help with our proposal for a partition-stats file within
each snapshot. With each snapshot, we propose writing a partition-stats Avro
file that contains information about all partitions in the table. The
schema we have decided on is *(partition_spec_id(int),
partition(PartitionData), file_count(int), row_count(long)).* The problem is
with the second column, *partition*. When partition evolution happens, the
schema for PartitionData (PartitionSpec) will change. For illustration:


{"partition_spec_id":0,"partition":"PartitionData{data=a}","file_count":2,"row_count":2}
{"partition_spec_id":0,"partition":"PartitionData{data=b}","file_count":1,"row_count":1}
{"partition_spec_id":1,"partition":"PartitionData{data=c, id=1}","file_count":1,"row_count":1}

This will be a problem for both the reader and the writer. We decided to make
*the partition column a String type and serialize PartitionData to a string.*
Here we want to confirm: *can all data types supported in Iceberg be
serialized to a String?* For example, if a column in a table has a binary type
and we partition on it, can it be serialized to a string?
Issue link: https://github.com/apache/iceberg/issues/1832
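
To make the binary case concrete, here is a minimal sketch of one way a partition tuple could be written to a string and read back. It is plain Python, not Iceberg code: the helper names (encode_value, decode_value, partition_to_string, partition_from_string) and the {"base64": ...} wrapper are hypothetical, and using JSON with Base64 for binary values is only an assumption about what a string encoding might look like.

```python
import base64
import json
from datetime import date
from decimal import Decimal


def encode_value(value):
    """Map one partition value to something JSON can hold (hypothetical helper)."""
    if value is None:
        return None
    if isinstance(value, (bytes, bytearray)):
        # Binary is not valid text in general, so wrap it as Base64.
        return {"base64": base64.b64encode(bytes(value)).decode("ascii")}
    if isinstance(value, (date, Decimal)):
        # Dates, timestamps and decimals are written in their string form.
        # Reading them back as typed values would need the partition schema.
        return str(value)
    return value  # bool, int, float and str pass through unchanged


def decode_value(value):
    """Reverse encode_value for the Base64 wrapper; other values pass through."""
    if isinstance(value, dict) and set(value) == {"base64"}:
        return base64.b64decode(value["base64"])
    return value


def partition_to_string(partition):
    """Serialize a {field_name: value} partition tuple to a single string."""
    encoded = {name: encode_value(v) for name, v in partition.items()}
    return json.dumps(encoded, sort_keys=True)


def partition_from_string(text):
    """Parse a string produced by partition_to_string back into a dict."""
    return {name: decode_value(v) for name, v in json.loads(text).items()}


if __name__ == "__main__":
    # A row shaped like the ones above, with a binary partition value.
    row = {
        "partition_spec_id": 1,
        "partition": partition_to_string({"data": b"\x00\xff", "id": 1}),
        "file_count": 1,
        "row_count": 1,
    }
    print(json.dumps(row))
```

In this sketch the binary value survives the round trip only because it is Base64-wrapped, and the field names travel inside the string, so rows written under different partition specs can sit in the same String column.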

Thanks and regards,
-- 

Piyush Hurpade

Software Engineer

piyush.hurp...@dremio.com
