Hi Piyush,

You might want to consider having a separate partition stats file for each
partition spec. That way, each stats file contains just one partition
struct type and you can keep the struct unmodified. There is a way to
convert a partition struct to a string (PartitionSpec.partitionToPath), but
that is a one-way conversion: you shouldn't try to parse that string or
consider two partitions equal just because their strings are equal.
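
To illustrate why a string form is lossy, here is a rough Python sketch. The
to_path function is a hypothetical stand-in that renders name=value pairs,
not the Iceberg method itself, and the field names are made up. It only shows
that two partitions with different value types can render identically.

```python
from urllib.parse import quote

def to_path(partition):
    """Hypothetical path-style rendering: escaped name=value pairs joined by '/'."""
    return "/".join(
        f"{name}={quote(str(value), safe='')}" for name, value in partition
    )

# Two different partition tuples: an integer id versus a string id.
int_partition = (("id", 1),)
str_partition = (("id", "1"),)

# The rendered strings match even though the partitions do not.
same_string = to_path(int_partition) == to_path(str_partition)
same_partition = int_partition == str_partition

print(to_path(int_partition), to_path(str_partition))
print("strings equal:", same_string, "| partitions equal:", same_partition)
```

The type information is gone once the value is a string, so there is nothing
reliable to parse back.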

Making the file specific to a partition spec fixes the problem and allows
you to find the data in each file using the same partition predicates that
we use to locate data files. That will make it easy to find the stats for
the partitions that you're looking for based on some data or partition
query filter.
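
A minimal Python sketch of that layout, using the three rows from the
illustration in the quoted message. The helper names (split_by_spec,
find_stats) and the in-memory lists standing in for stats files are
assumptions for illustration only, not an existing API.

```python
from collections import defaultdict

# (partition_spec_id, partition, file_count, row_count), as in the quoted example.
rows = [
    (0, {"data": "a"}, 2, 2),
    (0, {"data": "b"}, 1, 1),
    (1, {"data": "c", "id": 1}, 1, 1),
]

def split_by_spec(rows):
    """Group rows so each 'stats file' holds partitions of a single spec."""
    by_spec = defaultdict(list)
    for spec_id, partition, file_count, row_count in rows:
        by_spec[spec_id].append((partition, file_count, row_count))
    return dict(by_spec)

def find_stats(stats_for_spec, predicate):
    """Return the entries whose partition matches a partition predicate."""
    return [entry for entry in stats_for_spec if predicate(entry[0])]

stats_files = split_by_spec(rows)

# Within one spec, every partition has the same fields, so a predicate
# written against that spec's fields is always safe to evaluate.
matches = find_stats(stats_files[0], lambda p: p["data"] == "a")
print(matches)
```

Because each group has one struct shape, the partition can stay a typed
struct and never needs to be flattened to a string.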

rb

On Thu, Jan 7, 2021 at 11:15 PM Piyush Vinay Hurpade <
piyush.hurp...@dremio.com> wrote:

> Hi Team,
> We need some help with our proposal for a partition-stats file within
> each snapshot. With each snapshot, we are proposing a partition-stats Avro
> file that contains information about all partitions in the table. The
> schema we decided on is *(partition_spec_id(int),
> partition(PartitionData), file_count(int), row_count(long)).* The problem is
> with the 2nd column (*partition*). When partition evolution happens, the
> schema for PartitionData (PartitionSpec) will change. Illustration:
>
> {"partition_spec_id":0,"partition":"PartitionData{data=a}","file_count":2,"row_count":2}
> {"partition_spec_id":0,"partition":"PartitionData{data=b}","file_count":1,"row_count":1}
> {"partition_spec_id":1,"partition":"PartitionData{data=c, id=1}","file_count":1,"row_count":1}
>
> This will be a problem for both readers and writers. We decided to make *the
> partition column a "String type" and serialize PartitionData to a string.*
> Here we want to confirm: *can all data types supported in Iceberg
> be serialized to a String?* For example, if a column in a table has binary
> type and we partition on it, can it be serialized to a string?
> Issue link: https://github.com/apache/iceberg/issues/1832
>
> thanks and regards
> --
>
> Piyush Hurpade
>
> Software Engineer
>
> piyush.hurp...@dremio.com
>
>
>
>

-- 
Ryan Blue
Software Engineer
Netflix
