Re: Hidden partitioning clarification

Ryan Blue Tue, 08 Oct 2019 11:41:13 -0700

> It is not clear to me how partition keys are distributed with respect to
actual files and what constraints exist for partition evolution.

The requirement is that a file contains rows that have the same values for
all partition columns. If you partition by log_level and date(ts), then for
any given file, all rows will have the same log_level and date derived from
the ts field. Files are written for a partition layout because it requires
grouping rows to meet this requirement.

Metadata for each partition layout is kept independently. If you evolve the
partitioning for a table, split planning happens for each layout
independently. The files that will be read are the union of the files that
are left after pruning in each layout.

> If would then follow that later evolutions of partitioning schemes must
be derived only from the original schema and therefore, they must
effectively be a coarser grained rollup (i.e. a year from a date).

This isn't a requirement because layouts are independent. You can go from
hourly partitions to daily partitions.

> Most importantly, queries no longer depend on a table’s physical layout.

This statement means that queries depend on table columns that will not
change. The underlying physical layout is independent. In Hive, the
partition layout changes the table columns, but in Iceberg, you always
query only the table columns. Derived partition data is not directly
exposed, which is why we say it is "hidden".

Because queries don't depend on partition data columns directly, the
partitioning can be changed. For example, if you partitioned in a Hive
table by ts_date (a string), then your query would need a filter like
ts_date > "2019-01-01". If you tried to move to hourly partitioning and
removed the ts_date partition column, queries would fail. In Iceberg, you'd
express this constraint in terms of the data instead: ts >= TIMESTAMP
"2019-01-02T00:00:00.000000". That way, the underlying derived partition
values are not part of the query and you can run the query using either
hourly or daily partitions, or a mix of the two.

On Tue, Oct 8, 2019 at 10:23 AM Elliot West <tea...@gmail.com> wrote:

> ‘If would’ → ‘it would’
> ‘original schema’ → ‘original scheme’
>
> On Tue, 8 Oct 2019 at 18:00, Elliot West <tea...@gmail.com> wrote:
>
>> Hello,
>>
>> I'm trying to understand the underlying partitioning model in Iceberg. It
>> is not clear to me how partition keys are distributed with respect to
>> actual files and what constraints exist for partition evolution. My
>> expectation is that to achieve reasonable read performance, sets of keys
>> must be assigned to specific files so that partition pruning can be
>> effective. If would then follow that later evolutions of partitioning
>> schemes must be derived only from the original schema and therefore, they
>> must effectively be a coarser grained rollup (i.e. a year from a date).
>>
>> Is this correct? I'm unable to discern this explicitly from the
>> documentation as it doesn't mention constraints and it could perhaps be
>> over eagerly be interpreted as describing a panacea:
>>
>> Most importantly, queries no longer depend on a table’s physical layout.
>>> With a separation between physical and logical, Iceberg tables can evolve
>>> partition schemes over time as data volume changes. Misconfigured tables
>>> can be fixed without an expensive migration.
>>
>> https://iceberg.apache.org/partitioning/#icebergs-hidden-partitioning
>>
>> Thanks for your time,
>>
>> Elliot.
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: Hidden partitioning clarification

Reply via email to