Re: Hidden partitioning clarification

Ryan Blue Thu, 10 Oct 2019 09:36:53 -0700

The old data would continue to exist in daily partitions, and new data would
be written into hourly partitions.


Iceberg as a format doesn't require you to do anything here. You can go
back and rewrite the data that you think is worth moving to hourly, or you
can leave it as it is. We want actions like rewriting old data in a new
layout to be a choice that you make, depending on whether you think it is
worth the cost of rewriting the data for the number of queries you expect
to run on it.
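
To make the mixed layout concrete, here is a rough sketch of how planning over
daily and hourly files could look. It is a toy Python model and not Iceberg's
API: DataFile, plan_files, and the file paths are hypothetical names made up
for illustration. The only parts taken from the format are that the day and
hour transforms produce days and hours since the 1970-01-01 epoch, and that
each layout is pruned independently with the results unioned.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_ordinal(ts):
    """Days since the epoch: the value a day transform derives from ts."""
    return (ts - EPOCH) // timedelta(days=1)


def hour_ordinal(ts):
    """Hours since the epoch: the value an hour transform derives from ts."""
    return (ts - EPOCH) // timedelta(hours=1)


# One entry per partition layout the table has ever used.
TRANSFORMS = {"day": day_ordinal, "hour": hour_ordinal}


@dataclass(frozen=True)
class DataFile:
    path: str       # hypothetical file location
    layout: str     # which layout the file was written under
    partition: int  # derived partition value shared by every row in the file


def plan_files(files, ts_lower):
    """Return files that may contain rows with ts >= ts_lower.

    The filter on ts is converted to a bound on the derived partition value
    separately for each layout; the result is the union of what survives.
    """
    selected = []
    for layout, transform in TRANSFORMS.items():
        bound = transform(ts_lower)
        selected += [
            f for f in files if f.layout == layout and f.partition >= bound
        ]
    return selected


def utc(*args):
    return datetime(*args, tzinfo=timezone.utc)


files = [
    DataFile("daily/2019-01-01.parquet", "day", day_ordinal(utc(2019, 1, 1))),
    DataFile("daily/2019-01-02.parquet", "day", day_ordinal(utc(2019, 1, 2))),
    DataFile("hourly/2019-10-10-08.parquet", "hour",
             hour_ordinal(utc(2019, 10, 10, 8))),
]

# The query only mentions ts, never a partition column.
selected = plan_files(files, utc(2019, 1, 2))
```

The filter is written against ts only, so the same call works whether the
table holds daily files, hourly files, or both. Rewriting the old daily files
into hourly ones would change which files come back, not the query.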

On Thu, Oct 10, 2019 at 6:37 AM Elliot West <tea...@gmail.com> wrote:

> Thank you. I now appreciate the significant benefit of decoupling the
> query from the partition scheme.
>
> In terms of physical layout, what would an evolution of partitioning from
> daily to hourly look like? Would one need to rewrite the whole table to
> achieve smaller groupings in files, or does Iceberg support the earlier
> data continuing to exist in daily partitions, with new files partitioned by
> hour?
>
> Elliot.
>
> On Tue, 8 Oct 2019 at 19:40, Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> > It is not clear to me how partition keys are distributed with respect
>> to actual files and what constraints exist for partition evolution.
>>
>> The requirement is that a file contains rows that have the same values
>> for all partition columns. If you partition by log_level and date(ts), then
>> for any given file, all rows will have the same log_level and date derived
>> from the ts field. Files are written for a partition layout because it
>> requires grouping rows to meet this requirement.
>>
>> Metadata for each partition layout is kept independently. If you evolve
>> the partitioning for a table, split planning happens for each layout
>> independently. The files that will be read are the union of the files that
>> are left after pruning in each layout.
>>
>> > If would then follow that later evolutions of partitioning schemes must
>> be derived only from the original schema and therefore, they must
>> effectively be a coarser grained rollup (i.e. a year from a date).
>>
>> This isn't a requirement because layouts are independent. You can go from
>> hourly partitions to daily partitions.
>>
>> > Most importantly, queries no longer depend on a table’s physical layout.
>>
>> This statement means that queries depend on table columns that will not
>> change. The underlying physical layout is independent. In Hive, the
>> partition layout changes the table columns, but in Iceberg, you always
>> query only the table columns. Derived partition data is not directly
>> exposed, which is why we say it is "hidden".
>>
>> Because queries don't depend on partition data columns directly, the
>> partitioning can be changed. For example, if you partitioned in a Hive
>> table by ts_date (a string), then your query would need a filter like
>> ts_date > "2019-01-01". If you tried to move to hourly partitioning and
>> removed the ts_date partition column, queries would fail. In Iceberg, you'd
>> express this constraint in terms of the data instead: ts >= TIMESTAMP
>> "2019-01-02T00:00:00.000000". That way, the underlying derived partition
>> values are not part of the query and you can run the query using either
>> hourly or daily partitions, or a mix of the two.
>>
>> On Tue, Oct 8, 2019 at 10:23 AM Elliot West <tea...@gmail.com> wrote:
>>
>>> ‘If would’ → ‘it would’
>>> ‘original schema’ → ‘original scheme’
>>>
>>> On Tue, 8 Oct 2019 at 18:00, Elliot West <tea...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm trying to understand the underlying partitioning model in Iceberg.
>>>> It is not clear to me how partition keys are distributed with respect to
>>>> actual files and what constraints exist for partition evolution. My
>>>> expectation is that to achieve reasonable read performance, sets of keys
>>>> must be assigned to specific files so that partition pruning can be
>>>> effective. If would then follow that later evolutions of partitioning
>>>> schemes must be derived only from the original schema and therefore, they
>>>> must effectively be a coarser grained rollup (i.e. a year from a date).
>>>>
>>>> Is this correct? I'm unable to discern this explicitly from the
>>>> documentation, as it doesn't mention constraints and could perhaps be
>>>> over-eagerly interpreted as describing a panacea:
>>>>
>>>>> Most importantly, queries no longer depend on a table’s physical
>>>>> layout. With a separation between physical and logical, Iceberg tables can
>>>>> evolve partition schemes over time as data volume changes. Misconfigured
>>>>> tables can be fixed without an expensive migration.
>>>>
>>>> https://iceberg.apache.org/partitioning/#icebergs-hidden-partitioning
>>>>
>>>> Thanks for your time,
>>>>
>>>> Elliot.
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix
