Re: Hidden partitioning clarification

Elliot West Thu, 10 Oct 2019 09:56:43 -0700

Thank you, understood. This is really neat.

I think I had trouble grasping this because systems are not normally so
flexible!




On Thu, 10 Oct 2019 at 17:36, Ryan Blue <rb...@netflix.com.invalid> wrote:

> The old data would continue to exist in daily partitions, new data would
> be written into hourly partitions.
>
> Iceberg as a format doesn't require you to do anything here. You can go
> back and rewrite the data that you think is worth moving to hourly, or you
> can leave it as it is. We want actions like rewriting old data in a new
> layout to be a choice that you make, depending on whether you think it is
> worth the cost of rewriting the data for the number of queries you expect
> to run on it.
>
> On Thu, Oct 10, 2019 at 6:37 AM Elliot West <tea...@gmail.com> wrote:
>
>> Thank you. I now appreciate the significant benefit of decoupling the
>> query from the partition scheme.
>>
>> In terms of physical layout, what would an evolution of partitioning from
>> daily to hourly look like? Would one need to rewrite the whole table to
>> achieve smaller groupings in files, or does Iceberg support the earlier
>> data continuing to exist in daily partitions, with new files partitioned by
>> hour?
>>
>> Elliot.
>>
>> On Tue, 8 Oct 2019 at 19:40, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> > It is not clear to me how partition keys are distributed with respect
>>> to actual files and what constraints exist for partition evolution.
>>>
>>> The requirement is that a file contains rows that have the same values
>>> for all partition columns. If you partition by log_level and date(ts), then
>>> for any given file, all rows will have the same log_level and date derived
>>> from the ts field. Files are written for a partition layout because it
>>> requires grouping rows to meet this requirement.
>>>
>>> Metadata for each partition layout is kept independently. If you evolve
>>> the partitioning for a table, split planning happens for each layout
>>> independently. The files that will be read are the union of the files that
>>> are left after pruning in each layout.
>>>
>>> > If would then follow that later evolutions of partitioning schemes
>>> must be derived only from the original schema and therefore, they must
>>> effectively be a coarser grained rollup (i.e. a year from a date).
>>>
>>> This isn't a requirement because layouts are independent. You can go
>>> from hourly partitions to daily partitions.
>>>
>>> > Most importantly, queries no longer depend on a table’s physical
>>> layout.
>>>
>>> This statement means that queries depend on table columns that will not
>>> change. The underlying physical layout is independent. In Hive, the
>>> partition layout changes the table columns, but in Iceberg, you always
>>> query only the table columns. Derived partition data is not directly
>>> exposed, which is why we say it is "hidden".
>>>
>>> Because queries don't depend on partition data columns directly, the
>>> partitioning can be changed. For example, if you partitioned in a Hive
>>> table by ts_date (a string), then your query would need a filter like
>>> ts_date > "2019-01-01". If you tried to move to hourly partitioning and
>>> removed the ts_date partition column, queries would fail. In Iceberg, you'd
>>> express this constraint in terms of the data instead: ts >= TIMESTAMP
>>> "2019-01-02T00:00:00.000000". That way, the underlying derived partition
>>> values are not part of the query and you can run the query using either
>>> hourly or daily partitions, or a mix of the two.
>>>
>>> On Tue, Oct 8, 2019 at 10:23 AM Elliot West <tea...@gmail.com> wrote:
>>>
>>>> ‘If would’ → ‘it would’
>>>> ‘original schema’ → ‘original scheme’
>>>>
>>>> On Tue, 8 Oct 2019 at 18:00, Elliot West <tea...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm trying to understand the underlying partitioning model in Iceberg.
>>>>> It is not clear to me how partition keys are distributed with respect to
>>>>> actual files and what constraints exist for partition evolution. My
>>>>> expectation is that to achieve reasonable read performance, sets of keys
>>>>> must be assigned to specific files so that partition pruning can be
>>>>> effective. If would then follow that later evolutions of partitioning
>>>>> schemes must be derived only from the original schema and therefore, they
>>>>> must effectively be a coarser grained rollup (i.e. a year from a date).
>>>>>
>>>>> Is this correct? I'm unable to discern this explicitly from the
>>>>> documentation as it doesn't mention constraints and it could perhaps be
>>>>> overeagerly interpreted as describing a panacea:
>>>>>
>>>>>> Most importantly, queries no longer depend on a table’s physical
>>>>>> layout. With a separation between physical and logical, Iceberg tables
>>>>>> can evolve partition schemes over time as data volume changes.
>>>>>> Misconfigured tables can be fixed without an expensive migration.
>>>>>
>>>>> https://iceberg.apache.org/partitioning/#icebergs-hidden-partitioning
>>>>>
>>>>> Thanks for your time,
>>>>>
>>>>> Elliot.
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
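
The planning behaviour Ryan describes in the thread (each partition layout is pruned independently, and the files to read are the union of what survives) can be illustrated with a small sketch. This is not Iceberg code: DataFile, WIDTH and plan_files are hypothetical names invented for the illustration, the file paths are made up, and the model assumes each file records only the start of the day or hour its rows belong to. The filter mirrors the ts >= TIMESTAMP "2019-01-02T00:00:00.000000" example from the thread.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file tracked by the table."""
    path: str
    layout: str                # "daily" or "hourly": layout the file was written under
    partition_start: datetime  # start of the day or hour shared by every row in the file


# Width of the time range covered by one partition in each layout (illustrative).
WIDTH = {"daily": timedelta(days=1), "hourly": timedelta(hours=1)}


def plan_files(files, ts_lower_bound):
    """Return the files that may hold rows with ts >= ts_lower_bound.

    Each layout is pruned on its own; the result is the union of the survivors.
    """
    selected = []
    for layout, width in WIDTH.items():
        for f in files:
            if f.layout != layout:
                continue
            # A file covers [partition_start, partition_start + width).
            # It can be skipped only if that whole range lies below the bound.
            if f.partition_start + width > ts_lower_bound:
                selected.append(f)
    return selected


# Old data written with daily partitions, newer data with hourly partitions.
files = [
    DataFile("daily/2019-01-01.parquet", "daily", datetime(2019, 1, 1)),
    DataFile("daily/2019-01-02.parquet", "daily", datetime(2019, 1, 2)),
    DataFile("hourly/2019-01-01-23.parquet", "hourly", datetime(2019, 1, 1, 23)),
    DataFile("hourly/2019-01-03-05.parquet", "hourly", datetime(2019, 1, 3, 5)),
]

# The query filters on the ts column only; it never names a partition column.
kept = plan_files(files, datetime(2019, 1, 2))
for f in kept:
    print(f.path)
```

The point of the sketch is that the filter is written against ts alone, so the same query prunes files from both layouts without knowing which layout a given file was written under.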
