Thank you, understood.This is really neat. I think I had trouble grasping this because systems are not normally so flexible!
On Thu, 10 Oct 2019 at 17:36, Ryan Blue <rb...@netflix.com.invalid> wrote: > The old data would continue to exist in daily partitions, new data would > be written into hourly partitions. > > Iceberg as a format doesn't require you to do anything here. You can go > back and rewrite the data that you think is worth moving to hourly, or you > can leave it as it is. We want actions like rewriting old data in a new > layout to be a choice that you make, depending on whether you think it is > worth the cost of rewriting the data for the number of queries you expect > to run on it. > > On Thu, Oct 10, 2019 at 6:37 AM Elliot West <tea...@gmail.com> wrote: > >> Thank you. I now appreciate the significant benefit of the decoupling the >> query from the partition scheme. >> >> In terms of physical layout, what would an evolution of partitioning from >> daily to hourly look like? Would one need rewrite the whole table to >> achieve smaller groupings in files, or does Iceberg support the earlier >> data continuing to exist in daily partitions, with new files partitioned by >> hour? >> >> Elliot. >> >> On Tue, 8 Oct 2019 at 19:40, Ryan Blue <rb...@netflix.com.invalid> wrote: >> >>> > It is not clear to me how partition keys are distributed with respect >>> to actual files and what constraints exist for partition evolution. >>> >>> The requirement is that a file contains rows that have the same values >>> for all partition columns. If you partition by log_level and date(ts), then >>> for any given file, all rows will have the same log_level and date derived >>> from the ts field. Files are written for a partition layout because it >>> requires grouping rows to meet this requirement. >>> >>> Metadata for each partition layout is kept independently. If you evolve >>> the partitioning for a table, split planning happens for each layout >>> independently. The files that will be read are the union of the files that >>> are left after pruning in each layout. >>> >>> > If would then follow that later evolutions of partitioning schemes >>> must be derived only from the original schema and therefore, they must >>> effectively be a coarser grained rollup (i.e. a year from a date). >>> >>> This isn't a requirement because layouts are independent. You can go >>> from hourly partitions to daily partitions. >>> >>> > Most importantly, queries no longer depend on a table’s physical >>> layout. >>> >>> This statement means that queries depend on table columns that will not >>> change. The underlying physical layout is independent. In Hive, the >>> partition layout changes the table columns, but in Iceberg, you always >>> query only the table columns. Derived partition data is not directly >>> exposed, which is why we say it is "hidden". >>> >>> Because queries don't depend on partition data columns directly, the >>> partitioning can be changed. For example, if you partitioned in a Hive >>> table by ts_date (a string), then your query would need a filter like >>> ts_date > "2019-01-01". If you tried to move to hourly partitioning and >>> removed the ts_date partition column, queries would fail. In Iceberg, you'd >>> express this constraint in terms of the data instead: ts >= TIMESTAMP >>> "2019-01-02T00:00:00.000000". That way, the underlying derived partition >>> values are not part of the query and you can run the query using either >>> hourly or daily partitions, or a mix of the two. >>> >>> On Tue, Oct 8, 2019 at 10:23 AM Elliot West <tea...@gmail.com> wrote: >>> >>>> ‘If would’ → ‘it would’ >>>> ‘original schema’ → ‘original scheme’ >>>> >>>> On Tue, 8 Oct 2019 at 18:00, Elliot West <tea...@gmail.com> wrote: >>>> >>>>> Hello, >>>>> >>>>> I'm trying to understand the underlying partitioning model in Iceberg. >>>>> It is not clear to me how partition keys are distributed with respect to >>>>> actual files and what constraints exist for partition evolution. My >>>>> expectation is that to achieve reasonable read performance, sets of keys >>>>> must be assigned to specific files so that partition pruning can be >>>>> effective. If would then follow that later evolutions of partitioning >>>>> schemes must be derived only from the original schema and therefore, they >>>>> must effectively be a coarser grained rollup (i.e. a year from a date). >>>>> >>>>> Is this correct? I'm unable to discern this explicitly from the >>>>> documentation as it doesn't mention constraints and it could perhaps be >>>>> over eagerly be interpreted as describing a panacea: >>>>> >>>>> Most importantly, queries no longer depend on a table’s physical >>>>>> layout. With a separation between physical and logical, Iceberg tables >>>>>> can >>>>>> evolve partition schemes over time as data volume changes. Misconfigured >>>>>> tables can be fixed without an expensive migration. >>>>> >>>>> https://iceberg.apache.org/partitioning/#icebergs-hidden-partitioning >>>>> >>>>> Thanks for your time, >>>>> >>>>> Elliot. >>>>> >>>> >>> >>> -- >>> Ryan Blue >>> Software Engineer >>> Netflix >>> >> > > -- > Ryan Blue > Software Engineer > Netflix >