Re: Sort Spec

2019-07-19 Thread Anton Okolnychyi
I think we are all on the same page. By that statement, I meant that we should not assume the current sort order is always applied to all files in the table, as that would require rewriting data immediately when we change the sort order. Also, different parts of the table can be ordered differen

Re: Sort Spec

2019-07-18 Thread Ryan Blue
Yes, I agree. My point is that we have to support cases where data is not yet optimized. That's why I suggest we match up sort order of deltas with sort order of data files. Most of the time, this should be fine but we can't assume that it always will be. On Thu, Jul 18, 2019 at 10:09 AM Owen O'Ma

Re: Sort Spec

2019-07-18 Thread Owen O'Malley
I agree that we need to manage changes to the sort order, just like we need to handle changes to the schema. Neither one should require rewriting data immediately, but when data is compacted or restated, it could be sorted to the new order. .. Owen On Thu, Jul 18, 2019 at 10:01 AM Ryan Blue wrot

Re: Sort Spec

2019-07-18 Thread Ryan Blue
> This one seems really problematic. Too many important optimizations depend on the file sort order. Can we have the writer verify the sort order as the files are written? Even if we did, when the desired sort order changes we can't just rewrite all of the data in the table. I think that this will

Re: Sort Spec

2019-07-18 Thread Owen O'Malley
On Thu, Jul 18, 2019 at 5:30 AM Anton Okolnychyi wrote: > Let me summarize what we discussed here and follow up with a PR. > > - Iceberg should allow users to define a sort order in its metadata that > applies to partitions. > - We should never assume the sort order is actually applied to all files >

Re: Sort Spec

2019-07-18 Thread Anton Okolnychyi
Let me summarize what we discussed here and follow up with a PR. - Iceberg should allow users to define a sort order in its metadata that applies to partitions. - We should never assume the sort order is actually applied to all files in the table. - Sort orders might evolve and change over time. Whe

Re: Sort Spec

2019-07-16 Thread Ryan Blue
I agree that Iceberg metadata should include a way to configure a desired sort order. But I want to note that I don’t think that we can ever assume that it has been applied. Table configuration will evolve as use changes. We don’t want to require rewrites when a configuration gets updated, so an as

Re: Sort Spec

2019-07-04 Thread Anton Okolnychyi
In order to begin prototyping, I would start with the following questions. 1) Does Iceberg need a sort spec? - I would say yes 2) Should Iceberg allow users to define a sort spec only if the table is bucketed? - I would say no, as it seems valid to have partitioned and sorted tab

Re: Sort Spec

2019-07-01 Thread Owen O'Malley
My thought is that, just as Iceberg has to define partitioning and bucketing, it has to define a canonical sort order. In particular, we can’t afford to have Spark, Presto, and Hive writing files in different orders. I believe the right approach is to define a sort order as a series of columns where