I think we are all on the same page. By that statement, I meant that we should
not assume the current sort order is always applied to all files in the table,
as that would require rewriting data immediately when we change the sort order.
Also, different parts of the table can be ordered differen
Yes, I agree. My point is that we have to support cases where data is not
yet optimized. That's why I suggest we match the sort order of deltas with
the sort order of data files. Most of the time this should be fine, but we
can't assume that it always will be.
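
A minimal sketch of that check, assuming each file records the sort order it was written with. The names here (`FileInfo`, `can_merge_sorted`, `choose_strategy`, and the strategy labels) are hypothetical and not part of Iceberg's API. The idea is only that a reader picks the sorted path when both files declare the same known order, and otherwise falls back.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical model, not Iceberg's API: a sort order is a tuple of column
# names; None means the file is unsorted or its order is unknown.
SortOrder = Optional[Tuple[str, ...]]


@dataclass(frozen=True)
class FileInfo:
    path: str
    sort_order: SortOrder


def can_merge_sorted(data_file: FileInfo, delta_file: FileInfo) -> bool:
    """Return True only when both files declare the same, known sort order."""
    return (
        data_file.sort_order is not None
        and data_file.sort_order == delta_file.sort_order
    )


def choose_strategy(data_file: FileInfo, delta_file: FileInfo) -> str:
    """Pick a (hypothetical) strategy label for applying a delta to a data file."""
    # Never assume the table-level order was applied; check the files themselves.
    if can_merge_sorted(data_file, delta_file):
        return "sorted-merge"
    return "fallback"
```

With this shape, data that has not been optimized yet still reads correctly; it just takes the slower path until it is rewritten.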
On Thu, Jul 18, 2019 at 10:09 AM Owen O'Ma
I agree that we need to manage changes to the sort order, just like we need
to handle changes to the schema. Neither one should require rewriting data
immediately, but when data is compacted or restated, it could be sorted to
the new order.
.. Owen
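
To illustrate the analogy with schema changes, here is a rough sketch in which the table keeps every sort order it has had under an ID and each file remembers the ID it was written with. Everything in it (`Table`, `set_sort_order`, `files_needing_rewrite`, `compact`) is a made-up name for illustration, not Iceberg's metadata model. Changing the order touches no files; only compaction moves a file to the new order.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class Table:
    # Hypothetical metadata: order ID -> column names, plus the current ID.
    sort_orders: Dict[int, Tuple[str, ...]] = field(default_factory=dict)
    current_order_id: Optional[int] = None
    # File path -> ID of the order it was written with (None = unsorted).
    files: Dict[str, Optional[int]] = field(default_factory=dict)

    def set_sort_order(self, columns: Sequence[str]) -> int:
        """Register a new desired order; existing files are left untouched."""
        new_id = len(self.sort_orders) + 1
        self.sort_orders[new_id] = tuple(columns)
        self.current_order_id = new_id
        return new_id

    def add_file(self, path: str, order_id: Optional[int] = None) -> None:
        self.files[path] = order_id

    def files_needing_rewrite(self) -> List[str]:
        """Files whose recorded order differs from the current desired order."""
        return sorted(
            path
            for path, order_id in self.files.items()
            if order_id != self.current_order_id
        )

    def compact(self, path: str) -> None:
        """Stand-in for a rewrite: the file now carries the current order."""
        self.files[path] = self.current_order_id
```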
On Thu, Jul 18, 2019 at 10:01 AM Ryan Blue wrot
> This one seems really problematic. Too many important optimizations
> depend on the file sort order. Can we have the writer verify the sort order
> as the files are written?
Even if we did, when the desired sort order changes we can't just rewrite
all of the data in the table. I think that this will
On Thu, Jul 18, 2019 at 5:30 AM Anton Okolnychyi
wrote:
> Let me summarize what we discussed here and follow up with a PR.
>
> - Iceberg should allow users to define a sort order in its metadata that
> applies to partitions.
> - We should never assume the sort order is actually applied to all files
>
Let me summarize what we discussed here and follow up with a PR.
- Iceberg should allow users to define a sort order in its metadata that applies
to partitions.
- We should never assume the sort order is actually applied to all files in the
table.
- Sort orders might evolve and change over time. Whe
I agree that Iceberg metadata should include a way to configure a desired
sort order. But I want to note that I don’t think that we can ever assume
that it has been applied. Table configuration will evolve as usage changes.
We don’t want to require rewrites when a configuration gets updated, so an
as
In order to begin prototyping, I would start with the following questions.
1) Does Iceberg need a sort spec?
- I would say yes
2) Should Iceberg allow users to define a sort spec only if the table is
bucketed?
- I would say no, as it seems valid to have partitioned and sorted
tab
My thought is just like Iceberg has to define partitioning and bucketing, it
has to define a canonical sort order. In particular, we can’t afford to have
Spark, Presto, and Hive writing files in different orders. I believe the right
approach is to define a sort order as a series of columns where