Hi Chen,

Iceberg doesn't guarantee any order for records returned by
`IcebergGenerics`. If you need a specific order, I'd recommend either using
a query engine to sort, or reading one partition at a time and sorting
within each partition.
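
If the table is small enough to hold in memory, a client-side sort after the
read is one way to get partition order. The following is only a sketch under
that assumption: `Row` and `PartitionOrder` are hypothetical names, and `Row`
is a stand-in for the records you would get from
`IcebergGenerics.read(table).build()`, not part of Iceberg.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a record read from the table; only the
// "time" column from the question matters for ordering.
final class Row {
  final LocalDateTime time;
  final String payload;

  Row(LocalDateTime time, String payload) {
    this.time = time;
    this.payload = payload;
  }
}

final class PartitionOrder {
  // Copies the rows into a list and sorts them by day(time), the partition
  // transform from the question. List.sort is stable, so rows within the
  // same day keep the order in which the reader returned them.
  static List<Row> sortByDay(Iterable<Row> rows) {
    List<Row> sorted = new ArrayList<>();
    for (Row row : rows) {
      sorted.add(row);
    }
    sorted.sort(Comparator.comparing((Row row) -> row.time.toLocalDate()));
    return sorted;
  }

  public static void main(String[] args) {
    List<Row> rows = new ArrayList<>();
    // Deliberately out of partition order, as after a second append.
    rows.add(new Row(LocalDateTime.parse("2020-10-02T01:01:01"), "first commit"));
    rows.add(new Row(LocalDateTime.parse("2020-10-01T01:01:01"), "second commit"));
    for (Row row : sortByDay(rows)) {
      System.out.println(row.time + " " + row.payload);
    }
  }
}
```

This buffers every row before returning anything, so for larger tables the
query engine or partition-at-a-time approach is the better fit.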

Iceberg can't really guarantee order across files. The sort order that files
are written with may change over time, and Iceberg also uses the lack of a
guarantee to work faster in some cases. For example, most job planning is
done by reading manifest files in parallel, so data files aren't returned
in any particular order. Iceberg will also pack files into tasks in most
cases (though not for `IcebergGenerics`), so files can be reordered
depending on size as well.

On Thu, Mar 25, 2021 at 8:06 AM Chen Song <chen.song...@gmail.com> wrote:

> Popping up the question.
>
> On Wed, Mar 24, 2021 at 2:01 PM Chen Song <chen.song...@gmail.com> wrote:
>
>> I want to clarify the ordering semantics (if deterministic) of partitions
>> returned when reading with the Iceberg core data API.
>>
>> Say I define a table with a *time* column and partition by *day(time)*, and
>> do the following writes.
>>
>> partition (day)    time                               other data fields
>> 2020-10-01         2020-10-01 01:01:01    ...
>> 2020-10-01         2020-10-01 02:01:01    ...
>> 2020-10-02         2020-10-02 01:01:01    ...
>> 2020-10-02         2020-10-02 02:01:01    ...
>>
>> Then I read everything using something like the following.
>>
>>     IcebergGenerics.read(table).build();
>>
>> I did see rows returned in the right order in terms of partitions. But
>> if I append the same data again and read again, I see rows returned like this:
>>
>> 2020-10-01         2020-10-01 01:01:01    ...
>> 2020-10-01         2020-10-01 02:01:01    ...
>> 2020-10-02         2020-10-02 01:01:01    ...
>> 2020-10-02         2020-10-02 02:01:01    ...
>> 2020-10-01         2020-10-01 01:01:01    ...
>> 2020-10-01         2020-10-01 02:01:01    ...
>> 2020-10-02         2020-10-02 01:01:01    ...
>> 2020-10-02         2020-10-02 02:01:01    ...
>>
>> In other words, the rows are returned in order first by commit time, then
>> by partition *day*. If I want to ensure the data from partition
>> 2020-10-01 is always returned before 2020-10-02 in the above example, is
>> there a way to configure the reader to do that? I checked the reader API
>> and cannot seem to find a method to do that.
>>
>> Please note that I am NOT talking about sorting within a partition,
>> which I know has to be enforced by the writer.
>>
>> --
>> Chen Song
>>
>>
>
> --
> Chen Song
>
>

-- 
Ryan Blue
Software Engineer
Netflix
