Yeah, I'd use IcebergGenerics to read a table. That's the simplest way.
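
For the "one partition at a time" approach discussed below, here is a minimal sketch of the client-side part, written without any Iceberg dependency so it can run on its own. The class and method names (`PartitionOrderSketch`, `dayBoundsMicros`, `orderByDayThenTime`) are hypothetical and not part of Iceberg. It assumes the `time` column holds microsecond timestamps, that days are UTC days, and that the rows fit in memory. The bounds from `dayBoundsMicros` could plausibly be used as a per-day filter, for example with `Expressions.greaterThanOrEqual` and `Expressions.lessThan` combined through `Expressions.and` and passed to `IcebergGenerics.read(table).where(...)`, but that wiring is not shown or tested here.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Iceberg.
final class PartitionOrderSketch {
  private static final long MICROS_PER_DAY = 86_400_000_000L;

  private PartitionOrderSketch() {}

  // Returns {inclusive lower bound, exclusive upper bound} of a UTC day,
  // in microseconds since the Unix epoch.
  static long[] dayBoundsMicros(LocalDate day) {
    long start = Math.multiplyExact(day.toEpochDay(), MICROS_PER_DAY);
    return new long[] {start, Math.addExact(start, MICROS_PER_DAY)};
  }

  // Groups row timestamps by day, visits days in ascending order, and sorts
  // within each day. The input order (for example, commit order) is ignored.
  static List<LocalDateTime> orderByDayThenTime(List<LocalDateTime> rows) {
    Map<LocalDate, List<LocalDateTime>> byDay = new TreeMap<>();
    for (LocalDateTime t : rows) {
      byDay.computeIfAbsent(t.toLocalDate(), d -> new ArrayList<>()).add(t);
    }
    List<LocalDateTime> ordered = new ArrayList<>();
    for (List<LocalDateTime> dayRows : byDay.values()) {
      Collections.sort(dayRows);
      ordered.addAll(dayRows);
    }
    return ordered;
  }
}
```

Applied to the doubled-append example from the original question, `orderByDayThenTime` would return all 2020-10-01 rows before any 2020-10-02 rows, whichever commit they came from.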

On Thu, Mar 25, 2021 at 11:49 AM Chen Song <chen.song...@gmail.com> wrote:

> Thanks Ryan. Reading one partition at a time sounds like the logical thing
> to do in my case.
>
> I cannot use a query engine for now. In that case, is IcebergGenerics
> still the best way to read via the core API?
>
> On Thu, Mar 25, 2021 at 2:16 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Hi Chen,
>>
>> Iceberg doesn't guarantee any order for records returned by
>> `IcebergGenerics`. If you want a specific order, I'd recommend either using
>> a query engine to sort, or reading a partition at a time and then sorting
>> within that partition.
>>
>> Iceberg can't really guarantee order across files. The sort order files
>> are written with may change over time, and Iceberg will also use the lack
>> of a guarantee to work faster in some cases. For example, most job planning
>> is done by reading manifest files in parallel so there isn't an order that
>> data files are returned in. Iceberg will also pack files into tasks in most
>> cases (though not for `IcebergGenerics`) so files can be reordered
>> depending on size as well.
>>
>> On Thu, Mar 25, 2021 at 8:06 AM Chen Song <chen.song...@gmail.com> wrote:
>>
>>> Bumping this question.
>>>
>>> On Wed, Mar 24, 2021 at 2:01 PM Chen Song <chen.song...@gmail.com>
>>> wrote:
>>>
>>>> I want to clarify the ordering semantics (and whether they are
>>>> deterministic) for partitions returned when reading with the Iceberg
>>>> core data API.
>>>>
>>>> Say I define a table with a *time* column, partition it by *day(time)*,
>>>> and do the following writes.
>>>>
>>>> partition (day)    time                   other data fields
>>>> 2020-10-01         2020-10-01 01:01:01    ...
>>>> 2020-10-01         2020-10-01 02:01:01    ...
>>>> 2020-10-02         2020-10-02 01:01:01    ...
>>>> 2020-10-02         2020-10-02 02:01:01    ...
>>>>
>>>> Then I read everything using something like the following:
>>>>
>>>>     IcebergGenerics.read(table).build();
>>>>
>>>> I did see rows returned in the right order in terms of partitions. But
>>>> if I append the same data again and read again, I see rows returned like
>>>> this:
>>>>
>>>> 2020-10-01         2020-10-01 01:01:01    ...
>>>> 2020-10-01         2020-10-01 02:01:01    ...
>>>> 2020-10-02         2020-10-02 01:01:01    ...
>>>> 2020-10-02         2020-10-02 02:01:01    ...
>>>> 2020-10-01         2020-10-01 01:01:01    ...
>>>> 2020-10-01         2020-10-01 02:01:01    ...
>>>> 2020-10-02         2020-10-02 01:01:01    ...
>>>> 2020-10-02         2020-10-02 02:01:01    ...
>>>>
>>>> In other words, the rows are returned ordered first by commit time and
>>>> then by partition *day*. If I want to ensure that data from partition
>>>> 2020-10-01 is always returned before 2020-10-02 in the above example, is
>>>> there a way to configure the reader to do that? I checked the reader API
>>>> and cannot seem to find a method for it.
>>>>
>>>> Please note that I am NOT talking about sorting within a partition,
>>>> which I know has to be enforced by the writer.
>>>>
>>>> --
>>>> Chen Song
>>>>
>>>>
>>>
>>> --
>>> Chen Song
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Chen Song
>
>

-- 
Ryan Blue
Software Engineer
Netflix
