Yeah, I'd use IcebergGenerics to read a table. That's the simplest way.

On Thu, Mar 25, 2021 at 11:49 AM Chen Song <chen.song...@gmail.com> wrote:

> Thanks Ryan. Reading one partition at a time sounds like a logical thing to
> do in my case.
>
> I cannot use a query engine for now. In that case, is IcebergGenerics
> still the best way to read via the core API?
>
> On Thu, Mar 25, 2021 at 2:16 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Hi Chen,
>>
>> Iceberg doesn't guarantee any order for records returned by
>> `IcebergGenerics`. If you want a specific order, I'd recommend using a
>> query engine to sort or to read a partition at a time and then sort within
>> that partition.
>>
>> Iceberg can't really guarantee order across files. The sort order files
>> are written with may change over time, and Iceberg will also use the lack
>> of a guarantee to work faster in some cases. For example, most job planning
>> is done by reading manifest files in parallel so there isn't an order that
>> data files are returned in. Iceberg will also pack files into tasks in most
>> cases (though not for `IcebergGenerics`) so files can be reordered
>> depending on size as well.
>>
>> On Thu, Mar 25, 2021 at 8:06 AM Chen Song <chen.song...@gmail.com> wrote:
>>
>>> Popping up the question.
>>>
>>> On Wed, Mar 24, 2021 at 2:01 PM Chen Song <chen.song...@gmail.com>
>>> wrote:
>>>
>>>> I want to clarify the ordering semantics (if deterministic) of
>>>> partitions returned when using the Iceberg core data API to read.
>>>>
>>>> Say I define a table with a *time* column and partition by *day(time)*,
>>>> and do the following writes.
>>>>
>>>> partition (day)  time                 other data fields
>>>> 2020-10-01       2020-10-01 01:01:01  ...
>>>> 2020-10-01       2020-10-01 02:01:01  ...
>>>> 2020-10-02       2020-10-02 01:01:01  ...
>>>> 2020-10-02       2020-10-02 02:01:01  ...
>>>>
>>>> Then I read everything using something like the following.
>>>>
>>>> IcebergGenerics.read(table).build();
>>>>
>>>> I did see rows returned in the right order in terms of partitions. Then
>>>> if I append the same data again and read again, I see rows returned like
>>>> this.
>>>>
>>>> 2020-10-01       2020-10-01 01:01:01  ...
>>>> 2020-10-01       2020-10-01 02:01:01  ...
>>>> 2020-10-02       2020-10-02 01:01:01  ...
>>>> 2020-10-02       2020-10-02 02:01:01  ...
>>>> 2020-10-01       2020-10-01 01:01:01  ...
>>>> 2020-10-01       2020-10-01 02:01:01  ...
>>>> 2020-10-02       2020-10-02 01:01:01  ...
>>>> 2020-10-02       2020-10-02 02:01:01  ...
>>>>
>>>> In other words, the rows are returned ordered first by commit time and
>>>> then by partition *day*. If I want to ensure the data from partition
>>>> 2020-10-01 is always returned before 2020-10-02 in the above example, is
>>>> there a way to configure the reader to do that? I checked the reader API
>>>> and cannot seem to find a method to do that.
>>>>
>>>> Please note that I am NOT talking about sorting within a partition,
>>>> which I know has to be enforced by the writer.
>>>>
>>>> --
>>>> Chen Song
>>>>
>>>>
>>>
>>> --
>>> Chen Song
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Chen Song
>
>

--
Ryan Blue
Software Engineer
Netflix
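
For reference, the "read one partition at a time" approach discussed in this thread could be sketched roughly as below. This is only an illustration under assumptions, not code from the thread: the class `PartitionOrderedRead` and the helper `readInDayOrder` are hypothetical names, and the per-day reader is passed in as a function so the sketch runs without Iceberg on the classpath. The comment shows how that function might be wired to `IcebergGenerics` with a `where` filter, assuming *time* is a timestamp-without-zone column and that string bounds are accepted for it.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: reads one day partition at a time, in ascending day order.
final class PartitionOrderedRead {

    // readDay is an assumption standing in for the real per-partition read.
    // With Iceberg it might look like this (not verified here; the returned
    // CloseableIterable should also be closed after use):
    //
    //   day -> IcebergGenerics.read(table)
    //       .where(Expressions.and(
    //           Expressions.greaterThanOrEqual("time", day + "T00:00:00"),
    //           Expressions.lessThan("time", day.plusDays(1) + "T00:00:00")))
    //       .build()
    static <R> List<R> readInDayOrder(List<LocalDate> days,
                                      Function<LocalDate, Iterable<R>> readDay) {
        // Sort a copy so the caller's list is left untouched.
        List<LocalDate> sorted = new ArrayList<>(days);
        Collections.sort(sorted);

        List<R> rows = new ArrayList<>();
        for (LocalDate day : sorted) {
            // Order across days is fixed here; order within a day is
            // whatever the per-day read happens to return.
            for (R row : readDay.apply(day)) {
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the table, mirroring the rows in the thread.
        Map<LocalDate, List<String>> fake = Map.of(
            LocalDate.parse("2020-10-02"),
            List.of("2020-10-02 01:01:01", "2020-10-02 02:01:01"),
            LocalDate.parse("2020-10-01"),
            List.of("2020-10-01 01:01:01", "2020-10-01 02:01:01"));

        List<String> rows =
            readInDayOrder(new ArrayList<>(fake.keySet()), day -> fake.get(day));
        rows.forEach(System.out::println);
    }
}
```

The helper only fixes the order across days. Rows within a day still come back in whatever order the per-day read returns them, so sort within each day as well if that order matters.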