Bumping this question.

On Wed, Mar 24, 2021 at 2:01 PM Chen Song <chen.song...@gmail.com> wrote:
> I want to clarify the ordering semantics (and whether they are
> deterministic) of partitions returned when reading with the Iceberg core
> data API.
>
> Say I define a table with a *time* column, partition it by *day(time)*,
> and do the following writes.
>
> partition (day)   time                  other data fields
> 2020-10-01        2020-10-01 01:01:01   ...
> 2020-10-01        2020-10-01 02:01:01   ...
> 2020-10-02        2020-10-02 01:01:01   ...
> 2020-10-02        2020-10-02 02:01:01   ...
>
> Then I read everything using something like the following:
>
> IcebergGenerics.read(table).build();
>
> I did see rows returned in the right order in terms of partitions. Then,
> if I append the same data again and read again, I see rows returned like
> this:
>
> 2020-10-01        2020-10-01 01:01:01   ...
> 2020-10-01        2020-10-01 02:01:01   ...
> 2020-10-02        2020-10-02 01:01:01   ...
> 2020-10-02        2020-10-02 02:01:01   ...
> 2020-10-01        2020-10-01 01:01:01   ...
> 2020-10-01        2020-10-01 02:01:01   ...
> 2020-10-02        2020-10-02 01:01:01   ...
> 2020-10-02        2020-10-02 02:01:01   ...
>
> In other words, the rows are returned ordered first by commit time and
> then by partition *day*. If I want to ensure that the data from partition
> 2020-10-01 is always returned before 2020-10-02 in the above example, is
> there a way to configure the reader to do that? I checked the reader API
> and cannot find a method to do that.
>
> Please note that I am NOT talking about sorting within a partition,
> which I know has to be enforced by the writer.
>
> --
> Chen Song

--
Chen Song
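
For illustration, here is one possible client-side workaround for the question above, sketched in plain Java with no Iceberg dependency: buffer the rows in the order the reader returned them and stable-sort them by day(time). The `PartitionOrder` class, its nested `Row` class, and the `orderByDay` helper are hypothetical names made up for this sketch, not part of the Iceberg API, and the sketch assumes the whole result fits in memory.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not an Iceberg class.
final class PartitionOrder {

    // Hypothetical stand-in for one row read from the table.
    static final class Row {
        final LocalDateTime time;
        final String data;

        Row(LocalDateTime time, String data) {
            this.time = time;
            this.data = data;
        }

        // Mirrors the day(time) partition transform from the example.
        LocalDate day() {
            return time.toLocalDate();
        }
    }

    // Returns a copy of the rows ordered by day(time).
    // List.sort is a stable sort, so rows that share a day keep the
    // relative order in which the reader returned them.
    static List<Row> orderByDay(List<Row> rowsInReadOrder) {
        List<Row> sorted = new ArrayList<>(rowsInReadOrder);
        sorted.sort(Comparator.comparing(Row::day));
        return sorted;
    }
}
```

Applied to the eight rows from the second read, this would return the four 2020-10-01 rows first and the four 2020-10-02 rows after them. Because the sort is stable, it does not reorder rows within a day, which matches the note that within-partition sorting is out of scope here.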