Hi Iceberg devs, We have been considering the problem of Time-sliced incremental scan (i.e., reading data that is committed between two timestamps), and I ran into this thread [1] in the Iceberg dev mailing list. The summary of the thread is that incremental scan should leverage snapshot IDs as opposed to timestamps since there is no guarantee that commit timestamps are linear.
I wanted to follow up on that discussion to see if folks are open to still supporting time-sliced incremental scan APIs with the caveat above. The reasons are many fold: * Time-slice APIs are more human friendly and simplify the need for a state store to track last read snapshot IDs. While some sort of state recording may be required in some cases, it is not always the case, especially if data consumption happens at regular intervals. * I understand the caveat discussed in the thread still applies to the existing "TIMESTAMP AS OF" API, yet the API is supported. By extension, I think it is fair to extend the support to time-sliced incremental scan. * Iceberg already provides a deterministic function to translate a timestamp to a snapshot ID [2]. Since a table can be scanned by snapshot range, I think it is reasonable to allow scanning by timestamp range, since the translation mechanism already exists. * Incremental scan using timestamp range is part of the SQL Standard (e.g. "VERSIONS BETWEEN SYSTEM TIME ... AND ...") and is supported by some existing engines. See [3] for SQL Server support. * Conceptually, it is possible to implement the same query semantics at the SQL level using existing APIs and operators such as "TIMESTAMP AS OF" and "EXCEPT" (e.g., by selecting: (T as of timestamp1) EXCEPT (T as of timestamp2)). It sounds that incremental scan is an optimization to push the differencing operation to the data source as opposed to letting the engine deal with it, so it is better to implement the SQL shorthand, and its corresponding data source optimization. Therefore, I would like to ask if we can proceed with supporting an API along the lines of "VERSIONS BETWEEN SYSTEM TIME ... AND ...", and consider the discussion in [1] as a caveat that folks using this API (or existing timestamp APIs) have to keep in mind, until the implementation guarantees linearity at some point in the future. Thanks, Walaa. [1] https://www.mail-archive.com/dev@iceberg.apache.org/msg01504.html [2] https://github.com/apache/iceberg/blob/1a11038baaca9f4cf03674654a6d9a85f39a1ce1/core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java#L265 [3] https://docs.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver15