Sorry, I was mistaken about this. We have exposed the incremental read
functionality using DataFrame options:
https://github.com/apache/iceberg/blob/d8cc2a29364e57df95c4e50f4079bacd35e4a047/spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java#L67-L68

You should be able to use it like this:

    val df = spark.read
      .format("iceberg")
      .option("start-snapshot-id", lastSnapshotId)
      .load("db.table")
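The lines linked above also name an end snapshot option, so a bounded read
between two snapshots should be possible as well. A minimal sketch, assuming
the option names start-snapshot-id and end-snapshot-id from the linked
SparkBatchQueryScan, with placeholder snapshot IDs:

    // Minimal sketch: bounded incremental read between two snapshots.
    // Option names follow the constants in the linked SparkBatchQueryScan;
    // the snapshot IDs below are placeholders.
    val firstSnapshotId = 1234L // hypothetical older snapshot
    val lastSnapshotId = 5678L  // hypothetical newer snapshot

    val incremental = spark.read
      .format("iceberg")
      .option("start-snapshot-id", firstSnapshotId) // exclusive start
      .option("end-snapshot-id", lastSnapshotId)    // inclusive end
      .load("db.table")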
I hope that helps!

rb

On Tue, Feb 9, 2021 at 5:57 PM Ryan Blue <rb...@netflix.com> wrote:

> Replies inline.
>
> On Tue, Feb 9, 2021 at 5:36 PM kkishore iiith <kkishore.ii...@gmail.com>
> wrote:
>
>> *If a file system does not support atomic renames, then you should use a
>> metastore to track tables. You can use Hive, Nessie, or Glue. We are also
>> working on a JDBC catalog.*
>>
>> 1. What would go wrong if I wrote directly to GCS from Spark via
>> Iceberg? Would we end up having data in GCS but missing the Iceberg
>> metadata for those files? Or would we just lose some snapshots during
>> multiple parallel transactions?
>
> If you choose not to use a metastore and use a "Hadoop" table instead,
> then there is no guarantee that concurrent writers won't clobber each
> other. You'll probably lose some commits when two writers commit at the
> same time with the same base version.
>
>> *Iceberg's API can tell you what files were added or removed in any
>> given snapshot. You can also use time travel to query the table at a
>> given snapshot and use SQL to find the row-level changes. We don't
>> currently support reading just the changes in a snapshot because there
>> may be deletes as well as inserts.*
>>
>> 2. I would like to clarify whether Iceberg supports an incremental query
>> like https://hudi.apache.org/docs/querying_data.html#spark-incr-query.
>> https://medium.com/adobetech/iceberg-at-adobe-88cf1950e866 talks about
>> incremental reads to query data between snapshots, but I am confused by
>> the above response and
>> http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3ca237bb81-f4da-45d9-9827-36203624f...@tencent.com%3E,
>> where you said that incremental queries are not supported natively. If
>> the latter is the only way to derive incremental data, does Iceberg use
>> predicate pushdown to get the incremental data based on the file delta,
>> since Iceberg's metadata contains file info for both snapshots?
>
> Iceberg can plan incremental scans to read data that was added since some
> snapshot. This isn't exposed through Spark yet, but could be. I've
> considered adding support for git-like `..` expressions: `SELECT * FROM
> db.table.1234..5678`.
>
> One problem with this approach is that it is limited when it encounters
> something other than an append. For example, Iceberg supports atomic
> overwrites to rewrite data in a table. When the latest snapshot is an
> overwrite, it isn't clear exactly what an incremental read should
> produce. We're open to ideas here, like producing "delete" records as
> well as "insert" records with an extra column for the operation. But this
> is something we'd need to consider.
>
> I don't think Hudi has this problem because it only supports insert and
> upsert, if I remember correctly.
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
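A note on the metastore point in the quoted reply: with Spark 3, an Iceberg
catalog backed by the Hive Metastore can be wired up through Spark
configuration, so commits go through the metastore's atomic table swap
rather than depending on file-system renames. A minimal sketch; the catalog
name and metastore URI are placeholders:

    // Minimal sketch: a Hive Metastore-backed Iceberg catalog, so that
    // concurrent commits are serialized by the metastore instead of
    // relying on atomic renames. Catalog name and URI are placeholders.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("iceberg-hive-catalog")
      .config("spark.sql.catalog.hive_cat", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.hive_cat.type", "hive")
      .config("spark.sql.catalog.hive_cat.uri", "thrift://metastore-host:9083")
      .getOrCreate()

    // Tables referenced through this catalog are tracked in the metastore.
    val df = spark.table("hive_cat.db.table")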
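The incremental planning described in the quoted reply is also reachable
through the core Java API even where an engine doesn't expose it. A minimal
sketch using TableScan.appendsBetween; the HiveCatalog construction and the
snapshot IDs are assumptions for illustration:

    // Minimal sketch: plan the files appended between two snapshots with
    // the core API. appendsBetween covers append snapshots only, which is
    // the overwrite limitation discussed above. IDs are placeholders.
    import org.apache.iceberg.catalog.TableIdentifier
    import org.apache.iceberg.hive.HiveCatalog

    val catalog = new HiveCatalog(spark.sparkContext.hadoopConfiguration)
    val table = catalog.loadTable(TableIdentifier.of("db", "table"))

    val fromSnapshotId = 1234L // placeholder: exclusive start
    val toSnapshotId = 5678L   // placeholder: inclusive end

    val scan = table.newScan().appendsBetween(fromSnapshotId, toSnapshotId)

    // Each task names a data file added within the snapshot range.
    scan.planFiles().forEach(task => println(task.file().path()))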
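And for the time-travel route to row-level changes mentioned in the quoted
reply: reading the table at two snapshots and taking set differences
recovers inserts and deletes, at the cost of comparing full snapshots. A
minimal sketch, assuming the snapshot-id read option and placeholder IDs:

    // Minimal sketch: derive row-level changes by diffing two
    // time-travel reads. This compares whole snapshots, so it is
    // correct but not cheap. Snapshot IDs are placeholders.
    val olderSnapshotId = 1234L // placeholder
    val newerSnapshotId = 5678L // placeholder

    val before = spark.read.format("iceberg")
      .option("snapshot-id", olderSnapshotId)
      .load("db.table")
    val after = spark.read.format("iceberg")
      .option("snapshot-id", newerSnapshotId)
      .load("db.table")

    val inserted = after.except(before) // rows added since the older snapshot
    val deleted = before.except(after)  // rows removed since the older snapshot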