Replies inline.

On Tue, Feb 9, 2021 at 5:36 PM kkishore iiith <kkishore.ii...@gmail.com>
wrote:

> *If a file system does not support atomic renames, then you should use a
> metastore to track tables. You can use Hive, Nessie, or Glue. We also are
> working on a JDBC catalog.*
>
> 1. What would go wrong if I write directly to GCS from Spark via Iceberg?
> Would we end up with data in GCS but missing the Iceberg metadata for
> those files? Or would we just lose some snapshots during multiple
> parallel transactions?
>

If you choose not to use a metastore and use a "Hadoop" table instead, then
there is no guarantee that concurrent writers won't clobber each other.
When two writers commit at the same time with the same base version, one of
those commits will probably be lost.
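To make the failure mode concrete, here is a toy sketch (not Iceberg's actual code; the class and names are made up for illustration) of why the commit protocol needs an atomic rename. A commit is effectively a compare-and-swap on the table's version pointer; on a file system without atomic rename, the check and the write are separate steps, so two writers starting from the same base can both "succeed" and one commit is silently overwritten:

```python
import threading

class TableVersionPointer:
    """Toy model of a Hadoop-table version pointer.

    commit() succeeds only if the table is still at the base version the
    writer started from -- a compare-and-swap. The lock stands in for the
    atomic rename; remove it, and the check-then-write race is exactly how
    concurrent commits get lost.
    """

    def __init__(self) -> None:
        self._version = 0
        self._lock = threading.Lock()  # stands in for the atomic rename

    def commit(self, base_version: int) -> bool:
        with self._lock:
            if self._version != base_version:
                # Someone else committed first; caller must re-read the
                # table state and retry from the new base version.
                return False
            self._version = base_version + 1
            return True
```

With this model, if two writers both read version 0, the first `commit(0)` succeeds and the second fails cleanly and retries, instead of both succeeding and one clobbering the other.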


> *Iceberg's API can tell you what files were added or removed in any given
> snapshot. You can also use time travel to query the table at a given
> snapshot and use SQL to find the row-level changes. We don't currently
> support reading just the changes in a snapshot because there may be deletes
> as well as inserts.*
>
> 2. I would like to further clarify whether Iceberg supports incremental
> queries like
> https://hudi.apache.org/docs/querying_data.html#spark-incr-query.
> https://medium.com/adobetech/iceberg-at-adobe-88cf1950e866 talks about
> incremental reads to query data between snapshots. But I am confused by
> the above response and
> http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3ca237bb81-f4da-45d9-9827-36203624f...@tencent.com%3E
> where you said that incremental queries are not supported natively. If
> the latter way
> <http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3ca237bb81-f4da-45d9-9827-36203624f...@tencent.com%3E>
> is the only way to derive incremental data, does Iceberg use predicate
> pushdown to get the incremental data based on the file delta, since
> Iceberg's metadata contains file info for both snapshots?
>

Iceberg can plan incremental scans to read data that was added since some
snapshot. This isn't exposed through Spark yet, but could be. I've
considered adding support for git-like `..` expressions: `SELECT * FROM
db.table.1234..5678`.
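Conceptually, an append-only incremental scan is just a file-delta between two snapshots' metadata. A minimal sketch of the idea (the function and the dict-of-sets representation are made up for illustration; the real implementation reads manifest files):

```python
def incremental_files(snapshots: dict, from_id: int, to_id: int) -> set:
    """Data files added between two snapshots, assuming append-only history.

    `snapshots` maps snapshot-id -> set of data-file paths, standing in for
    the file lists in Iceberg's manifests. Because each snapshot's metadata
    enumerates its files, the incremental read is a set difference -- no
    scan of the data itself is needed to plan it.
    """
    return snapshots[to_id] - snapshots[from_id]

snapshots = {
    1234: {"a.parquet"},
    5678: {"a.parquet", "b.parquet"},
}
# Reading 1234..5678 plans only the newly added file, b.parquet.
```

This is also why the approach breaks down for non-append snapshots, as discussed next: a set difference only tells you which files appeared, not what an overwrite did to the rows.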

One problem with this approach is that it is limited when it encounters
something other than an append. For example, Iceberg supports atomic
overwrites to rewrite data in a table. When the latest snapshot is an
overwrite, it isn't clear exactly what an incremental read should produce.
We're open to ideas here, like producing "delete" records as well as
"insert" records with an extra column for the operation. But this is
something we'd need to consider.
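One way to picture the "delete records plus insert records" idea: diff the row sets before and after the snapshot and tag each change with an extra operation column. This is a toy sketch of the proposal, not an existing Iceberg feature; the function name and row representation are invented for illustration:

```python
def snapshot_changes(before: set, after: set) -> list:
    """Represent a possibly-overwriting snapshot as change records.

    Emits ("delete", row) for rows removed and ("insert", row) for rows
    added, with the operation carried as an extra column. A plain append
    produces only inserts; an overwrite produces both.
    """
    deletes = [("delete", row) for row in sorted(before - after)]
    inserts = [("insert", row) for row in sorted(after - before)]
    return deletes + inserts
```

For example, an overwrite that replaces row "x" with row "z" while keeping "y" would surface as one delete record and one insert record, so a downstream consumer can apply the change rather than just seeing new files.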

I don't think Hudi has this problem because it only supports insert and
upsert, if I remember correctly.

-- 
Ryan Blue
Software Engineer
Netflix
