Sorry, I was mistaken about this. We have exposed the incremental read
functionality using DataFrame options:
https://github.com/apache/iceberg/blob/d8cc2a29364e57df95c4e50f4079bacd35e4a047/spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java#L67-L68

You should be able to use it like this:

    val df = spark.read
      .format("iceberg")
      .option("start-snapshot-id", lastSnapshotId)
      .load("db.table")
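The lines linked above also name an end snapshot option, so a bounded read
between two snapshots should be possible as well. A minimal sketch, assuming
the option names start-snapshot-id and end-snapshot-id from the linked
SparkBatchQueryScan, with placeholder snapshot IDs:

    // Minimal sketch: bounded incremental read between two snapshots.
    // Option names follow the constants in the linked SparkBatchQueryScan;
    // the snapshot IDs below are placeholders.
    val firstSnapshotId = 1234L // hypothetical older snapshot
    val lastSnapshotId = 5678L  // hypothetical newer snapshot

    val incremental = spark.read
      .format("iceberg")
      .option("start-snapshot-id", firstSnapshotId) // exclusive start
      .option("end-snapshot-id", lastSnapshotId)    // inclusive end
      .load("db.table")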
I hope that helps!

rb

On Tue, Feb 9, 2021 at 5:57 PM Ryan Blue <rb...@netflix.com> wrote:

> Replies inline.
>
> On Tue, Feb 9, 2021 at 5:36 PM kkishore iiith <kkishore.ii...@gmail.com>
> wrote:
>
>> *If a file system does not support atomic renames, then you should use a
>> metastore to track tables. You can use Hive, Nessie, or Glue. We are also
>> working on a JDBC catalog.*
>>
>> 1. What would go wrong if I wrote directly to GCS from Spark via
>> Iceberg? Would we end up having data in GCS but missing the Iceberg
>> metadata for those files? Or would we just lose some snapshots during
>> multiple parallel transactions?
>
> If you choose not to use a metastore and use a "Hadoop" table instead,
> then there is no guarantee that concurrent writers won't clobber each
> other. You'll probably lose some commits when two writers commit at the
> same time with the same base version.
>
>> *Iceberg's API can tell you what files were added or removed in any
>> given snapshot. You can also use time travel to query the table at a
>> given snapshot and use SQL to find the row-level changes. We don't
>> currently support reading just the changes in a snapshot because there
>> may be deletes as well as inserts.*
>>
>> 2. I would like to clarify whether Iceberg supports an incremental query
>> like https://hudi.apache.org/docs/querying_data.html#spark-incr-query.
>> https://medium.com/adobetech/iceberg-at-adobe-88cf1950e866 talks about
>> incremental reads to query data between snapshots, but I am confused by
>> the above response and
>> http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3ca237bb81-f4da-45d9-9827-36203624f...@tencent.com%3E,
>> where you said that incremental queries are not supported natively. If
>> the latter is the only way to derive incremental data, does Iceberg use
>> predicate pushdown to get the incremental data based on the file delta,
>> since Iceberg's metadata contains file info for both snapshots?
>
> Iceberg can plan incremental scans to read data that was added since some
> snapshot. This isn't exposed through Spark yet, but could be. I've
> considered adding support for git-like `..` expressions: `SELECT * FROM
> db.table.1234..5678`.
>
> One problem with this approach is that it is limited when it encounters
> something other than an append. For example, Iceberg supports atomic
> overwrites to rewrite data in a table. When the latest snapshot is an
> overwrite, it isn't clear exactly what an incremental read should
> produce. We're open to ideas here, like producing "delete" records as
> well as "insert" records with an extra column for the operation. But this
> is something we'd need to consider.
>
> I don't think Hudi has this problem because it only supports insert and
> upsert, if I remember correctly.
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
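A note on the metastore point in the quoted reply: with Spark 3, an Iceberg
catalog backed by the Hive Metastore can be wired up through Spark
configuration, so commits go through the metastore's atomic table swap
rather than depending on file-system renames. A minimal sketch; the catalog
name and metastore URI are placeholders:

    // Minimal sketch: a Hive Metastore-backed Iceberg catalog, so that
    // concurrent commits are serialized by the metastore instead of
    // relying on atomic renames. Catalog name and URI are placeholders.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("iceberg-hive-catalog")
      .config("spark.sql.catalog.hive_cat", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.hive_cat.type", "hive")
      .config("spark.sql.catalog.hive_cat.uri", "thrift://metastore-host:9083")
      .getOrCreate()

    // Tables referenced through this catalog are tracked in the metastore.
    val df = spark.table("hive_cat.db.table")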
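The incremental planning described in the quoted reply is also reachable
through the core Java API even where an engine doesn't expose it. A minimal
sketch using TableScan.appendsBetween; the HiveCatalog construction and the
snapshot IDs are assumptions for illustration:

    // Minimal sketch: plan the files appended between two snapshots with
    // the core API. appendsBetween covers append snapshots only, which is
    // the overwrite limitation discussed above. IDs are placeholders.
    import org.apache.iceberg.catalog.TableIdentifier
    import org.apache.iceberg.hive.HiveCatalog

    val catalog = new HiveCatalog(spark.sparkContext.hadoopConfiguration)
    val table = catalog.loadTable(TableIdentifier.of("db", "table"))

    val fromSnapshotId = 1234L // placeholder: exclusive start
    val toSnapshotId = 5678L   // placeholder: inclusive end

    val scan = table.newScan().appendsBetween(fromSnapshotId, toSnapshotId)

    // Each task names a data file added within the snapshot range.
    scan.planFiles().forEach(task => println(task.file().path()))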
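And for the time-travel route to row-level changes mentioned in the quoted
reply: reading the table at two snapshots and taking set differences
recovers inserts and deletes, at the cost of comparing full snapshots. A
minimal sketch, assuming the snapshot-id read option and placeholder IDs:

    // Minimal sketch: derive row-level changes by diffing two
    // time-travel reads. This compares whole snapshots, so it is
    // correct but not cheap. Snapshot IDs are placeholders.
    val olderSnapshotId = 1234L // placeholder
    val newerSnapshotId = 5678L // placeholder

    val before = spark.read.format("iceberg")
      .option("snapshot-id", olderSnapshotId)
      .load("db.table")
    val after = spark.read.format("iceberg")
      .option("snapshot-id", newerSnapshotId)
      .load("db.table")

    val inserted = after.except(before) // rows added since the older snapshot
    val deleted = before.except(after)  // rows removed since the older snapshot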