Ryan,

It would be nice to include that on the Iceberg website, as this feature
seems like a common ask.

Our Spark job needs to return the GCS filenames, since the downstream
service loads these GCS files into BigQuery.

So we have two options here; could you please clarify both?
(1) We write to Iceberg via the Hive metastore, since GCS doesn't support
atomic renames. But would we be able to write to Iceberg via Hive using
pre-known filenames? All the examples I see use the Hive table path.

(2) After we write to GCS, we can use the Iceberg table API to add the
files that we wrote in a transaction, as follows. But does the comment
about atomicity in the previous response also apply via this route, i.e.,
losing snapshots across multiple parallel commits?

// start an append operation on the loaded table
AppendFiles appendFiles = table.newAppend();

appendFiles.appendFile(DataFiles.builder(partitionSpec)
        .withPath(..)
        .withFileSizeInBytes(..)
        .withRecordCount(..)
        .withFormat(..)
        .build());

appendFiles.commit();


On Tue, Feb 9, 2021 at 6:05 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Sorry, I was mistaken about this. We have exposed the incremental read
> functionality using DataFrame options
> <https://github.com/apache/iceberg/blob/d8cc2a29364e57df95c4e50f4079bacd35e4a047/spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchQueryScan.java#L67-L68>
> .
>
> You should be able to use it like this:
>
> val df = spark.read.format("iceberg")
>   .option("start-snapshot-id", lastSnapshotId)
>   .load("db.table")
>
> I hope that helps!
>
> rb
>
> On Tue, Feb 9, 2021 at 5:57 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> Replies inline.
>>
>> On Tue, Feb 9, 2021 at 5:36 PM kkishore iiith <kkishore.ii...@gmail.com>
>> wrote:
>>
>>> *If a file system does not support atomic renames, then you should use a
>>> metastore to track tables. You can use Hive, Nessie, or Glue. We also are
>>> working on a JDBC catalog.*
>>>
>>> 1. What would go wrong if I write directly to GCS from Spark via
>>> Iceberg? Would we end up having data in GCS but missing the Iceberg
>>> metadata for these files? Or would it just lose some snapshots during
>>> multiple parallel transactions?
>>>
>>
>> If you choose not to use a metastore and use a "Hadoop" table instead,
>> then there isn't a guarantee that concurrent writers won't clobber each
>> other. You'll probably lose some commits when two writers commit at the
>> same time with the same base version.
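>>
>> For what it's worth, a rough sketch of pointing Spark at a Hive-backed
>> Iceberg catalog (the catalog name, metastore URI, and warehouse path below
>> are placeholders, not specific to your setup):
>>
>> import org.apache.spark.sql.SparkSession;
>>
>> SparkSession spark = SparkSession.builder()
>>     .config("spark.sql.catalog.hive_cat", "org.apache.iceberg.spark.SparkCatalog")
>>     .config("spark.sql.catalog.hive_cat.type", "hive")
>>     .config("spark.sql.catalog.hive_cat.uri", "thrift://metastore-host:9083")
>>     .config("spark.sql.catalog.hive_cat.warehouse", "gs://bucket/warehouse")
>>     .getOrCreate();
>>
>> With a metastore-backed catalog, the commit is an atomic swap of the table's
>> metadata pointer in the metastore, so it does not depend on GCS rename
>> semantics.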
>>
>>
>>> *Iceberg's API can tell you what files were added or removed in any
>>> given snapshot. You can also use time travel to query the table at a given
>>> snapshot and use SQL to find the row-level changes. We don't currently
>>> support reading just the changes in a snapshot because there may be deletes
>>> as well as inserts.*
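>>>
>>> A rough sketch of that part of the API, assuming a loaded Table
>>> (addedFiles/deletedFiles are on the Snapshot interface):
>>>
>>> Snapshot snapshot = table.currentSnapshot();
>>> for (DataFile file : snapshot.addedFiles()) {
>>>   System.out.println(file.path());
>>> }
>>> // snapshot.deletedFiles() lists the files removed in the same snapshot.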
>>>
>>> 2. I would like to further clarify whether Iceberg supports an incremental
>>> query like
>>> https://hudi.apache.org/docs/querying_data.html#spark-incr-query.
>>> https://medium.com/adobetech/iceberg-at-adobe-88cf1950e866 talks about
>>> incremental reads to query data between snapshots, but I am confused by
>>> the above response and
>>> http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3ca237bb81-f4da-45d9-9827-36203624f...@tencent.com%3E
>>> where you said that incremental query is not supported natively. If the
>>> latter way
>>> <http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3ca237bb81-f4da-45d9-9827-36203624f...@tencent.com%3E>
>>> is the only way to derive incremental data, does Iceberg use predicate
>>> pushdown to get the incremental data based on the file delta, since
>>> Iceberg's metadata contains file info for both snapshots?
>>>
>>
>> Iceberg can plan incremental scans to read data that was added since some
>> snapshot. This isn't exposed through Spark yet, but could be. I've
>> considered adding support for git-like `..` expressions: `SELECT * FROM
>> db.table.1234..5678`.
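>>
>> For reference, a rough sketch of that scan planning through the table API
>> (the snapshot IDs here are just placeholders):
>>
>> CloseableIterable<FileScanTask> tasks = table.newScan()
>>     .appendsBetween(1234L, 5678L)   // appends after 1234, up to and including 5678
>>     .planFiles();
>> for (FileScanTask task : tasks) {
>>   System.out.println(task.file().path());
>> }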
>>
>> One problem with this approach is that it is limited when it encounters
>> something other than an append. For example, Iceberg supports atomic
>> overwrites to rewrite data in a table. When the latest snapshot is an
>> overwrite, it isn't clear exactly what an incremental read should produce.
>> We're open to ideas here, like producing "delete" records as well as
>> "insert" records with an extra column for the operation. But this is
>> something we'd need to consider.
>>
>> I don't think Hudi has this problem because it only supports insert and
>> upsert, if I remember correctly.
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
