Ryan, I agree with you about that part. Could you please clarify (1) and (2)?
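For reference, here is a minimal sketch of what we have in mind for (2), assuming an unpartitioned table loaded through the Hive catalog. The catalog name, table identifier, GCS path, and the gcsFiles list of (path, size, record count) entries are placeholders from our job, not anything prescribed by Iceberg:

import java.util.Collections

import org.apache.iceberg.{DataFiles, Table}
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hive.HiveCatalog
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Load the table through the Hive metastore so commits stay atomic even
// though GCS has no atomic rename.
val catalog = new HiveCatalog()
catalog.setConf(spark.sparkContext.hadoopConfiguration)
catalog.initialize("hive", Collections.emptyMap[String, String]())
val table: Table = catalog.loadTable(TableIdentifier.of("db", "table"))

// Placeholder: (path, file size in bytes, record count) for each file the
// job already wrote to GCS.
val gcsFiles = Seq(("gs://bucket/db/table/data/part-00000.parquet", 1024L, 100L))

// Register the pre-written files in one commit, so readers see either all of
// the new files or none of them. (A partitioned table would also need
// withPartitionPath on each builder.)
val append = table.newAppend()
gcsFiles.foreach { case (path, size, count) =>
  append.appendFile(
    DataFiles.builder(table.spec())
      .withPath(path)
      .withFileSizeInBytes(size)
      .withRecordCount(count)
      .withFormat("parquet")
      .build())
}
append.commit()

This way the downstream service still gets the exact GCS paths from gcsFiles, while Iceberg tracks the same files for queries.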
Sent from my iPhone

> On Feb 10, 2021, at 8:55 AM, Ryan Blue <rb...@netflix.com> wrote:
>
> You should always write to Iceberg using the supplied integration. That way
> your data has metrics to make your queries faster and schemas are written
> correctly.
>
>> On Tue, Feb 9, 2021 at 6:24 PM kkishore iiith <kkishore.ii...@gmail.com> wrote:
>> Ryan,
>>
>> It would be nice to include that on the Iceberg website, as the feature
>> seems like a common ask.
>>
>> Our Spark job needs to return the GCS filenames because the downstream
>> service loads these GCS files into BigQuery.
>>
>> So we have two options here; could you please clarify both?
>>
>> (1) We write to Iceberg via the Hive metastore, since GCS doesn't support
>> atomic renames. But can we write to Iceberg via Hive using pre-known
>> filenames? All the examples I see use the Hive table path.
>>
>> (2) After we write to GCS, we can use the Iceberg table API to add the
>> files that we wrote in a transaction, as follows. But is the comment about
>> atomicity in the previous response also applicable via this route, i.e.,
>> losing snapshots across multiple parallel commits?
>>
>> appendFiles.appendFile(DataFiles.builder(partitionSpec)
>>     .withPath(..)
>>     .withFileSizeInBytes(..)
>>     .withRecordCount(..)
>>     .withFormat(..)
>>     .build());
>> appendFiles.commit();
>>
>>> On Tue, Feb 9, 2021 at 6:05 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>> Sorry, I was mistaken about this. We have exposed the incremental read
>>> functionality using DataFrame options.
>>>
>>> You should be able to use it like this:
>>>
>>> val df = spark.read.format("iceberg")
>>>   .option("start-snapshot-id", lastSnapshotId)
>>>   .load("db.table")
>>>
>>> I hope that helps!
>>>
>>> rb
>>>
>>>> On Tue, Feb 9, 2021 at 5:57 PM Ryan Blue <rb...@netflix.com> wrote:
>>>> Replies inline.
>>>>
>>>>> On Tue, Feb 9, 2021 at 5:36 PM kkishore iiith <kkishore.ii...@gmail.com> wrote:
>>>>> If a file system does not support atomic renames, then you should use a
>>>>> metastore to track tables. You can use Hive, Nessie, or Glue. We are also
>>>>> working on a JDBC catalog.
>>>>>
>>>>> 1. What would go wrong if I write directly to GCS from Spark via Iceberg?
>>>>> Would we end up having data in GCS but missing the Iceberg metadata for
>>>>> these files? Or would we just lose some snapshots during multiple
>>>>> parallel transactions?
>>>>
>>>> If you choose not to use a metastore and use a "Hadoop" table instead,
>>>> then there isn't a guarantee that concurrent writers won't clobber each
>>>> other. You'll probably lose some commits when two writers commit at the
>>>> same time with the same base version.
>>>>
>>>>> Iceberg's API can tell you what files were added or removed in any given
>>>>> snapshot. You can also use time travel to query the table at a given
>>>>> snapshot and use SQL to find the row-level changes. We don't currently
>>>>> support reading just the changes in a snapshot because there may be
>>>>> deletes as well as inserts.
>>>>>
>>>>> 2. I would like to further clarify whether Iceberg supports an incremental
>>>>> query like https://hudi.apache.org/docs/querying_data.html#spark-incr-query.
>>>>> https://medium.com/adobetech/iceberg-at-adobe-88cf1950e866 talks about
>>>>> incremental reads to query data between snapshots.
>>>>> But I am confused by the above response and
>>>>> http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3ca237bb81-f4da-45d9-9827-36203624f...@tencent.com%3E
>>>>> where you said that incremental queries are not supported natively.
>>>>> If the latter is the only way to derive incremental data, does Iceberg
>>>>> use predicate pushdown to get the incremental data based on the file
>>>>> delta, since Iceberg's metadata contains file info for both snapshots?
>>>>
>>>> Iceberg can plan incremental scans to read data that was added since some
>>>> snapshot. This isn't exposed through Spark yet, but could be. I've
>>>> considered adding support for git-like `..` expressions: `SELECT * FROM
>>>> db.table.1234..5678`.
>>>>
>>>> One problem with this approach is that it is limited when it encounters
>>>> something other than an append. For example, Iceberg supports atomic
>>>> overwrites to rewrite data in a table. When the latest snapshot is an
>>>> overwrite, it isn't clear exactly what an incremental read should produce.
>>>> We're open to ideas here, like producing "delete" records as well as
>>>> "insert" records with an extra column for the operation. But this is
>>>> something we'd need to consider.
>>>>
>>>> I don't think Hudi has this problem because it only supports insert and
>>>> upsert, if I remember correctly.
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix
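For the incremental scans mentioned above, the Table API route we are looking at is roughly this (a sketch: table is the same handle as in the snippet at the top of this message, lastSnapshotId is an id our job would record after each run, and we have not verified how overwrite snapshots behave here):

import scala.collection.JavaConverters._

// Snapshot id recorded after the previous run (placeholder).
val lastSnapshotId: Long = ???
val currentSnapshotId = table.currentSnapshot().snapshotId()

// Plan an append-only incremental scan between the two snapshots and collect
// the paths of the data files added in that range.
val addedPaths = table.newScan()
  .appendsBetween(lastSnapshotId, currentSnapshotId)
  .planFiles()
  .asScala
  .map(task => task.file().path().toString)
  .toSeq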