Ryan, I agree with you about that part. Could you please clarify (1) and (2)?
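For reference, here is a minimal sketch of what we have in mind for (2), assuming an unpartitioned table loaded through the Hive catalog. The catalog name, table identifier, GCS path, and the gcsFiles list of (path, size, record count) entries are placeholders from our job, not anything prescribed by Iceberg:

import java.util.Collections

import org.apache.iceberg.{DataFiles, Table}
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hive.HiveCatalog
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Load the table through the Hive metastore so commits stay atomic even
// though GCS has no atomic rename.
val catalog = new HiveCatalog()
catalog.setConf(spark.sparkContext.hadoopConfiguration)
catalog.initialize("hive", Collections.emptyMap[String, String]())
val table: Table = catalog.loadTable(TableIdentifier.of("db", "table"))

// Placeholder: (path, file size in bytes, record count) for each file the
// job already wrote to GCS.
val gcsFiles = Seq(("gs://bucket/db/table/data/part-00000.parquet", 1024L, 100L))

// Register the pre-written files in one commit, so readers see either all of
// the new files or none of them. (A partitioned table would also need
// withPartitionPath on each builder.)
val append = table.newAppend()
gcsFiles.foreach { case (path, size, count) =>
  append.appendFile(
    DataFiles.builder(table.spec())
      .withPath(path)
      .withFileSizeInBytes(size)
      .withRecordCount(count)
      .withFormat("parquet")
      .build())
}
append.commit()

This way the downstream service still gets the exact GCS paths from gcsFiles, while Iceberg tracks the same files for queries.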
Sent from my iPhone

> On Feb 10, 2021, at 8:55 AM, Ryan Blue <rb...@netflix.com> wrote:
>
> You should always write to Iceberg using the supplied integration. That way
> your data has metrics to make your queries faster and schemas are written
> correctly.
>
>> On Tue, Feb 9, 2021 at 6:24 PM kkishore iiith <kkishore.ii...@gmail.com> wrote:
>> Ryan,
>>
>> It would be nice to include that on the Iceberg website, as the feature
>> seems like a common ask.
>>
>> Our Spark job needs to return the GCS filenames because the downstream
>> service loads these GCS files into BigQuery.
>>
>> So we have two options here; could you please clarify both?
>>
>> (1) We write to Iceberg via the Hive metastore, since GCS doesn't support
>> atomic renames. But can we write to Iceberg via Hive using pre-known
>> filenames? All the examples I see use the Hive table path.
>>
>> (2) After we write to GCS, we can use the Iceberg table API to add the
>> files that we wrote in a transaction, as follows. But is the comment about
>> atomicity in the previous response also applicable via this route, i.e.,
>> losing snapshots across multiple parallel commits?
>>
>> appendFiles.appendFile(DataFiles.builder(partitionSpec)
>>     .withPath(..)
>>     .withFileSizeInBytes(..)
>>     .withRecordCount(..)
>>     .withFormat(..)
>>     .build());
>> appendFiles.commit();
>>
>>> On Tue, Feb 9, 2021 at 6:05 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>> Sorry, I was mistaken about this. We have exposed the incremental read
>>> functionality using DataFrame options.
>>>
>>> You should be able to use it like this:
>>>
>>> val df = spark.read.format("iceberg")
>>>   .option("start-snapshot-id", lastSnapshotId)
>>>   .load("db.table")
>>>
>>> I hope that helps!
>>>
>>> rb
>>>
>>>> On Tue, Feb 9, 2021 at 5:57 PM Ryan Blue <rb...@netflix.com> wrote:
>>>> Replies inline.
>>>>
>>>>> On Tue, Feb 9, 2021 at 5:36 PM kkishore iiith <kkishore.ii...@gmail.com> wrote:
>>>>> If a file system does not support atomic renames, then you should use a
>>>>> metastore to track tables. You can use Hive, Nessie, or Glue. We are also
>>>>> working on a JDBC catalog.
>>>>>
>>>>> 1. What would go wrong if I write directly to GCS from Spark via Iceberg?
>>>>> Would we end up having data in GCS but missing the Iceberg metadata for
>>>>> these files? Or would we just lose some snapshots during multiple
>>>>> parallel transactions?
>>>>
>>>> If you choose not to use a metastore and use a "Hadoop" table instead,
>>>> then there isn't a guarantee that concurrent writers won't clobber each
>>>> other. You'll probably lose some commits when two writers commit at the
>>>> same time with the same base version.
>>>>
>>>>> Iceberg's API can tell you what files were added or removed in any given
>>>>> snapshot. You can also use time travel to query the table at a given
>>>>> snapshot and use SQL to find the row-level changes. We don't currently
>>>>> support reading just the changes in a snapshot because there may be
>>>>> deletes as well as inserts.
>>>>>
>>>>> 2. I would like to further clarify whether Iceberg supports an incremental
>>>>> query like https://hudi.apache.org/docs/querying_data.html#spark-incr-query.
>>>>> https://medium.com/adobetech/iceberg-at-adobe-88cf1950e866 talks about
>>>>> incremental reads to query data between snapshots.
>>>>> But I am confused by the above response and
>>>>> http://mail-archives.apache.org/mod_mbox/iceberg-dev/201907.mbox/%3ca237bb81-f4da-45d9-9827-36203624f...@tencent.com%3E
>>>>> where you said that incremental queries are not supported natively.
>>>>> If the latter is the only way to derive incremental data, does Iceberg
>>>>> use predicate pushdown to get the incremental data based on the file
>>>>> delta, since Iceberg's metadata contains file info for both snapshots?
>>>>
>>>> Iceberg can plan incremental scans to read data that was added since some
>>>> snapshot. This isn't exposed through Spark yet, but could be. I've
>>>> considered adding support for git-like `..` expressions: `SELECT * FROM
>>>> db.table.1234..5678`.
>>>>
>>>> One problem with this approach is that it is limited when it encounters
>>>> something other than an append. For example, Iceberg supports atomic
>>>> overwrites to rewrite data in a table. When the latest snapshot is an
>>>> overwrite, it isn't clear exactly what an incremental read should produce.
>>>> We're open to ideas here, like producing "delete" records as well as
>>>> "insert" records with an extra column for the operation. But this is
>>>> something we'd need to consider.
>>>>
>>>> I don't think Hudi has this problem because it only supports insert and
>>>> upsert, if I remember correctly.
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix
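For the incremental scans mentioned above, the Table API route we are looking at is roughly this (a sketch: table is the same handle as in the snippet at the top of this message, lastSnapshotId is an id our job would record after each run, and we have not verified how overwrite snapshots behave here):

import scala.collection.JavaConverters._

// Snapshot id recorded after the previous run (placeholder).
val lastSnapshotId: Long = ???
val currentSnapshotId = table.currentSnapshot().snapshotId()

// Plan an append-only incremental scan between the two snapshots and collect
// the paths of the data files added in that range.
val addedPaths = table.newScan()
  .appendsBetween(lastSnapshotId, currentSnapshotId)
  .planFiles()
  .asScala
  .map(task => task.file().path().toString)
  .toSeq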