Re: Getting delta of data changes between 2 Snapshots

2019-07-29 Thread RD
Noted. I was calling it a snapshot since that's what the prototype constructs to pass down to the file-planning API, but that's just an implementation detail. We could add an appendsBetween(s1, s2) API, but I wanted to keep the original scan API separate from the incremental scan, as the original scan is
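To make the appendsBetween(s1, s2) idea concrete, here is a minimal sketch in plain Scala. It is a toy model, not Iceberg's API: the `Snapshot` case class, its fields, and the `appendsBetween` helper are all hypothetical names invented for illustration. It assumes every snapshot records its parent, its operation, and the files it added, and that an incremental scan is only well defined when every snapshot between the two endpoints is an append.

```scala
// Toy model only: these types are hypothetical and are NOT Iceberg classes.
final case class Snapshot(
    id: Long,
    parentId: Option[Long],   // None for the first snapshot of the table
    operation: String,        // assumed values: "append", "overwrite", "delete"
    addedFiles: Seq[String])

object IncrementalScanSketch {

  /** Files appended after `fromId` (exclusive) up to `toId` (inclusive). */
  def appendsBetween(
      snapshots: Map[Long, Snapshot],
      fromId: Long,
      toId: Long): Either[String, Seq[String]] = {

    // Walk the parent chain from `toId` back to `fromId`, oldest snapshot first.
    @annotation.tailrec
    def walk(current: Option[Long], acc: List[Snapshot]): Either[String, List[Snapshot]] =
      current match {
        case Some(id) if id == fromId => Right(acc)
        case Some(id) =>
          snapshots.get(id) match {
            case Some(s) => walk(s.parentId, s :: acc)
            case None    => Left(s"unknown snapshot $id")
          }
        case None => Left(s"snapshot $fromId is not an ancestor of $toId")
      }

    walk(Some(toId), Nil).flatMap { chain =>
      // Refuse to plan if any snapshot in the range is not a pure append.
      chain.find(_.operation != "append") match {
        case Some(s) => Left(s"snapshot ${s.id} is a ${s.operation}, not an append")
        case None    => Right(chain.flatMap(_.addedFiles))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val history = Seq(
      Snapshot(1L, None, "append", Seq("a.parquet")),
      Snapshot(2L, Some(1L), "append", Seq("b.parquet")),
      Snapshot(3L, Some(2L), "append", Seq("c.parquet", "d.parquet"))
    ).map(s => s.id -> s).toMap

    // Files a reader that already processed snapshot 1 would still need to read.
    println(appendsBetween(history, 1L, 3L))
  }
}
```

Returning an `Either` keeps the two failure cases (the start snapshot is not an ancestor of the end snapshot, or a non-append snapshot sits in the range) visible to the caller instead of silently returning a partial file list.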

Re: Getting delta of data changes between 2 Snapshots

2019-07-26 Thread Ryan Blue
Thanks for working on this! I think being able to plan an incremental scan is a good idea. But we should avoid calling the incremental data a “snapshot”. A snapshot is the table state at some point in time, and I think it would be confusing if we started adding new meanings.

Re: Getting delta of data changes between 2 Snapshots

2019-07-25 Thread RD
Thanks Ryan,
> Iceberg can give you the data files that were added or deleted in a snapshot, but there isn't a good way to take those and actually read them as a DataFrame or select that data from a table in SQL. I'd think that's a good first step
One approach which I'm currently prototyping

Re: Getting delta of data changes between 2 Snapshots(Internet mail)

2019-07-17 Thread 程力
Like having a system table to store the in-use snapshot? Isn’t the incremental processing much like incremental pulling in Hudi? -Li
From: Ryan Blue
Reply-To: "dev@iceberg.apache.org", "rb...@netflix.com"
Date: Thursday, July 18, 2019, 3:55 AM
To: RD
Cc: Iceberg Dev List
Subject: Re: Getting d

Re: Getting delta of data changes between 2 Snapshots

2019-07-17 Thread Ryan Blue
I think it would be helpful to have a pattern for incremental processing. Iceberg can give you the data files that were added or deleted in a snapshot, but there isn't a good way to take those and actually read them as a DataFrame or select that data from a table in SQL. I'd think that's a good fir

Re: Getting delta of data changes between 2 Snapshots

2019-07-17 Thread RD
Hi Iceberg devs, We are starting work on a somewhat similar project. The idea is that users can ask for incremental data since the last snapshot they processed, i.e., the delta that was added since the last snapshot. Do you think this can be a general feature that can be benefici

Re: Getting delta of data changes between 2 Snapshots

2019-07-17 Thread Ryan Blue
You can do this using time-travel. First, read the table at each snapshot. This creates a temporary view for each snapshot:
// create temp views for each snapshot
spark.read.format("iceberg").option("snapshot-id", 8924558786060583479L).load("db.table").createOrReplaceTempView("s1")
spark.read.

Getting delta of data changes between 2 Snapshots

2019-07-17 Thread aa bb
Hi, could you please advise how we can get the delta of data changes (diff) between 2 snapshots? Is there any way to provide 2 snapshot IDs (8924558786060583479, 6536733823181975045) and get the records that were added after 8924558786060583479?