Hi everyone,

Here is the Change Data Capture update. I posted a draft PR (https://github.com/apache/iceberg/pull/4539) two weeks ago and got lots of reviews. Thank you all for reviewing it. Based on the feedback, we will move forward with this approach and open separate formal PRs. We are also planning a meeting to share the general idea of the approach and the next steps. Looking forward to seeing you there. Here is the meeting info.

Change Data Capture for Iceberg
Friday, April 29 · 9:00 – 10:00am
Google Meet joining info
Video call link: https://meet.google.com/pjv-cspg-xos

Best,

Yufei

`This is not a contribution`

On Tue, Mar 29, 2022 at 4:33 PM Yufei Gu <flyrain...@gmail.com> wrote:

> Synced up with Anton and Russell on the CDC design and implementation.
> Here are the changes to get deleted rows in the MVP.
>
> We will leverage the `_deleted` metadata column for both pos deletes and
> eq deletes. This eliminates limitations of the original design. In
> particular, instead of emitting equality deletes directly as CDC deleted
> rows, we resolve the eq deletes to the actual deleted rows and emit those
> as CDC delete rows. For example, an eq delete may delete two data rows. We
> will emit the 2 actual deleted rows.
>
> We change the design so that we emit all deleted (pos and eq) rows
> together in the same format. This is simpler and more efficient than the
> original design.
> 1. We don't have to output identifier fields.
> 2. Downstream tables can write CDC deleted rows directly as eq deletes
>    without using "merge".
> 3. It is easier to reconstruct the update in phase 2.
>
> The downside is that it is expensive for certain use cases. For example,
> it has to scan all data files to resolve global eq deletes. We can try to
> solve this by providing an option to emit eq delete rows directly in the
> future. Please refer to
> https://github.com/apache/iceberg/issues/3941#issuecomment-1081273709 for
> more details.
>
> Let us know if you have any feedback. Thanks.
>
> Yufei
>
>
> On Wed, Mar 9, 2022 at 9:59 AM Yufei Gu <flyrain...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> Thanks for joining the discussion in the sync-up last Friday. We
>> reached consensus on several items:
>>
>> 1. Snapshot-granularity CDC generation is useful and will cover a wide
>>    range of use cases. Sub-snapshot granularity is out of scope at this
>>    moment and needs a separate proposal.
>>
>> 2.
>>    For COW, we should treat all rows from the deleted data files as the
>>    deleted rows, which is more efficient and, more importantly, doesn’t
>>    yield wrong results when duplicate rows exist.
>>
>> 3. Create a minimum viable product (MVP) according to the current
>>    design.
>>
>> Thanks Anton for the comments in
>> https://github.com/apache/iceberg/issues/3941#issuecomment-1061153554.
>>
>> With the meetup and Anton's comment, here is the plan to move forward. We
>> split the implementation into two phases. The minimum viable product
>> (MVP) in phase 1 will have most things from the proposal, with the
>> following adjustments.
>>
>> *Phase 1 (MVP)*
>>
>> 1. Emit delete and insert CDC records only.
>>
>> 2. Don’t join for equality deletes. Emit equality deletes directly as
>>    deleted rows, per Anton’s suggestion. Otherwise, we need to join the
>>    whole table with the equality delete files, which is not scalable. We
>>    will evaluate the cost of the join in phase 2 and probably support
>>    it, or find another way to approach it.
>>
>> 3. COW: output all rows in the deleted data files as deleted rows, and
>>    all rows in the added data files as inserted rows. We will figure out
>>    a more scalable way to filter out unchanged rows in phase 2. The
>>    approach of joining on all columns has two issues:
>>
>>    1. It is not scalable; think about a table with more than 100
>>       columns.
>>
>>    2. It cannot handle duplicate records, e.g. (1, Amy, 20) was in the
>>       data files marked as deleted, then we got new data files with two
>>       identical rows (1, Amy, 20) and (1, Amy, 20).
>>
>> 4. User interface: create an action to generate CDC records instead of
>>    a procedure. An action can return a DataFrame, which is more
>>    convenient than the array of InternalRow produced by a Spark
>>    procedure.
>>
>> *Phase 2*
>>
>> 1. Enable update reconstruction to emit CDC update records.
>>
>> 2. COW: filter out unchanged rows.
>>
>> 3.
>>    User interface: support the metatable, which will enable more use
>>    cases, e.g., the streaming use case.
>>
>> Best,
>>
>> Yufei
>>
>> `This is not a contribution`
>>
>>
>> On Mon, Mar 7, 2022 at 1:30 PM Anton Okolnychyi
>> <aokolnyc...@apple.com.invalid> wrote:
>>
>>> Hey folks,
>>>
>>> Based on Yufei’s design doc and what we discussed during the sync, I
>>> shared my thoughts on what can be efficiently supported right now.
>>>
>>> https://github.com/apache/iceberg/issues/3941#issuecomment-1061153554
>>>
>>> I’d be interested to learn more about specific use cases that would
>>> violate the assumptions I listed in my comment. If you have such a use
>>> case in mind, please comment on the issue.
>>>
>>> - Anton
>>>
>>>
>>> On 24 Feb 2022, at 14:57, Yufei Gu <flyrain...@gmail.com> wrote:
>>>
>>> Hi everyone,
>>>
>>> We are moving the CDC design discussion to next Friday (Mar 4), 9-10am
>>> PST, due to an unexpected event. The meeting link will be the same,
>>> meet.google.com/vam-cmfx-feo. Thanks!
>>>
>>> Best,
>>>
>>> Yufei
>>>
>>>
>>> On Tue, Feb 22, 2022 at 12:25 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> It's great to see a lot of interest in the design.
>>>> We are planning to have a meeting to discuss the Iceberg CDC design on
>>>> Friday (2/25), 9-10am PST. The meeting link is
>>>> meet.google.com/vam-cmfx-feo. We will talk about the general idea as
>>>> well as open questions. The meeting will be recorded.
>>>>
>>>> Best,
>>>> Yufei
>>>>
>>>>
>>>> On Fri, Feb 11, 2022 at 3:54 PM Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> Oh cool, I have not had a chance to review much of this, but I was
>>>>> having a conversation with a team which wanted similar features for a
>>>>> table, so I'm excited to see folks working on it 👍
>>>>>
>>>>> On Fri, Feb 11, 2022 at 12:40 PM Yufei Gu <flyrain...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi team,
>>>>>>
>>>>>> We propose a way to generate CDC records from Iceberg tables. It is
>>>>>> an approach that requires no table spec change and no write-time
>>>>>> logging. It will cover the majority of CDC use cases, though not
>>>>>> necessarily all of them. We believe it is a good starting point for
>>>>>> approaching CDC in Iceberg. Any feedback is welcome!
>>>>>>
>>>>>> https://docs.google.com/document/d/1bN6rdLNcYOHnT3xVBfB33BoiPO06aKBo56SZmuU9pnY/edit?usp=sharing
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Yufei
>>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
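
For anyone skimming the thread, two of the rules discussed above can be sketched in a few lines of plain Python: resolving an equality delete to the actual deleted rows (Mar 29 message), and the copy-on-write rule of treating every row in a removed data file as a delete and every row in an added data file as an insert (Mar 9 message). This is only an illustration under simplifying assumptions. It does not use any Iceberg or Spark API, rows are modelled as dicts, files as lists of rows, and the names `resolve_eq_deletes` and `cow_changes` are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def resolve_eq_deletes(data_rows: List[Row], eq_deletes: List[Row]) -> List[Row]:
    """Return the data rows matched by at least one equality delete.

    Each equality delete is a dict of column -> value; a data row matches
    when it agrees with the delete on every listed column. (Hypothetical
    in-memory model, not how Iceberg stores delete files.)
    """
    deleted = []
    for row in data_rows:
        for delete in eq_deletes:
            if all(col in row and row[col] == val for col, val in delete.items()):
                deleted.append(row)
                break  # emit each deleted row once, even if several deletes match
    return deleted


def cow_changes(
    removed_files: List[List[Row]], added_files: List[List[Row]]
) -> List[Tuple[str, Row]]:
    """Copy-on-write rule: all rows of removed files are deletes,
    all rows of added files are inserts. No join, so duplicates survive."""
    changes = [("DELETE", row) for f in removed_files for row in f]
    changes += [("INSERT", row) for f in added_files for row in f]
    return changes


if __name__ == "__main__":
    rows = [
        {"id": 1, "name": "Amy", "age": 20},
        {"id": 1, "name": "Amy", "age": 21},
        {"id": 2, "name": "Bob", "age": 30},
    ]
    # One equality delete on id = 1 resolves to two actual deleted rows.
    print(resolve_eq_deletes(rows, [{"id": 1}]))

    amy = {"id": 1, "name": "Amy", "age": 20}
    # The duplicate-row case from the thread: one removed row, two added copies.
    print(cow_changes([[amy]], [[amy, amy]]))
```

In the duplicate-row case the sketch emits one delete and two inserts without trying to decide which inserted copy is "unchanged", which is the ambiguity a join on all columns cannot resolve.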