Re: [PROPOSAL] Preparing first Apache Iceberg Summit

2023-09-22 Thread Jean-Baptiste Onofré
Hi guys, Finally (sorry for the long wait :)), a first formal Iceberg Summit proposal doc is ready to be populated/reviewed: https://docs.google.com/document/d/1Uy9-qRxLtjMWJkRXsjj94Vq3VO1Mc0wGz_bnisevNh8/edit?usp=sharing Anyone can edit the document, so feel free to complete or ask questions vi

Re: [PROPOSAL] Preparing first Apache Iceberg Summit

2023-09-22 Thread Ryan Blue
To me, this proposal is getting a bit ahead of where I'm comfortable. I was expecting this to address some of the big questions about how to run an event like this from an open source community, but it seems to be assuming that the event will happen and addresses logistics. Here's an example of wh

Re: [PROPOSAL] Preparing first Apache Iceberg Summit

2023-09-22 Thread Owen O'Malley
It is also important to consider who is on the program committee and their affiliations. It also helps if the PC discourages sales talks (especially ones with proprietary extensions!). They should encourage technical ones about development and usage of the Apache project. .. Owen On Sep 22, 2023, at 11:19,

Guidance on implementing Hybrid CDC Pattern from "CDC patterns in Apache Iceberg" talk

2023-09-22 Thread Nick Del Nano
Hi, I am exploring implementing the Hybrid CDC Pattern explained at 29:26 in Ryan Blue's talk "CDC patterns in Apache Iceberg". The use case is: 1. Stream CDC logs to

Re: Guidance on implementing Hybrid CDC Pattern from "CDC patterns in Apache Iceberg" talk

2023-09-22 Thread Samarth Jain
> What is the recommendation for storing the latest snapshot ID that is successfully merged into *table*? Ideally this is committed in the same transaction as the MERGE so that reprocessing is minimized. Does Iceberg support storing this as table metadata? I do not see any related information in th

Re: Guidance on implementing Hybrid CDC Pattern from "CDC patterns in Apache Iceberg" talk

2023-09-22 Thread Ryan Blue
Nick, We store the latest snapshot ID that has been processed in the change log table in the snapshot summary metadata of the downstream table. Then when we go to run an update, we pick up the starting point from the last snapshot in history and use that as the start-snapshot-id in the DataFrame r
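
The lookup described above (find the last processed changelog snapshot ID in the downstream table's snapshot summaries, then use it as the starting point of the next incremental read) could be sketched roughly as follows. This is only an illustration in plain Python over dictionaries, not the code used by anyone in this thread: the summary key `changelog-last-snapshot-id`, the shape of `history`, and both function names are assumptions. `start-snapshot-id` and `end-snapshot-id` are the Iceberg Spark incremental read options; how a first run with no recorded watermark should be handled is left open here.

```python
from typing import Dict, List, Optional

# Hypothetical summary key: the thread does not say which key is actually used.
WATERMARK_KEY = "changelog-last-snapshot-id"


def last_processed_snapshot_id(history: List[Dict]) -> Optional[int]:
    """Return the newest changelog snapshot ID recorded in the downstream history.

    `history` is assumed to be the downstream table's snapshots, oldest first,
    each a dict with a "summary" dict of string values.
    """
    for snapshot in reversed(history):
        value = snapshot.get("summary", {}).get(WATERMARK_KEY)
        if value is not None:
            return int(value)
    return None


def plan_incremental_read(history: List[Dict], changelog_current_id: int) -> Optional[Dict[str, str]]:
    """Build read options for the next run, or None if there is nothing new."""
    start = last_processed_snapshot_id(history)
    if start == changelog_current_id:
        return None  # downstream table is already caught up
    if start is None:
        return {}  # assumption: no watermark yet, so the caller decides how to bootstrap
    return {
        "start-snapshot-id": str(start),               # exclusive lower bound
        "end-snapshot-id": str(changelog_current_id),  # inclusive upper bound
    }
```

Because the watermark lives in the summary of the same snapshot that the MERGE produces, it is committed atomically with the data, so a failed run leaves the previous watermark in place and the next run simply re-reads from it.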

Re: Guidance on implementing Hybrid CDC Pattern from "CDC patterns in Apache Iceberg" talk

2023-09-22 Thread Alex Reid
Hi Nick, Ryan's suggestions are what I'd recommend as well. Another option would be to store the snapshot-id as a table property in the target table (though, you introduce the possibility of being out of sync since this will be 2 separate commits, in case your job fails in between or the 2nd commi