Hi Filip

We (Alibaba & Tencent) are doing a POC of Apache Iceberg row-level updates/deletes. Syncing a change log (such as a row-level binlog) into the Iceberg data lake is the classic case we are trying to implement (another classic case would be a streaming or batch job with one or more UPDATE SQL statements). We currently prefer the <Lazy with NRI> solution discussed in the update/delete doc [1] when the change log table defines a unique key, because we can simply append the unique keys into differential files; with the <Lazy with SRI> solution we would need to read the row identifiers for each change log entry, which would limit the ingest throughput a lot (a bad experience for a data lake, I think). A rough sketch of the append-only ingest path we have in mind follows below.
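
Just to make the preference concrete, here is a minimal sketch of what the <Lazy with NRI> ingest path could look like when the change log table defines a unique key. All the names here (ChangeLogEntry, DifferentialFileWriter, DataFileWriter) are hypothetical placeholders, not Iceberg APIs; the point is only that every change log entry is handled as a pure append, with no read of existing row identifiers:

```java
import java.util.List;

public class NriChangeLogIngest {

  enum Op { INSERT, UPDATE, DELETE }

  // One binlog-style change log entry keyed by the table's unique key.
  record ChangeLogEntry(Op op, String uniqueKey, List<Object> row) {}

  // Hypothetical writers: both are append-only.
  interface DifferentialFileWriter { void appendDeletedKey(String uniqueKey); }
  interface DataFileWriter { void appendRow(List<Object> row); }

  private final DifferentialFileWriter diffWriter;
  private final DataFileWriter dataWriter;

  NriChangeLogIngest(DifferentialFileWriter diffWriter, DataFileWriter dataWriter) {
    this.diffWriter = diffWriter;
    this.dataWriter = dataWriter;
  }

  // Every operation is a pure append; no existing row identifier is ever read,
  // so ingest throughput is bounded only by write speed.
  void apply(ChangeLogEntry entry) {
    switch (entry.op()) {
      case INSERT -> dataWriter.appendRow(entry.row());
      case DELETE -> diffWriter.appendDeletedKey(entry.uniqueKey());
      case UPDATE -> {
        diffWriter.appendDeletedKey(entry.uniqueKey()); // mask the old version by key
        dataWriter.appendRow(entry.row());              // append the new version
      }
    }
  }
}
```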
Of course, some tables may not have a unique key. For those, I think Iceberg users could still choose <Lazy with SRI>: the throughput would be limited but it would work (the compute engine, Spark or Flink, may need to shuffle the change logs into batches per Iceberg partition so that we can look up the rowIds in batch and then write the differential files... it's a draft :-) ); see the sketch below. In general, we plan to focus on the case where the table has a unique key.
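
Here is a very rough sketch of that batching idea for <Lazy with SRI>: group (in the real job, shuffle) the change log entries by Iceberg partition, resolve the row identifiers with one batched lookup per partition instead of one lookup per entry, and then append them into the differential files. Again, all the names (RowIdResolver, DifferentialFileWriter, etc.) are hypothetical, and the actual shuffle would be done by the Spark/Flink job rather than this in-memory grouping:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SriBatchedIngest {

  // A change log entry for a table without a unique key: we only know the
  // partition it belongs to and the old row image that has to be masked.
  record ChangeLogEntry(String partitionKey, List<Object> oldRow) {}

  // Hypothetical batched lookup: one scan per partition resolves the
  // synthetic row identifiers for a whole batch of old rows.
  interface RowIdResolver {
    Map<ChangeLogEntry, Long> resolveBatch(String partitionKey, List<ChangeLogEntry> batch);
  }

  interface DifferentialFileWriter { void appendDeletedRowId(long rowId); }

  void apply(List<ChangeLogEntry> entries, RowIdResolver resolver, DifferentialFileWriter writer) {
    // Group (in the real job: shuffle) the change log by Iceberg partition first ...
    Map<String, List<ChangeLogEntry>> byPartition =
        entries.stream().collect(Collectors.groupingBy(ChangeLogEntry::partitionKey));

    // ... then pay the row-identifier lookup cost once per partition batch and
    // append the resolved row ids into the differential file.
    byPartition.forEach((partition, batch) ->
        resolver.resolveBatch(partition, batch).values()
            .forEach(writer::appendDeletedRowId));
  }
}
```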

[1] https://docs.google.com/document/d/1Pk34C3diOfVCRc-sfxfhXZfzvxwum1Odo-6Jj9mwK38/edit?usp=sharing
