prasannarajaperumal commented on PR #5885:
URL: https://github.com/apache/hudi/pull/5885#issuecomment-1186159978

   Hey @YannByron ,
   
   Thanks for this PR and a well-written 
[RFC-51](https://github.com/apache/hudi/blob/master/rfc/rfc-51/rfc-51.md).
   Overall, I agree with the high-level direction. I will do the code review 
soon, but I have a question first. 
   
   Should we introduce a new concept (CDC) here on Hudi tables? I think this 
should be a sub-mode of Incremental Query. 
   For illustration, suppose we had something like the following modes for 
incremental query (change stream):
   - LATEST_STATE_INSERT_DELETE_KEYS (entire row state for all inserted keys 
and empty delete keys?)
   - LATEST_STATE_ONLY_INSERT_KEYS
   - MIN_STATE_CHANGE_INSERT_DELETE_KEYS (only the columns that changed; 
consolidate multiple inserts and deletes, and drop data that was both inserted 
and deleted within the time range)
   - ALL_STATE_CHANGES_INSERT_DELETE_KEYS (include every single change made to 
the key)
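
   To make the intended difference between these modes concrete, here is a 
minimal sketch in Python. Everything in it is hypothetical: `ChangeStreamMode`, 
`Change`, and `read_changes` are illustrative names, not Hudi APIs, and the 
sketch assumes each change in the queried range is tagged as an insert, update, 
or delete. It returns full rows and does not attempt the column-level diff that 
MIN_STATE_CHANGE_INSERT_DELETE_KEYS would also need.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class ChangeStreamMode(Enum):
    """Hypothetical sub-modes of incremental query, named after the list above."""
    LATEST_STATE_INSERT_DELETE_KEYS = 1
    LATEST_STATE_ONLY_INSERT_KEYS = 2
    MIN_STATE_CHANGE_INSERT_DELETE_KEYS = 3
    ALL_STATE_CHANGES_INSERT_DELETE_KEYS = 4


@dataclass(frozen=True)
class Change:
    key: str
    op: str                     # "insert", "update", or "delete" (assumed tags)
    row: Optional[dict] = None  # full row after the change; None for deletes


def read_changes(changes: List[Change], mode: ChangeStreamMode) -> List[Change]:
    """Filter changes (ordered by commit time within the range) for a mode."""
    if mode is ChangeStreamMode.ALL_STATE_CHANGES_INSERT_DELETE_KEYS:
        return list(changes)  # every single change made to every key

    first_op: Dict[str, str] = {}
    latest: Dict[str, Change] = {}
    for change in changes:
        first_op.setdefault(change.key, change.op)
        latest[change.key] = change

    result: List[Change] = []
    for key, change in latest.items():
        deleted = change.op == "delete"
        created_in_range = first_op[key] == "insert"
        if mode is ChangeStreamMode.LATEST_STATE_ONLY_INSERT_KEYS:
            if deleted:
                continue  # this mode does not report deleted keys
        elif mode is ChangeStreamMode.MIN_STATE_CHANGE_INSERT_DELETE_KEYS:
            if deleted and created_in_range:
                continue  # inserted and deleted within the range: net no-op
            if created_in_range:
                # Consolidate insert + later updates into a single insert.
                change = Change(key, "insert", change.row)
        result.append(change)
    return result
```

   For example, with an insert and update on key `a`, an insert and delete on 
key `b`, and a delete on a pre-existing key `c`, the consolidated mode in this 
sketch would report only `a` (as one insert) and the delete of `c`.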
   
   I think read-schema changes for the CDC style incremental queries could be a 
challenge. 
   
   The reasons I am thinking of converging incremental queries with RFC-51: 
   - It removes the limitation of tracking deletes across compaction boundaries 
for incremental queries
   - I think it makes sense to track, by default for all Hudi tables, the data 
we track when "cdc.supplemental.logging=false". Storing this data efficiently 
for point lookups should help with record merging as well, I suppose?
   
   @YannByron What do you think? (cc @vinothchandar )
   
   Cheers
   Prasanna
   


