Re: [PR] [HUDI-9526] Use HoodieFileGroupReader throughout the CDC flow [hudi]

via GitHub Sun, 06 Jul 2025 19:17:27 -0700


the-other-tim-brown commented on code in PR #13444:
URL: https://github.com/apache/hudi/pull/13444#discussion_r2188837358



##########
hudi-common/src/main/java/org/apache/hudi/common/table/read/FileGroupRecordBuffer.java:
##########
@@ -565,27 +570,62 @@ protected boolean hasNextBaseRecord(T baseRecord, BufferedRecord<T> logRecordInf
       Pair<Boolean, T> isDeleteAndRecord = merge(baseRecordInfo, logRecordInfo);
       if (!isDeleteAndRecord.getLeft()) {
         // Updates
-        nextRecord = readerContext.seal(isDeleteAndRecord.getRight());
+        nextRecord = readerContext.seal(applyOutputSchemaConversion(isDeleteAndRecord.getRight()));

Review Comment:
   > > The CDC logic is not really part of merging, why should they be coupled?
   > 
   > To make the logic in file group reader buffer clean and more maintainable.
   > 
   That is a fine goal, but we would then need the logic in two places, one in the merger and one in the file group reader buffer, because of the point I have already raised.
   
   > > Please also note that there will be outputs even when there is no merging in the case of log files with entries that are not in the base files.
   > 
   > That's why I said `BufferedRecordMerger#finalMerge` instead of the other two APIs.

   `finalMerge` is never called in this case. It is only used when a record in the base file is merged with records coming from log files.
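
To illustrate the point, here is a minimal sketch using hypothetical, simplified names (`FinalMergeSketch`, `read`, and a toy `finalMerge` over string values); it is not the actual `FileGroupRecordBuffer` or `BufferedRecordMerger` code. It only models the behavior described above: the final merge step runs when a base-file record has a matching log record, while log-only records are emitted without it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a file group read; names and types are assumptions.
class FinalMergeSketch {
    // Counts invocations of the stand-in for the final merge step.
    static int finalMergeCalls = 0;

    // Toy merge: the log value simply replaces the base value.
    static String finalMerge(String baseValue, String logValue) {
        finalMergeCalls++;
        return logValue;
    }

    static List<String> read(Map<String, String> baseFile, Map<String, String> logRecords) {
        Map<String, String> pending = new LinkedHashMap<>(logRecords);
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, String> base : baseFile.entrySet()) {
            String logValue = pending.remove(base.getKey());
            if (logValue == null) {
                // Base record with no log entry: emitted as-is, no merge.
                output.add(base.getKey() + "=" + base.getValue());
            } else {
                // Base record with a matching log entry: the only path that merges.
                output.add(base.getKey() + "=" + finalMerge(base.getValue(), logValue));
            }
        }
        // Log entries whose key is absent from the base file: emitted without finalMerge.
        for (Map.Entry<String, String> log : pending.entrySet()) {
            output.add(log.getKey() + "=" + log.getValue());
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, String> base = new LinkedHashMap<>();
        base.put("k1", "v1");
        Map<String, String> log = new LinkedHashMap<>();
        log.put("k1", "v1b"); // update to an existing base record
        log.put("k2", "v2");  // key present only in the log
        System.out.println(read(base, log));
        System.out.println("finalMerge calls: " + finalMergeCalls);
    }
}
```

In this toy model, any per-record output handling placed only inside `finalMerge` would be skipped for `k2`, which is why the same logic would also be needed in the buffer.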
   
   
   > In any case, please stop introducing row-level ramifications continuously if the `cdc` logging is deterministic per-query.

   Can you explain what you mean by this, please?
   



