Re: [PR] [HUDI-9526] Use HoodieFileGroupReader throughout the CDC flow [hudi]

via GitHub Tue, 08 Jul 2025 07:07:26 -0700


the-other-tim-brown commented on code in PR #13444:
URL: https://github.com/apache/hudi/pull/13444#discussion_r2192633570



##########
hudi-common/src/main/java/org/apache/hudi/common/table/read/FileGroupRecordBuffer.java:
##########
@@ -565,27 +570,62 @@ protected boolean hasNextBaseRecord(T baseRecord, 
BufferedRecord<T> logRecordInf
       Pair<Boolean, T> isDeleteAndRecord = merge(baseRecordInfo, 
logRecordInfo);
       if (!isDeleteAndRecord.getLeft()) {
         // Updates
-        nextRecord = readerContext.seal(isDeleteAndRecord.getRight());
+        nextRecord = 
readerContext.seal(applyOutputSchemaConversion(isDeleteAndRecord.getRight()));

Review Comment:
   > > Don't forget SortedKeyBasedFileGroupRecordBuffer
   > 
   > This is used for MDT right? do we need to log cdc in such scenarios?
   > 
   @danny0405 there is already `HoodieSortedMergeHandleWithChangeLog` so there 
is already support for CDC on with sorted tables and I think we should preserve 
that. 
   
   > > If we really care about a single check that happens for a subset of 
rows, why isn't the same scrutiny applied to other PRs regardless of whether it 
is new logic or not
   > 
   > We did it when Lin introduces more merging mode to FG reader, that is why 
we added the `BufferedRecordMerger` absractions, sub-classing the file group 
record buffer is just a suggestion, you can think other alternatives to get rid 
of the new introduced per-row handling and keep the core merging logic clean 
and clear.
   
   @danny0405 as mentioned before this is not really per-row. It is 
per-deduped-log record. Like I also mentioned before, this performance can be 
easily recovered by some other basic improvements like avoiding the auto-boxing 
and un-boxing that happens in these same cases. When you call a method for 
another class, this also introduces an implicit null check in java so by moving 
the merger code into a separate class it has also added "per-row" degradation 
by your logic. These checks are on the order of 1 nanosecond or less with 
modern JVMs. So with 1M updates to the base file, you get 1 millisecond of 
latency added.
   
   We can skip using Option to also get performance improvements. Making these 
extra objects is just overhead for the JVM and we can use `null` instead but 
that sacrifices readability and intentionality of the code. There are so many 
places we can shave a nanosecond or more in this codebase if we want to make 
these tradeoffs.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [HUDI-9526] Use HoodieFileGroupReader throughout the CDC flow [hudi]

Reply via email to