the-other-tim-brown commented on code in PR #13444:
URL: https://github.com/apache/hudi/pull/13444#discussion_r2192633570
##########
hudi-common/src/main/java/org/apache/hudi/common/table/read/FileGroupRecordBuffer.java:
##########
@@ -565,27 +570,62 @@ protected boolean hasNextBaseRecord(T baseRecord,
BufferedRecord<T> logRecordInf
Pair<Boolean, T> isDeleteAndRecord = merge(baseRecordInfo,
logRecordInfo);
if (!isDeleteAndRecord.getLeft()) {
// Updates
- nextRecord = readerContext.seal(isDeleteAndRecord.getRight());
+ nextRecord =
readerContext.seal(applyOutputSchemaConversion(isDeleteAndRecord.getRight()));
Review Comment:
> > Don't forget SortedKeyBasedFileGroupRecordBuffer
>
> This is used for MDT right? do we need to log cdc in such scenarios?
>
@danny0405 there is already `HoodieSortedMergeHandleWithChangeLog` so there
is already support for CDC on with sorted tables and I think we should preserve
that.
> > If we really care about a single check that happens for a subset of
rows, why isn't the same scrutiny applied to other PRs regardless of whether it
is new logic or not
>
> We did it when Lin introduces more merging mode to FG reader, that is why
we added the `BufferedRecordMerger` absractions, sub-classing the file group
record buffer is just a suggestion, you can think other alternatives to get rid
of the new introduced per-row handling and keep the core merging logic clean
and clear.
@danny0405 as mentioned before this is not really per-row. It is
per-deduped-log record. Like I also mentioned before, this performance can be
easily recovered by some other basic improvements like avoiding the auto-boxing
and un-boxing that happens in these same cases. When you call a method for
another class, this also introduces an implicit null check in java so by moving
the merger code into a separate class it has also added "per-row" degradation
by your logic. These checks are on the order of 1 nanosecond or less with
modern JVMs. So with 1M updates to the base file, you get 1 millisecond of
latency added.
We can skip using Option to also get performance improvements. Making these
extra objects is just overhead for the JVM and we can use `null` instead but
that sacrifices readability and intentionality of the code. There are so many
places we can shave a nanosecond or more in this codebase if we want to make
these tradeoffs.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]