danny0405 commented on code in PR #13699:
URL: https://github.com/apache/hudi/pull/13699#discussion_r2271941785
##########
hudi-common/src/main/java/org/apache/hudi/common/table/read/buffer/FileGroupRecordBuffer.java:
##########
@@ -56,27 +53,29 @@
import java.io.IOException;
import java.io.Serializable;
+import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
+import java.util.Set;
import java.util.function.Function;
import static
org.apache.hudi.common.config.HoodieCommonConfig.DISK_MAP_BITCASK_COMPRESSION_ENABLED;
import static
org.apache.hudi.common.config.HoodieCommonConfig.SPILLABLE_DISK_MAP_TYPE;
import static
org.apache.hudi.common.config.HoodieMemoryConfig.MAX_MEMORY_FOR_MERGE;
import static
org.apache.hudi.common.config.HoodieMemoryConfig.SPILLABLE_MAP_BASE_PATH;
-import static
org.apache.hudi.common.model.HoodieRecordMerger.PAYLOAD_BASED_MERGE_STRATEGY_UUID;
import static
org.apache.hudi.common.table.log.block.HoodieLogBlock.HeaderMetadataType.INSTANT_TIME;
abstract class FileGroupRecordBuffer<T> implements
HoodieFileGroupRecordBuffer<T> {
+ protected final Set<String> usedKeys = new HashSet<>();
Review Comment:
The fix is not right, the discrepencies come from the log comparison, for
old write merge handle, we do not remove the log record when a base record key
matches it, and bookeep a set of log record keys that have been matched with
base(used for skipping for those non matched log records).
While the FG reader current remove the log record directly, which makes it
impossbie to match the same log records with duplicates in base multiple times.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]