nsivabalan commented on code in PR #13830:
URL: https://github.com/apache/hudi/pull/13830#discussion_r2323044388
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##########
@@ -620,8 +619,10 @@ public static <R> HoodieData<HoodieRecord<R>> tagGlobalLocationBackToRecords(
 if (currentLocOpt.isPresent()) {
 HoodieRecordGlobalLocation currentLoc = currentLocOpt.get();
 boolean shouldDoMergedLookUpThenTag = mayContainDuplicateLookup
- || !Objects.equals(incomingRecord.getPartitionPath(), currentLoc.getPartitionPath());
- if (shouldUpdatePartitionPath && shouldDoMergedLookUpThenTag) {
+ || !Objects.equals(incomingRecord.getPartitionPath(), currentLoc.getPartitionPath())
+ // if the ordering is not simply based on commit time and the incoming record is a delete, the value needs to be compared to the existing value before deleting the key from the index
+ || (!isCommitTimeOrdered && incomingRecord.isDelete(writerSchema.get(), properties));
+ if ((shouldUpdatePartitionPath || isMoRTable) && shouldDoMergedLookUpThenTag) {
Review Comment:
`mayContainDuplicateLookup` serves a different purpose.
It does not refer to whether the incoming records contain duplicates.
Index tagging here happens in 2 steps:
step1: do the index lookup based on bloom, simple, or RLI.
step2: merge with older versions if need be, and add deletes to older
partitions if the record moves to a different partition, etc.
In step1, even if we look up a single record key, the lookup might return
2 locations (hence the name "duplicate lookup") for global bloom and global
simple on a MOR table.
```
c1 -> p1, rk1 goes into fg1 (base file)
c2 -> rk1 moves to p2.
so,
p2, rk1 writes into fg2 in p2.
and adds a log file to fg1.
Layout:
fg1 base file, log file.
fg2 base file.
c3: rk1 moves to p3.
```
With this layout, when we do step1 for a global bloom lookup, we might get
both locations (the duplicate), i.e. fg1 from p1 and fg2 from p2.
But if the index is global RLI, it will return only 1 location at the end of
step1.
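
To make the difference concrete, here is a minimal toy sketch of the layout above. It is not Hudi code: the class, maps, and method names are all hypothetical, and it assumes the bloom/simple style lookup only consults base files (so the delete sitting in fg1's log file is invisible to it), while a record-level index keeps a single key-to-location entry.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of the layout after c1 and c2; none of these names exist in Hudi.
class DuplicateLookupSketch {

  // Record keys present in each file group's base file, keyed by "partition/fileGroup".
  // fg1's base file still contains rk1; the delete written at c2 lives only in fg1's log file.
  static final Map<String, List<String>> BASE_FILE_KEYS = new LinkedHashMap<>();

  // Record-level index: one location per record key, overwritten when rk1 moved at c2.
  static final Map<String, String> RECORD_LEVEL_INDEX = new LinkedHashMap<>();

  static {
    BASE_FILE_KEYS.put("p1/fg1", List.of("rk1"));
    BASE_FILE_KEYS.put("p2/fg2", List.of("rk1"));
    RECORD_LEVEL_INDEX.put("rk1", "p2/fg2");
  }

  // Step1 for a global bloom / global simple style lookup (assumption: base files only).
  static List<String> lookupByScanningBaseFiles(String recordKey) {
    List<String> locations = new ArrayList<>();
    for (Map.Entry<String, List<String>> entry : BASE_FILE_KEYS.entrySet()) {
      if (entry.getValue().contains(recordKey)) {
        locations.add(entry.getKey());
      }
    }
    return locations;
  }

  // Step1 for a global RLI style lookup: at most one location per key.
  static List<String> lookupByRecordLevelIndex(String recordKey) {
    List<String> locations = new ArrayList<>();
    String location = RECORD_LEVEL_INDEX.get(recordKey);
    if (location != null) {
      locations.add(location);
    }
    return locations;
  }

  public static void main(String[] args) {
    System.out.println(lookupByScanningBaseFiles("rk1")); // two locations: p1/fg1 and p2/fg2
    System.out.println(lookupByRecordLevelIndex("rk1"));  // one location: p2/fg2
  }
}
```

In this toy model, a flag like `mayContainDuplicateLookup` would be true for the first lookup style and false for the second, regardless of whether the incoming batch itself has duplicate keys.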
Not sure if there is an easier way to name the variable here.