Davis-Zhang-Onehouse commented on code in PR #13489:
URL: https://github.com/apache/hudi/pull/13489#discussion_r2180615966
##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##########
@@ -277,61 +268,291 @@ private Predicate
transformKeyPrefixesToPredicate(List<String> keyPrefixes) {
return Predicates.startsWithAny(null, right);
}
- @Override
- protected Map<String, HoodieRecord<HoodieMetadataPayload>>
getRecordsByKeys(List<String> keys, String partitionName) {
- if (keys.isEmpty()) {
- return Collections.emptyMap();
+ private HoodiePairData<String, HoodieRecord<HoodieMetadataPayload>>
doLookupWithMapping(
+ HoodieData<String> keys, String partitionName, List<FileSlice>
fileSlices, boolean isSecondaryIndex,
+ Option<SerializableFunction<String, String>> keyEncodingFn) {
+
+ final int numFileSlices = fileSlices.size();
+ if (numFileSlices == 1) {
+ TreeSet<String> distinctSortedKeys = new TreeSet<>(keys.collectAsList());
+ return lookupRecordsWithMapping(partitionName, new
ArrayList<>(distinctSortedKeys), fileSlices.get(0), !isSecondaryIndex,
keyEncodingFn);
}
- Map<String, HoodieRecord<HoodieMetadataPayload>> result;
+ SerializableBiFunction<String, Integer, Integer> mappingFunction =
HoodieTableMetadataUtil::mapRecordKeyToFileGroupIndex;
Review Comment:
> do we mean the sharding function that maps a secondary index entry to the
right shard/filegroup within the MDT partition?
yes
> any micro-benchmarks? relative to other costs of shuffling/writing.. we
think this is high impact still?
It is not backed by any numbers. I just noticed that
- caller always know if it is concatenated key or not and knows exactly what
it wants, it is a matter of what hash function is required before we doing
anything
- we never have the case where there is a mix of concatenated key with
non-concatenated ones which requires hash func itself to check and decide on
the fly.
- since this is applied per record, you won't see difference in micro
benchmark, but only when we look up 1M or more keys that we might see 30 sec
slower (I guessed the number)
- Interface level to make it clean we can wrap the
shouldExtractSecondaryKeyForHashing in a context object as a commonly used
design pattern.
If given the context we think it does not worth code maintainability effort,
I'm happy to do whatever you suggest.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]