zhangyue19921010 commented on code in PR #13017:
URL: https://github.com/apache/hudi/pull/13017#discussion_r2015989050


##########
hudi-common/src/main/java/org/apache/hudi/common/util/hash/BucketIndexUtil.java:
##########
@@ -35,8 +35,8 @@ public class BucketIndexUtil {
    * @param parallelism Parallelism of the task
    * @return The partition index of this bucket.
    */
-  public static Functions.Function2<String, Integer, Integer> getPartitionIndexFunc(int bucketNum, int parallelism) {
-    return (partition, curBucket) -> {
+  public static Functions.Function3<Integer, String, Integer, Integer> getPartitionIndexFunc(int parallelism) {
+    return (bucketNum, partition, curBucket) -> {

Review Comment:
For the Partition Level Bucket Index, the basic principle is as follows. (From our PRD experience, the current algorithm meets most demands. I also created https://issues.apache.org/jira/browse/HUDI-9229 to track it.)

**Algorithm Logic**

1. Compute Partition Base Index: `partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNum`. First, calculate the base index by taking the non-negative hash of the partition path modulo `parallelism`, then multiply by `bucketNum` to generate a partition-specific offset.
2. Compute Global Index: `globalIndex = partitionIndex + curBucket`. Add the current bucket ID (`curBucket`) to the partition offset.
3. Map to Task ID: `Task ID = globalIndex % parallelism`. Assign the bucket to a Task by taking the global index modulo `parallelism`.

**Advantages**

Simplicity: The logic is straightforward and computationally lightweight, suitable for quick implementation.

**Potential Issue: Non-Uniform Distribution**

Key Issue: The initial partition offset (`partitionIndex = (hash % parallelism) * bucketNum`) causes all buckets of the same partition to be clustered in contiguous blocks across Tasks.

Example:

```
parallelism = 4, bucketNum = 3, hash(partition) % parallelism = 1
→ partitionIndex = 1 * 3 = 3
curBucket = 0 → 3 % 4 = 3 (Task 3)
curBucket = 1 → 4 % 4 = 0 (Task 0)
curBucket = 2 → 5 % 4 = 1 (Task 1)
Result: Task 2 receives no buckets, leading to load imbalance.
```
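
To make the three steps and the example easier to check, here is a minimal standalone sketch in Java. It is not the code in `BucketIndexUtil`; the class name `PartitionBucketIndexSketch` and the helpers `taskFor` and `bucketsPerTask` are hypothetical names used only for illustration. The partition path `"a"` is chosen because `"a".hashCode()` is 97 and `97 % 4 == 1`, which matches the `hash(partition) % parallelism = 1` case in the example.

```java
import java.util.Arrays;

public class PartitionBucketIndexSketch {

    // Hypothetical helper that mirrors the three steps described in the comment.
    public static int taskFor(int bucketNum, int parallelism, String partition, int curBucket) {
        // Step 1: partition-specific offset. In Java, % and * share the same precedence
        // and associate left to right, so this is ((hash & MAX_VALUE) % parallelism) * bucketNum.
        int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNum;
        // Step 2: add the bucket ID to the partition offset.
        int globalIndex = partitionIndex + curBucket;
        // Step 3: map the global index to a task.
        return globalIndex % parallelism;
    }

    // Hypothetical helper: counts how many buckets of one partition land on each task.
    public static int[] bucketsPerTask(int bucketNum, int parallelism, String partition) {
        int[] counts = new int[parallelism];
        for (int bucket = 0; bucket < bucketNum; bucket++) {
            counts[taskFor(bucketNum, parallelism, partition, bucket)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // parallelism = 4, bucketNum = 3, and "a" hashes to 97, so 97 % 4 = 1.
        // Buckets 0, 1, 2 map to tasks 3, 0, 1; task 2 gets nothing.
        System.out.println(Arrays.toString(bucketsPerTask(3, 4, "a"))); // prints [1, 1, 0, 1]
    }
}
```

Running the same helper over many partition paths and summing the per-task counts would be one way to see how often the offsets collide for a given `parallelism` and `bucketNum`.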