zhangyue19921010 commented on code in PR #13017:
URL: https://github.com/apache/hudi/pull/13017#discussion_r2015989050


##########
hudi-common/src/main/java/org/apache/hudi/common/util/hash/BucketIndexUtil.java:
##########
@@ -35,8 +35,8 @@ public class BucketIndexUtil {
    * @param parallelism Parallelism of the task
    * @return The partition index of this bucket.
    */
-  public static Functions.Function2<String, Integer, Integer> getPartitionIndexFunc(int bucketNum, int parallelism) {
-    return (partition, curBucket) -> {
+  public static Functions.Function3<Integer, String, Integer, Integer> getPartitionIndexFunc(int parallelism) {
+    return (bucketNum, partition, curBucket) -> {

Review Comment:
   For the partition-level bucket index, the basic principle is as follows. In our production experience, the current algorithm meets most demands. I also created https://issues.apache.org/jira/browse/HUDI-9229 to track this.
   
   **Algorithm Logic**
   1. Compute Partition Base Index: `partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNum`. First, take the non-negative hash of the partition path modulo parallelism, then multiply the result by bucketNum to get a partition-specific offset.
   
   2. Compute Global Index: `globalIndex = partitionIndex + curBucket`. Add the 
current bucket ID (curBucket) to the partition offset.
   
   3. Map to Task ID: `Task ID = globalIndex % parallelism`. Assign the bucket 
to a Task by taking the modulo of the global index with parallelism.
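
The three steps above can be sketched as a single method. This is a minimal illustration and not the actual `BucketIndexUtil` code: the class name `PartitionIndexSketch` and the method `taskId` are hypothetical, and only the formulas come from the description above.

```java
public class PartitionIndexSketch {

    // Hypothetical helper that mirrors the three steps described in the comment.
    static int taskId(int bucketNum, String partition, int curBucket, int parallelism) {
        // Step 1: non-negative partition hash modulo parallelism, scaled by bucketNum.
        int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) % parallelism * bucketNum;
        // Step 2: add the current bucket ID to the partition offset.
        int globalIndex = partitionIndex + curBucket;
        // Step 3: map the global index onto a task.
        return globalIndex % parallelism;
    }

    public static void main(String[] args) {
        int parallelism = 4;
        int bucketNum = 3;
        // "dt=2024-01-01" is an assumed partition path, used only for illustration.
        for (int curBucket = 0; curBucket < bucketNum; curBucket++) {
            System.out.println("bucket " + curBucket + " -> task "
                + taskId(bucketNum, "dt=2024-01-01", curBucket, parallelism));
        }
    }
}
```

Because `%` and `*` have equal precedence and associate left to right in Java, the expression in step 1 evaluates as `(hash % parallelism) * bucketNum`, which matches the description.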
   
   **Advantages**
   Simplicity: The logic is straightforward and computationally lightweight, 
suitable for quick implementation.
   
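   **Issue: Possible Non-Uniform Distribution**
   Key Issue: The initial partition offset (`partitionIndex = (hash % parallelism) * bucketNum`) places all buckets of the same partition on consecutive Task IDs starting at `partitionIndex % parallelism`. When bucketNum is not a multiple of parallelism, some Tasks therefore receive fewer buckets from that partition than others, or none at all.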
   
   Example:
   ```
   parallelism = 4, bucketNum = 3, hash(partition) % parallelism = 1 → partitionIndex = 1 * 3 = 3.

   curBucket = 0 → 3 % 4 = 3 (Task 3)
   curBucket = 1 → 4 % 4 = 0 (Task 0)
   curBucket = 2 → 5 % 4 = 1 (Task 1)

   Result: Task 2 receives no buckets, leading to load imbalance.
   ```
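
The example can also be checked by counting how many buckets of one partition land on each Task. The sketch below is only an illustration: `BucketSkewExample` and `bucketsPerTask` are hypothetical names, and the method takes the already reduced value `hash(partition) % parallelism` as input so that it reproduces the numbers above exactly.

```java
import java.util.Arrays;

public class BucketSkewExample {

    // Hypothetical helper: counts the buckets of a single partition per task,
    // given hashModParallelism = hash(partition) % parallelism.
    static int[] bucketsPerTask(int hashModParallelism, int bucketNum, int parallelism) {
        int[] counts = new int[parallelism];
        int partitionIndex = hashModParallelism * bucketNum;
        for (int curBucket = 0; curBucket < bucketNum; curBucket++) {
            counts[(partitionIndex + curBucket) % parallelism]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // parallelism = 4, bucketNum = 3, hash(partition) % parallelism = 1
        System.out.println(Arrays.toString(bucketsPerTask(1, 3, 4))); // prints [1, 1, 0, 1]
    }
}
```

Index 2 of the result is 0, which corresponds to Task 2 receiving no buckets from this partition.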
   
   
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
