[GitHub] [hudi] wqwl611 commented on pull request #6636: [HUDI-4824]Add new index RANGE_BUCKET , when primary key is auto-increment like most mysql table

GitBox Sun, 25 Sep 2022 01:16:04 -0700


wqwl611 commented on PR #6636:
URL: https://github.com/apache/hudi/pull/6636#issuecomment-1257144913


   > > > Hey, thanks for the contribution. It is a great enhancement for bucket 
index.
   > > > At a high level, could we use the current BucketIndex abstraction to 
unify the implementation of different BucketIndexEngines? Also, the dedicated 
Partitioner (i.e., SparkRangeBucketIndexPartitioner) may not be necessary, as 
long as we tag the file id during indexing (check out consistent hashing, which 
uses the default Partitioner).
   > > 
   > > 
   > > ```
   > >  Right now, rangeBucketIndex generates files like 
"00000009-0_2-12-29_20220924180225595.parquet", whose names do not contain any 
UUID element. I think that is OK, am I right?
   > >  Following this idea, if simpleBucketIndex also behaved like this, 
SparkBucketIndexPartitioner might not be necessary either. And using the default 
partitioner would avoid a lot of empty Spark tasks.
   > > ```
   > > 
   > > 
   > > @YuweiXiao
   > 
   > Yeah, I was thinking the same thing: use the id as the name rather than 
concatenating the uuid. But I think the benefit is saving the metadata loading 
overhead (i.e., listing to get the filename) rather than the one you mentioned. 
With the default partitioner (`UpsertPartitioner`), there should not be any 
empty partitions. Please correct me if I am wrong.
   > 
   > Also, we had better follow the naming convention of the file group, to 
avoid potential compatibility problems.
   
   Yes, every empty bucket accesses the metadata in `getBucketInfo`. When 
partitionNum * bucketNum is very large, that is a heavy overhead for the 
metadata (and the Spark driver scheduler doesn't like it either).
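
To put rough numbers on this concern, here is a minimal sketch. It assumes a dedicated partitioner creates one task per (partition, bucket) slot, which is how the comment above describes it; the function name `count_tasks` and the workload figures are hypothetical, not taken from the PR.

```python
def count_tasks(partition_num, bucket_num, touched):
    """Return (total_tasks, empty_tasks) for a one-task-per-slot partitioner.

    touched: set of (partition, bucket) pairs that actually receive records.
    """
    total = partition_num * bucket_num   # one task per possible bucket slot
    empty = total - len(touched)         # slots that receive no records
    return total, empty


# Hypothetical workload: 1000 partitions x 512 buckets, one batch hits 3 buckets.
touched = {(0, 9), (0, 10), (1, 9)}
total, empty = count_tasks(1000, 512, touched)
print(total, empty)
```

Under these assumptions nearly every slot is empty, and each empty slot would still cost a scheduled task and a metadata lookup.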
   
   More importantly, I'm afraid we can't follow the UUID naming convention, 
because the name is generated inside the RDD task record by record, not 
bucket by bucket as simpleBucketIndex does right now.
   @YuweiXiao 
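
A minimal sketch of why a UUID-free name can be produced record by record. Two things here are assumptions, not confirmed by the PR: the base file layout `<fileId>_<writeToken>_<instantTime>.parquet` (inferred from the example name quoted earlier in this thread), and the rule `key // bucket_range` for picking a bucket from an auto-increment key. All function names are hypothetical.

```python
import re

# Assumed layout: <fileId>_<writeToken>_<instantTime>.parquet
BASE_FILE = re.compile(
    r"^(?P<file_id>[^_]+)_(?P<write_token>[^_]+)_(?P<instant>\d+)\.parquet$"
)


def range_bucket_id(key: int, bucket_range: int) -> int:
    """Hypothetical rule: consecutive auto-increment keys share one bucket."""
    if key < 0 or bucket_range <= 0:
        raise ValueError("key must be >= 0 and bucket_range must be > 0")
    return key // bucket_range


def file_id_for_bucket(bucket_id: int) -> str:
    """8-digit zero-padded bucket number plus '-0', with no UUID part."""
    return f"{bucket_id:08d}-0"


def bucket_of_file(name: str) -> int:
    """Recover the bucket number from a base file name."""
    match = BASE_FILE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return int(match.group("file_id").split("-")[0])


file_id = file_id_for_bucket(range_bucket_id(9_500_000, 1_000_000))
bucket = bucket_of_file("00000009-0_2-12-29_20220924180225595.parquet")
print(file_id, bucket)
```

Because the file id depends only on the record key, any task can compute it without listing existing files or agreeing on a shared UUID per bucket.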
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
