[GitHub] [hudi] minihippo commented on pull request #3173: [HUDI-1951] Add bucket hash index, compatible with the hive bucket

GitBox Mon, 22 Nov 2021 22:08:47 -0800


minihippo commented on pull request #3173:
URL: https://github.com/apache/hudi/pull/3173#issuecomment-976190383



   > Hi @nsivabalan, I've addressed all comments. The main changes are:
   > 
   > 1. Unify the bucket index configurations in HoodieIndexConfig.
   > 2. On the premise that the bucket index key must be a subset of the
   > record key, the index key value is extracted from HoodieKey at runtime,
   > without changing the HoodieKey data structure. `BucketIdentifier` is
   > introduced to do this.
   > 3. During `tag location`, cache the partial filesystem view in each
   > Spark task. This differs from the bloom index, which first caches the
   > hoodie keys and file names and then joins them with the input data. The
   > bucket index is intended to process much larger data, and a join is a
   > heavy operation. Therefore, going from hoodieRecordRDD to taggedRecordRDD
   > is a mapPartitions-only operation.
   
   @vinothchandar, here is the summary after all comments were addressed.
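
A minimal sketch of points 2 and 3, not the code in this pull request. It assumes a record key serialized as comma-separated `field:value` pairs and uses a plain Java list hash, which is not necessarily the Hive-compatible hash. The class `BucketIndexSketch` and its methods `indexKeyValues`, `bucketId`, and `tagPartition` are hypothetical names chosen for illustration. The bucket-to-file map stands in for the partial filesystem view cached in each task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Hudi.
final class BucketIndexSketch {

    // Pull the index-key field values out of a record key such as "id:42,ts:1".
    // Assumption: the record key is a list of "field:value" pairs joined by commas.
    static List<String> indexKeyValues(String recordKey, List<String> indexFields) {
        Map<String, String> fieldToValue = new HashMap<>();
        for (String pair : recordKey.split(",")) {
            String[] kv = pair.split(":", 2);
            if (kv.length == 2) {
                fieldToValue.put(kv[0], kv[1]);
            }
        }
        List<String> values = new ArrayList<>();
        for (String field : indexFields) {
            String value = fieldToValue.get(field);
            if (value == null) {
                // The index key must be a subset of the record key fields.
                throw new IllegalArgumentException("Missing index field: " + field);
            }
            values.add(value);
        }
        return values;
    }

    // Map a record key to a bucket in [0, numBuckets).
    // Assumption: List.hashCode() is used here purely for simplicity.
    static int bucketId(String recordKey, List<String> indexFields, int numBuckets) {
        int hash = indexKeyValues(recordKey, indexFields).hashCode();
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    // Tag the records of one partition with a local lookup, so no join is needed.
    // A null file id means no existing file serves that bucket yet.
    static Map<String, String> tagPartition(List<String> recordKeys,
                                            Map<Integer, String> bucketToFileId,
                                            List<String> indexFields,
                                            int numBuckets) {
        Map<String, String> tagged = new LinkedHashMap<>();
        for (String recordKey : recordKeys) {
            int bucket = bucketId(recordKey, indexFields, numBuckets);
            tagged.put(recordKey, bucketToFileId.get(bucket));
        }
        return tagged;
    }

    public static void main(String[] args) {
        List<String> indexFields = Arrays.asList("id");
        int numBuckets = 8;

        // Stand-in for the cached partial filesystem view of one partition.
        Map<Integer, String> bucketToFileId = new HashMap<>();
        bucketToFileId.put(bucketId("id:42,ts:1", indexFields, numBuckets), "file-a");

        List<String> input = Arrays.asList("id:42,ts:1", "id:42,ts:2", "id:7,ts:3");
        tagPartition(input, bucketToFileId, indexFields, numBuckets)
            .forEach((key, fileId) -> System.out.println(key + " -> " + fileId));
    }
}
```

Because the bucket depends only on the index key fields, two record keys that share the same `id` land in the same bucket regardless of `ts`. In Spark, a function like `tagPartition` could be applied once per partition through `mapPartitions`, which avoids the shuffle that a join would require.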


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

