[GitHub] [hudi] tjtoll commented on issue #4873: Processing time very Slow Updating records into Hudi Dataset(MOR) using AWS Glue

GitBox Fri, 18 Mar 2022 10:43:16 -0700


tjtoll commented on issue #4873:
URL: https://github.com/apache/hudi/issues/4873#issuecomment-1072645191



   > Since you have a complex record key, I suspect that range pruning with the 
Bloom index is not effective. Range pruning helps only if your record keys have 
some timestamp-like (ordered) characteristics, so that we can prune some file 
groups using just the min and max record key values stored in them.
   > 
   > So I would recommend trying the "SIMPLE" index instead. For random or 
large updates, this might work better. Do give 
[this](https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/) blog 
a read to understand the index types in Hudi. Also, you can check out the 
configs for the simple index 
[here](https://hudi.apache.org/docs/next/configurations#hoodiesimpleindexparallelism).
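
For reference, switching the index type as suggested in the quote might look roughly like the following PySpark write options. This is only a sketch: the table name, field names, and parallelism value are hypothetical placeholders that do not come from this issue, and the commented-out write call assumes an existing DataFrame `df` and a `base_path`.

```python
# Hypothetical Hudi write options for an upsert into a MOR table.
# Table and field names below are placeholders, not taken from the issue.
hudi_options = {
    "hoodie.table.name": "my_table",                        # placeholder
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "id",        # placeholder
    "hoodie.datasource.write.partitionpath.field": "event_date",  # placeholder
    "hoodie.datasource.write.precombine.field": "updated_at",     # placeholder
    # Use the SIMPLE index instead of the default BLOOM index.
    "hoodie.index.type": "SIMPLE",
    # Parallelism for the simple index lookup; the value is an arbitrary example.
    "hoodie.simple.index.parallelism": "100",
}

# Assumed usage, given a Spark DataFrame `df` and a target `base_path`:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```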
   
   Is it only the record key that needs timestamp characteristics, or does the 
partitioning count as well? For example, if I have a random record key but my 
partitions are by date, is BLOOM still beneficial? 
   
   Also, on tables where I do have an incrementing record key, why doesn't Hudi 
sort the records before writing them? The files it writes have huge, 
overlapping ranges of record keys.
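
To make the min/max range question concrete, here is a small self-contained simulation. It is a toy model, not Hudi code: `file_ranges` and `candidate_files` are made-up helpers, and the file size of 100 keys is arbitrary. It shows how many files a single key lookup cannot rule out by range alone when keys are written in order, in random order, or sorted before writing.

```python
import random

def file_ranges(keys, file_size):
    """Split keys into files in arrival order and record each file's (min, max) key."""
    files = [keys[i:i + file_size] for i in range(0, len(keys), file_size)]
    return [(min(f), max(f)) for f in files]

def candidate_files(ranges, lookup_key):
    """Count the files whose [min, max] range could contain lookup_key."""
    return sum(1 for lo, hi in ranges if lo <= lookup_key <= hi)

rng = random.Random(0)
sequential = list(range(1000))   # incrementing keys, written in order
shuffled = sequential[:]
rng.shuffle(shuffled)            # same keys, written in random order

seq_hits = candidate_files(file_ranges(sequential, 100), 500)
rand_hits = candidate_files(file_ranges(shuffled, 100), 500)
sorted_hits = candidate_files(file_ranges(sorted(shuffled), 100), 500)

print("ordered writes:", seq_hits)
print("random writes:", rand_hits)
print("sorted before write:", sorted_hits)
```

With ordered or pre-sorted keys, exactly one file's range contains the lookup key. With random arrival order, each file's range tends to span most of the key space, so range pruning excludes few files and each remaining candidate still has to be checked.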


