tjtoll commented on issue #4873: URL: https://github.com/apache/hudi/issues/4873#issuecomment-1072645191
> Since you have a complex record key, I feel that range pruning with the bloom index is not effective. Bloom filters are effective only if your record keys have some timestamp characteristics, so that we can trim a few file groups using just the min and max values of the record keys stored in them.
>
> So, I would recommend you try the "SIMPLE" index instead. For random or large updates, this might work out better. Do give [this](https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/) blog a read to understand index types in Hudi. Also, you can check out the configs for the simple index [here](https://hudi.apache.org/docs/next/configurations#hoodiesimpleindexparallelism).

Is it only the record key that needs timestamp characteristics, or the partitioning as well? For example, if I have a random record key but my partitions are by date, is BLOOM still beneficial?

Also, on tables where I do have an incrementing record key, why doesn't Hudi sort the records before writing them? The files it writes have huge, overlapping ranges of record keys.
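To make the min/max range pruning from the quoted reply concrete, here is a minimal Python sketch. It is not Hudi code: `file_ranges` and `candidate_files` are hypothetical helpers, and the integer keys and file size of 100 are assumptions chosen only for illustration. It compares how many files a key lookup must consider when keys are written in increasing order versus in random order.

```python
import random


def file_ranges(keys, file_size):
    """Split keys, in arrival order, into files and record each file's (min, max) key."""
    files = [keys[i:i + file_size] for i in range(0, len(keys), file_size)]
    return [(min(f), max(f)) for f in files]


def candidate_files(ranges, key):
    """Return the indexes of files whose [min, max] range could contain the key."""
    return [i for i, (lo, hi) in enumerate(ranges) if lo <= key <= hi]


rng = random.Random(0)
keys = list(range(1000))

# Keys arrive in increasing order, so each file covers a disjoint key range.
ordered = file_ranges(keys, 100)

# Keys arrive in random order, so each file's range spans most of the key space.
shuffled_keys = keys[:]
rng.shuffle(shuffled_keys)
shuffled = file_ranges(shuffled_keys, 100)

print("ordered candidates: ", len(candidate_files(ordered, 555)))
print("shuffled candidates:", len(candidate_files(shuffled, 555)))
```

With ordered keys only one file's range contains the lookup key, so the rest can be skipped without consulting their bloom filters. With shuffled keys nearly every file's range overlaps the key, which is the situation described above where files have huge, overlapping record key ranges.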
