Re: [PR] [HUDI-9685] Merge row groups for file stitching with Parquet API and group by schema [hudi]

via GitHub Fri, 08 Aug 2025 09:33:17 -0700


shangxinli commented on PR #13674:
URL: https://github.com/apache/hudi/pull/13674#issuecomment-3168584527


   > Hi @shangxinli Thanks for this PR. Before doing a deep review, could you 
please help answer a few questions: 1. What is the difference between this 
strategy and the previous copier? 2. How do you handle clustering metadata 
fields such as _hoodie_file_name? 3. For BloomFilter, how do you merge the 
Hoodie custom BloomFilter field in the Parquet footer?
   
   Thanks @YuangZhang for reviewing it, and this is great feedback! My 
previous commits ignored _hoodie_file_name, and the PR should handle it. 
That is actually a blocker for using the existing Parquet API for this. Given 
that, I am going to revert my implementation and reuse yours. 
   
   The second part of this PR is to avoid schema evolution. The reason is that 
schema evolution has caused outages in the past due to the complexity of the 
schemas themselves. Some of the schemas are very complex, with many nested 
layers and complex data types such as maps and arrays. I added a flag to turn 
schema evolution on or off. 
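
   For illustration only, the "group by schema" idea from the PR title could 
be sketched roughly as below. This is not the PR's code: the function name 
`plan_stitch_groups`, the `allow_schema_evolution` flag name, and the use of a 
plain string as a schema fingerprint are all assumptions made for the example. 
With the flag off, only files that share an identical schema land in the same 
stitching group, so no schema reconciliation is needed within a group.

```python
from collections import defaultdict


def plan_stitch_groups(files, allow_schema_evolution=False):
    """Plan which files may be stitched together.

    files: list of (path, schema) pairs, where schema is any hashable
    description of the file schema (a hypothetical fingerprint string here).
    """
    if allow_schema_evolution:
        # Flag on: every file goes into one group; schemas must be
        # reconciled later by whatever performs the merge.
        return [[path for path, _ in files]]

    # Flag off: only files with an identical schema share a group.
    groups = defaultdict(list)
    for path, schema in files:
        groups[schema].append(path)
    return list(groups.values())


# Hypothetical input: two files share schema "s1", one has schema "s2".
files = [("a.parquet", "s1"), ("b.parquet", "s2"), ("c.parquet", "s1")]
groups_without_evolution = plan_stitch_groups(files)
groups_with_evolution = plan_stitch_groups(files, allow_schema_evolution=True)
```

   Grouping on an exact schema match is the most conservative choice: it may 
produce more, smaller groups, but it never needs to reconcile nested types.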
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
