yihua opened a new pull request #4878:
URL: https://github.com/apache/hudi/pull/4878


   ## What is the purpose of the pull request
   
   This PR adds functionality to `HoodieMetadataTableValidator` for validating 
the column stats and bloom filters stored in the metadata table against the 
ground truth in the base files.
   
   ## Brief change log
   
   - Adds two new validation tasks in `HoodieMetadataTableValidator`
    - `--validate-all-column-stats`: validate column stats for all columns in 
the schema
    - `--validate-bloom-filters`: validate bloom filters of base files
   - Adds a new class `BloomFilterData` to facilitate validation of bloom 
filters
   - Makes `ParquetUtils::readMetadata` static
   - Cleans up naming and makes minor changes to log messages and log levels
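
As a rough illustration of what bloom filter validation involves, the sketch below compares serialized bloom filters keyed by base file name, one map as read from the metadata table and one as read from the base files themselves. The class `BloomFilterCheck`, the method `findMismatches`, and the choice to compare raw bytes are assumptions made for this example; this is not the `BloomFilterData` class added in the PR, and it does not use any Hudi API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; names and structure are illustrative and are
// not taken from the Hudi code base.
final class BloomFilterCheck {

  // Returns, in sorted order, the base file names whose serialized bloom
  // filter differs between the two views or is present in only one of them.
  static List<String> findMismatches(Map<String, byte[]> fromMetadata,
                                     Map<String, byte[]> fromBaseFiles) {
    List<String> mismatches = new ArrayList<>();
    // Every base file must have an identical filter in the metadata view.
    for (Map.Entry<String, byte[]> entry : fromBaseFiles.entrySet()) {
      byte[] stored = fromMetadata.get(entry.getKey());
      if (stored == null || !Arrays.equals(stored, entry.getValue())) {
        mismatches.add(entry.getKey());
      }
    }
    // Filters recorded in the metadata view for files that no longer exist
    // are also reported.
    for (String fileName : fromMetadata.keySet()) {
      if (!fromBaseFiles.containsKey(fileName)) {
        mismatches.add(fileName);
      }
    }
    Collections.sort(mismatches);
    return mismatches;
  }
}
```

An empty result means the two views agree; any returned file name points at a bloom filter worth inspecting.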
   
   ## Verify this pull request
   
   This PR was verified by running `HoodieMetadataTableValidator` with 
spark-submit for all validation tasks, to make sure that both the existing and 
the new validation logic work properly. The column stats validation caught a 
bug, which is addressed in #4875.
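
To show how a column stats check of this kind can surface a discrepancy, here is a minimal sketch that compares per-file, per-column statistics from two sources. The `ColumnStat` fields (`min`, `max`, `nullCount` as `long` values) and the `findMismatches` helper are assumptions chosen for the example; the validator's actual data model and comparison logic may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; not the implementation in this PR.
final class ColumnStatsCheck {

  // Illustrative per-file, per-column summary.
  static final class ColumnStat {
    final String fileName;
    final String column;
    final long min;
    final long max;
    final long nullCount;

    ColumnStat(String fileName, String column, long min, long max, long nullCount) {
      this.fileName = fileName;
      this.column = column;
      this.min = min;
      this.max = max;
      this.nullCount = nullCount;
    }

    // Identifies the (file, column) pair this stat describes.
    String key() {
      return fileName + "#" + column;
    }

    boolean sameValues(ColumnStat other) {
      return min == other.min && max == other.max && nullCount == other.nullCount;
    }
  }

  // Returns the keys of ground-truth stats that are missing from, or differ
  // in, the metadata view, in the order they appear in fromBaseFiles.
  // Stats that exist only in the metadata view are not reported here.
  static List<String> findMismatches(List<ColumnStat> fromMetadata,
                                     List<ColumnStat> fromBaseFiles) {
    Map<String, ColumnStat> stored = new HashMap<>();
    for (ColumnStat stat : fromMetadata) {
      stored.put(stat.key(), stat);
    }
    List<String> mismatches = new ArrayList<>();
    for (ColumnStat truth : fromBaseFiles) {
      ColumnStat candidate = stored.get(truth.key());
      if (candidate == null || !candidate.sameValues(truth)) {
        mismatches.add(truth.key());
      }
    }
    return mismatches;
  }
}
```

A stale `max` value in the metadata view, for instance, would show up as a single reported key for the affected file and column.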
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


