yihua opened a new pull request #4878:
URL: https://github.com/apache/hudi/pull/4878
## What is the purpose of the pull request
This PR adds functionality to `HoodieMetadataTableValidator` to validate the
column stats and bloom filters stored in the metadata table against the
ground truth in the base files.
## Brief change log
- Adds two new validation tasks in `HoodieMetadataTableValidator`:
  - `--validate-all-column-stats`: validates column stats for all columns in
    the schema
  - `--validate-bloom-filters`: validates bloom filters of base files
- Adds a new class `BloomFilterData` to facilitate validation of bloom
  filters
- Makes `ParquetUtils::readMetadata` static
- Cleans up naming and adds minor changes around logging messages and levels
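
To illustrate the idea behind the column stats check, here is a minimal sketch of comparing stats read from an index against stats recomputed from the files themselves. It is not the code in this PR: `ColumnStatsCheckSketch`, `ColumnStat`, `key`, and `findMismatches` are hypothetical names, and the stats are assumed to be already loaded into plain maps keyed by file and column.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

// Hypothetical sketch; none of these names are Hudi classes.
final class ColumnStatsCheckSketch {

    /** Min/max/null-count summary for one column of one base file. */
    static final class ColumnStat {
        final String min;
        final String max;
        final long nullCount;

        ColumnStat(String min, String max, long nullCount) {
            this.min = min;
            this.max = max;
            this.nullCount = nullCount;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof ColumnStat)) {
                return false;
            }
            ColumnStat other = (ColumnStat) o;
            return Objects.equals(min, other.min)
                && Objects.equals(max, other.max)
                && nullCount == other.nullCount;
        }

        @Override
        public int hashCode() {
            return Objects.hash(min, max, nullCount);
        }

        @Override
        public String toString() {
            return "[min=" + min + ", max=" + max + ", nulls=" + nullCount + "]";
        }
    }

    /** Builds the lookup key "fileName/columnName". */
    static String key(String fileName, String columnName) {
        return fileName + "/" + columnName;
    }

    /**
     * Compares stats taken from the index with stats recomputed from the
     * files (the ground truth) and returns one message per differing entry.
     * An entry present on only one side also counts as a mismatch.
     */
    static List<String> findMismatches(Map<String, ColumnStat> fromMetadata,
                                       Map<String, ColumnStat> fromBaseFiles) {
        // Walk the sorted union of keys so the report order is deterministic.
        TreeSet<String> keys = new TreeSet<>(fromMetadata.keySet());
        keys.addAll(fromBaseFiles.keySet());

        List<String> mismatches = new ArrayList<>();
        for (String k : keys) {
            ColumnStat indexed = fromMetadata.get(k);
            ColumnStat actual = fromBaseFiles.get(k);
            if (!Objects.equals(indexed, actual)) {
                mismatches.add(k + ": metadata=" + indexed + ", baseFile=" + actual);
            }
        }
        return mismatches;
    }
}
```

An empty result means the two sources agree; any entry in the returned list names the file and column whose stats differ.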
## Verify this pull request
This PR is verified by running `HoodieMetadataTableValidator` with
spark-submit for all validation tasks to make sure that both the existing and
the new validation logic work properly. The column stats validation caught a
bug, which is addressed in #4875.
## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under
an umbrella JIRA.