nsivabalan commented on a change in pull request #4226: URL: https://github.com/apache/hudi/pull/4226#discussion_r763510372
##########
File path: website/docs/metadata.md
##########
@@ -0,0 +1,34 @@
+---
+title: Metadata Table
+keywords: [ hudi, metadata, S3 file listings]
+---
+
+## Motivation for a Metadata Table
+
+The Apache Hudi Metadata Table can significantly improve read/write performance of your queries. The two main purposes of the
+Metadata Table are:
+
+1. **Eliminate the requirement for the "list files" operation:**
+   1. When reading, writing data in HDFS, file listing operations are performed to get the current view of the file system.
+      When data sets are large, listing all the files becomes a performance bottleneck and in the case of cloud storage systems
+      like AWS S3, sometimes causes throttling due to list operation request limits. The Metadata Table will instead
+      proactively maintain the list of files and remove the need for recursive file listing operations on HDFS.
+2. **Create Column Indexes for better query planning and faster lookups by readers**
+   1. For a column in the dataset, min/max range per Parquet file can be maintained.
+      Just by reading this index file, the query planning system should be able to get the view of potential Parquet files for a range query.

Review comment:
Yeah, we may not want to go into too much detail about column stats for now. We can just call out that future enhancements, such as a column index, are planned, but I will let you make the call.