[ https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475171#comment-17475171 ]
Ethan Guo commented on HUDI-64:
-------------------------------

[~vinoth] [~yanghua] Based on my reading of the thread, it looks like item #3 was related to the integration of Hudi with Flink at the time. Since Flink is now integrated with Hudi, #3 looks more like an optimization we can do to improve the write path in Hudi + Flink by leveraging the WorkloadProfile (it is still not used in the Flink engine). Is my understanding correct?

I think we can approach the optimization aspects of storage, like file sizes and partitioning, in three phases (the concrete items are not exhaustive):
# Expanding commit metadata with useful storage information
** Goal: Commit metadata are the ground truth for the heuristics. One can always measure how well the estimation/heuristics do by looking only at the commit metadata, e.g., comparing actual vs. targeted compression ratio, actual vs. targeted file size, etc., without scanning the data files. (A rough sketch follows the current commit metadata below.)
** Concrete items:
*** Add bytes to write before compression, targeted compression ratio, and sizing info (base files)
*** Add a breakdown of bytes between inserts and updates (base + log files)
# Providing a framework to plug in customized estimation/heuristic algorithms
** Goal: A new estimation/heuristic algorithm can be plugged in by providing the class name through a config, without changing core write pipeline code. (See the strategy sketch below.)
** Concrete items:
*** Add an abstract class for the optimization strategy containing methods for the different operations: file sizing, estimation of inserts/updates, etc.
# Experimenting with different estimation/heuristic algorithms
** Goal: Try different heuristics with perf evaluation, e.g., a simple average like the existing one, a moving average, using historical info, etc. (See the moving-average sketch below.)
** Concrete items:
*** Convert the existing heuristics into an optimization strategy to start with

I would say items 1 and 2 above are low-hanging fruit that we can do without changing the existing estimation/heuristics. Item 3 is non-trivial and may take some effort to get right; but when we have time, we can always experiment with something and make incremental progress.
----
Current commit metadata:
{code:java}
{
  "fileId" : "a525f37d-36f3-4543-8dc9-85596d307049-20",
  "path" : "2021/7/19/a525f37d-36f3-4543-8dc9-85596d307049-20_15-5-78_20211222170050726.parquet",
  "prevCommit" : "null",
  "numWrites" : 4420,
  "numDeletes" : 0,
  "numUpdateWrites" : 0,
  "numInserts" : 4420,
  "totalWriteBytes" : 1905577,
  "totalWriteErrors" : 0,
  "tempPath" : null,
  "partitionPath" : "2021/7/19",
  "totalLogRecords" : 0,
  "totalLogFilesCompacted" : 0,
  "totalLogSizeCompacted" : 0,
  "totalUpdatedRecordsCompacted" : 0,
  "totalLogBlocks" : 0,
  "totalCorruptLogBlock" : 0,
  "totalRollbackBlocks" : 0,
  "fileSizeInBytes" : 1905577,
  "minEventTime" : null,
  "maxEventTime" : null
}
{code}
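To make phase 1 more concrete, here is a rough, illustrative sketch of the kind of per-file stats these additions would enable; the class and field names below are made up for discussion and nothing here is implemented (the existing stats shown above come from HoodieWriteStat):
{code:java}
// Hypothetical additions to the per-file write stats (names are made up).
public class StorageOptimizationStats {
  // Raw bytes handed to the writer before columnar encoding/compression.
  private long totalBytesBeforeCompression;
  // Compression ratio the sizing heuristic assumed when planning this file.
  private double targetCompressionRatio;
  // Breakdown of written bytes between inserts and updates (base + log files).
  private long insertBytes;
  private long updateBytes;

  /** Actual compression ratio observed for this file, to compare against the target. */
  public double actualCompressionRatio(long fileSizeInBytes) {
    return totalBytesBeforeCompression == 0
        ? 0.0
        : (double) fileSizeInBytes / totalBytesBeforeCompression;
  }

  // getters/setters omitted
}
{code}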
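For phase 2, a minimal sketch of what the pluggable abstraction could look like; the class, method, and config names are hypothetical, and the real shape would be decided during implementation:
{code:java}
import java.util.List;

import org.apache.hudi.common.model.HoodieCommitMetadata;

// Hypothetical pluggable strategy, chosen via a config such as
// hoodie.write.storage.optimization.strategy.class (config name made up).
public abstract class StorageOptimizationStrategy {

  /** Estimated on-disk bytes per record for new base files, derived from commit history. */
  public abstract long estimateRecordSizeBytes(List<HoodieCommitMetadata> history);

  /** Estimated compression ratio to assume when sizing a new base file. */
  public abstract double estimateCompressionRatio(List<HoodieCommitMetadata> history);

  /** Max size (in bytes) the log files of a file group may grow to before compaction. */
  public abstract long estimateMaxLogFileSizeBytes(List<HoodieCommitMetadata> history);
}
{code}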
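And for phase 3, one heuristic we could try is generalizing the existing single-commit average to a moving average over the last few commits, using only fields already present in the commit metadata (totalWriteBytes, numWrites). A rough, untested sketch:
{code:java}
import java.util.List;

import org.apache.hudi.common.model.HoodieCommitMetadata;

// Sketch only: moving-average record-size heuristic over the last few commits.
public class MovingAverageRecordSizeEstimator {

  // Placeholder fallback when there is no usable history yet; the existing
  // default is configured in HoodieCompactionConfig (linked in the description below).
  private static final long FALLBACK_RECORD_SIZE_ESTIMATE = 1024;
  private static final int MAX_COMMITS_TO_CONSIDER = 5;

  /** Average on-disk bytes per record over (up to) the last few commits. */
  public static long averageBytesPerRecord(List<HoodieCommitMetadata> commits) {
    long totalBytes = 0;
    long totalRecords = 0;
    int used = 0;
    // Walk from the most recent commit backwards and stop after a fixed window.
    for (int i = commits.size() - 1; i >= 0 && used < MAX_COMMITS_TO_CONSIDER; i--, used++) {
      HoodieCommitMetadata meta = commits.get(i);
      totalBytes += meta.fetchTotalBytesWritten();
      totalRecords += meta.fetchTotalRecordsWritten();
    }
    return totalRecords == 0 ? FALLBACK_RECORD_SIZE_ESTIMATE : totalBytes / totalRecords;
  }
}
{code}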
> Estimation of compression ratio & other dynamic storage knobs based on historical stats
> ----------------------------------------------------------------------------------------
>
>                 Key: HUDI-64
>                 URL: https://issues.apache.org/jira/browse/HUDI-64
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: Storage Management, Writer Core
>            Reporter: Vinoth Chandar
>            Assignee: Ethan Guo
>            Priority: Blocker
>              Labels: help-requested
>             Fix For: 0.11.0
>
>
> Something core to Hudi writing is using heuristics or runtime workload statistics to optimize aspects of storage like file sizes, partitioning and so on. Below lists all such places.
>
> # Compression ratio for parquet: [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L46]. This is used by HoodieWrapperFileSystem to estimate the number of bytes it has written for a given parquet file and to close the parquet file once the configured size has been reached. At the DFSOutputStream level we only know the bytes written before compression. Once enough data has been written, it should be possible to replace this with a simple estimate of the average record size (the commit metadata gives you the size and number of records in each file).
> # A very similar problem exists for log files: [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L52]. We write data into logs in avro and can log updates to the same record in parquet multiple times. We again need to estimate how large the log file(s) can grow while still being able to produce a parquet file of the configured size during compaction.
> # WorkloadProfile: [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java] caches the input records using Spark caching and computes the shape of the workload, i.e., how many records per partition, how many inserts vs. updates, etc. This is used by the Partitioner here [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L141] for assigning records to a file group. This is the critical one to replace for Flink support and probably the hardest, since we would need to guess the input, which is not always possible.
> # Within the partitioner, we already derive a simple average size per record [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L756] from the last commit metadata alone. This can be generalized. (Default: [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L71])
>
> Our goal in this Jira is to see if we can derive this information in the background purely from the commit metadata. Some parts of this are open-ended. A good starting point would be to see what's feasible and to estimate the ROI before actually implementing anything.
>
> Roughly along the lines of [https://github.com/uber/hudi/issues/270]
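As a footnote to item 1 in the quoted description: the record-size-based check it suggests could be as simple as the following sketch, assuming the average record size comes from commit metadata (e.g., the moving-average sketch in the comment above); the class and method names are made up:
{code:java}
// Illustrative only: stop adding records to a base file once the estimated
// on-disk size (records written so far * average record size from commit
// metadata) reaches the configured max file size, instead of multiplying
// stream-level bytes by a configured compression ratio.
public final class FileSizeCheck {

  private FileSizeCheck() {
  }

  /** True if the writer can take one more record without exceeding the target file size. */
  public static boolean canWriteOneMore(long recordsWritten, long avgRecordSizeBytes, long maxFileSizeBytes) {
    return (recordsWritten + 1) * avgRecordSizeBytes <= maxFileSizeBytes;
  }
}
{code}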