[ 
https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475171#comment-17475171
 ] 

Ethan Guo commented on HUDI-64:
-------------------------------

[~vinoth] [~yanghua] Based on my reading of the thread, it looks like item #3 
was related to the integration of Hudi with Flink at the time.  Since Flink is 
already integrated with Hudi, #3 now looks more like an optimization we can do 
to improve the write path in Hudi + Flink by leveraging the WorkloadProfile 
(which is still not used in the Flink engine).  Is my understanding correct?

I think we can approach the optimization aspects of storage like file sizes, 
partitioning, etc. in three phases (concrete items are not exhaustive):
 # Expanding commit metadata with useful storage information
 ** Goal:  Commit metadata are the ground truth for the heuristics.  One can 
always measure how well the estimation/heuristics do by only looking at the 
commit metadata, e.g., comparing actual vs targeted compression ratio, actual 
vs targeted file size, etc. without scanning the data files.
 ** Concrete items:
 *** Add bytes written before compression, targeted compression ratio, and 
sizing info (base files)
 *** Add a breakdown of bytes between inserts & updates (base + log files)
 # Providing a framework to plug in customized estimation/heuristic algorithms
 ** Goal: New estimation/heuristic algorithms can be plugged in by providing 
the class name through a config, without changing core write pipeline code 
(see the sketch after this list)
 ** Concrete items:
 *** Add an abstract class for the optimization strategy containing methods for 
different operations, file sizing, estimation of inserts/updates, etc.
 # Experimenting with different estimation/heuristic algorithms
 ** Goal: Try different heuristics with perf evaluation, e.g., a simple 
average like the existing one, a moving average, using historical info, etc.
 ** Concrete items:
 *** Convert existing heuristics into an optimization strategy to start with
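
For item 2, a rough sketch of what such a pluggable strategy could look like. 
This is illustrative only; the class and method names below are hypothetical 
and do not exist in Hudi today:
{code:java}
import java.io.Serializable;

import org.apache.hudi.common.table.HoodieTableMetaClient;

// Hypothetical abstraction for item 2; names are illustrative only.
public abstract class StorageEstimationStrategy implements Serializable {

  // Estimate the average on-disk (post-compression) record size, typically
  // derived from historical commit metadata.
  public abstract long estimateAvgRecordSizeBytes(HoodieTableMetaClient metaClient);

  // Estimate the parquet compression ratio to assume when sizing base files.
  public abstract double estimateCompressionRatio(HoodieTableMetaClient metaClient);

  // Estimate how large a log file can grow and still compact into a base file
  // of the configured target size.
  public abstract long estimateMaxLogFileSizeBytes(HoodieTableMetaClient metaClient,
                                                   long targetBaseFileSizeBytes);
}
{code}
The concrete implementation could then be selected by class name through a 
config key (hypothetical, e.g. hoodie.storage.estimation.strategy.class), so 
that new heuristics can be dropped in without touching the write pipeline.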

I would say items 1 and 2 above are low-hanging fruit that we can do without 
changing the existing estimation/heuristics.  Item 3 is non-trivial and may 
require some effort to get right; but when we get time, we can always 
experiment with something and make incremental progress.
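
For item 3, as one example of something to experiment with, here is a minimal 
sketch of a moving-average record size estimator over the write stats of the 
last few commits (hypothetical class; assumes the standard HoodieWriteStat 
getters):
{code:java}
import java.util.List;

import org.apache.hudi.common.model.HoodieWriteStat;

// Illustrative only: weights each file by its record count and averages over
// stats collected from the last N commits, instead of a single previous commit.
public class MovingAverageRecordSizeEstimator {

  public static long estimateAvgRecordSizeBytes(List<HoodieWriteStat> recentStats,
                                                long fallbackBytes) {
    long totalBytes = 0;
    long totalRecords = 0;
    for (HoodieWriteStat stat : recentStats) {
      totalBytes += stat.getFileSizeInBytes();
      totalRecords += stat.getNumWrites();
    }
    // Fall back to the configured default when there is not enough history yet.
    return totalRecords > 0 ? totalBytes / totalRecords : fallbackBytes;
  }
}
{code}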
----
Current commit metadata:

 
{code:java}
    {
      "fileId" : "a525f37d-36f3-4543-8dc9-85596d307049-20",
      "path" : 
"2021/7/19/a525f37d-36f3-4543-8dc9-85596d307049-20_15-5-78_20211222170050726.parquet",
      "prevCommit" : "null",
      "numWrites" : 4420,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 4420,
      "totalWriteBytes" : 1905577,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "2021/7/19",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 1905577,
      "minEventTime" : null,
      "maxEventTime" : null
    }
{code}
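
To illustrate the "ground truth" point in item 1: the actual average on-disk 
record size can already be read off such a write stat (a sketch, assuming the 
usual HoodieWriteStat getters); the piece that is missing today is the 
bytes-before-compression figure needed to compute the actual compression ratio:
{code:java}
import org.apache.hudi.common.model.HoodieWriteStat;

public class CommitStatExample {

  // For the stat above: 1905577 bytes / 4420 records ≈ 431 bytes per record on disk.
  static long actualAvgRecordSizeBytes(HoodieWriteStat stat) {
    return stat.getFileSizeInBytes() / Math.max(1L, stat.getNumWrites());
  }

  // Once bytes-before-compression is recorded per file (item 1), the actual
  // compression ratio follows directly, e.g. with a hypothetical new getter:
  //   (double) stat.getBytesWrittenBeforeCompression() / stat.getFileSizeInBytes()
}
{code}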
 

 

> Estimation of compression ratio & other dynamic storage knobs based on 
> historical stats
> ---------------------------------------------------------------------------------------
>
>                 Key: HUDI-64
>                 URL: https://issues.apache.org/jira/browse/HUDI-64
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: Storage Management, Writer Core
>            Reporter: Vinoth Chandar
>            Assignee: Ethan Guo
>            Priority: Blocker
>              Labels: help-requested
>             Fix For: 0.11.0
>
>
> Something core to Hudi writing is using heuristics or runtime workload 
> statistics to optimize aspects of storage like file sizes, partitioning and 
> so on.  
> Below lists all such places. 
>  
>  # Compression ratio for parquet 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L46]
>  . This is used by HoodieWrapperFileSystem to estimate the amount of bytes it 
> has written for a given parquet file and to close the parquet file once the 
> configured size has been reached. At the DFSOutputStream level we only know 
> the bytes written before compression. Once enough data has been written, it 
> should be possible to replace this with a simple estimate of the avg record 
> size (commit metadata would give you the size and number of records in each file)
>  # A very similar problem exists for log files 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L52]
>  We write data into logs in avro and can log updates to the same record in 
> parquet multiple times. We again need to estimate how large the log file(s) 
> can grow and still be able to produce a parquet file of the configured size 
> during compaction. (hope I conveyed this clearly)
>  # WorkloadProfile : 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java]
>  caches the input records using Spark caching and computes the shape of the 
> workload, i.e., how many records per partition, how many inserts vs updates, 
> etc. This is used by the Partitioner here 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L141]
>  for assigning records to a file group. This is the critical one to replace 
> for Flink support and probably the hardest, since we need to guess the input, 
> which is not always possible? 
>  # Within partitioner, we already derive a simple average size per record 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L756]
>  from the last commit metadata alone. This can be generalized.  (default : 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L71])
> 
> Our goal in this Jira is to see if we could derive this information in the 
> background purely using the commit metadata. Some parts of this are 
> open-ended. A good starting point would be to see what's feasible and to 
> estimate the ROI before actually implementing.
> 
> Roughly along the lines of [https://github.com/uber/hudi/issues/270] 


