[
https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972319#comment-16972319
]
vinoyang commented on HUDI-64:
------------------------------
Hi [~vinoth] IMO, we can refer to Flink's batch-processing optimizer.
It has several components:
* {{Costs}}: a data structure that describes costs; there are two kinds:
quantifiable costs and heuristic costs;
* {{EstimateProvider}}: defines methods through which operators and
connections provide estimates about data size and characteristics;
* {{CostEstimator}}: defines cost-estimation methods and implements the
basic work method that computes the cost of an operator by adding the
input shipping cost, the input local cost, and the driver cost;
* {{CompilerHints}}: a class encapsulating compiler hints that describe the
behavior of the user function. If set, the optimizer uses them to estimate
the sizes of intermediate results. These values are optional hints; the
optimizer always generates a valid plan without them, but they may help it
choose a better plan.
These components work together to optimize a Flink batch job before it runs.
For more details, see:
https://github.com/apache/flink/blob/master/flink-optimizer/src/main/java/org/apache/flink/optimizer/costs/Costs.java
IMO, we could follow a similar mechanism to do this optimization, provided we
collect enough metrics.
WDYT?
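To make the shape of that mechanism concrete, here is a minimal sketch in Java. All class and field names are illustrative assumptions, not actual Flink or Hudi APIs; it only shows how a {{Costs}} structure, an {{EstimateProvider}}, and a cost-estimation method might fit together, including the heuristic fallback when no size estimate is available:

```java
// Hypothetical sketch, loosely modeled on Flink's optimizer design.
// None of these names are real Flink/Hudi APIs.
public class CostModelSketch {

    /** Quantifiable costs, plus a heuristic component for unknown sizes. */
    static final class Costs {
        double networkCost;   // bytes shipped over the network
        double diskCost;      // bytes read/written locally
        double heuristicCost; // used when no reliable estimate exists

        /** Accumulate another operator's costs into this one. */
        void add(Costs other) {
            networkCost += other.networkCost;
            diskCost += other.diskCost;
            heuristicCost += other.heuristicCost;
        }
    }

    /** Provides size estimates for an operator's output. */
    interface EstimateProvider {
        double estimatedOutputBytes();  // negative means "unknown"
        double estimatedRecordCount();
    }

    /** Cost of consuming one input: shipping cost or local cost, or a heuristic. */
    static Costs estimate(EstimateProvider input, boolean shipsOverNetwork) {
        Costs costs = new Costs();
        double bytes = input.estimatedOutputBytes();
        if (bytes < 0) {
            costs.heuristicCost = 1.0; // unknown size: fall back to heuristic cost
        } else if (shipsOverNetwork) {
            costs.networkCost = bytes;
        } else {
            costs.diskCost = bytes;
        }
        return costs;
    }

    public static void main(String[] args) {
        EstimateProvider scan = new EstimateProvider() {
            public double estimatedOutputBytes() { return 1_000_000; }
            public double estimatedRecordCount() { return 10_000; }
        };
        Costs c = estimate(scan, true);
        System.out.println((long) c.networkCost); // prints 1000000
    }
}
```

The point of the sketch is the separation of concerns: estimates come from providers, costs are a plain data structure, and the estimator just combines them, which is what would let Hudi swap in commit-metadata-driven estimates later.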
> Estimation of compression ratio & other dynamic storage knobs based on
> historical stats
> ---------------------------------------------------------------------------------------
>
> Key: HUDI-64
> URL: https://issues.apache.org/jira/browse/HUDI-64
> Project: Apache Hudi (incubating)
> Issue Type: New Feature
> Components: Storage Management, Write Client
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
>
> Something core to Hudi writing is using heuristics or runtime workload
> statistics to optimize aspects of storage like file sizes, partitioning and
> so on.
> Below lists all such places.
>
> # Compression ratio for parquet
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L46]
> . This is used by HoodieWrapperFileSystem to estimate the number of bytes it
> has written for a given parquet file and to close the file once the
> configured size is reached. At the DFSOutputStream level we only know the
> bytes written before compression. Once enough data has been written, it
> should be possible to replace this with a simple estimate of the average
> record size (commit metadata gives the size and number of records in each
> file)
> # Very similar problem exists for log files
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L52]
> We write data into logs in avro and can log updates to the same record in
> parquet multiple times. We need to estimate, again, how large the log
> file(s) can grow while we can still produce a parquet file of the
> configured size during compaction. (hope I conveyed this clearly)
> # WorkloadProfile :
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java]
> caches the input records using Spark caching and computes the shape of the
> workload, i.e. how many records per partition, how many inserts vs updates,
> etc. This is used by the Partitioner here
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L141]
> for assigning records to a file group. This is the critical one to replace
> for Flink support, and probably the hardest, since we would need to guess
> the input, which is not always possible
> # Within partitioner, we already derive a simple average size per record
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L756]
> from the last commit metadata alone. This can be generalized. (default :
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L71])
>
> Our goal in this Jira is to see if we could derive this information in the
> background purely from the commit metadata. Some parts of this are
> open-ended. A good starting point would be to see what's feasible and
> estimate the ROI before actually implementing
>
> Roughly along the lines of [https://github.com/uber/hudi/issues/270]
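For items 1 and 4 in the description above, the commit-metadata-driven estimate can be sketched quite simply. The class and method names below are illustrative, not actual Hudi APIs; the sketch just shows deriving an average on-disk record size from the byte/record totals of recent commits, generalizing the "last commit only" approach, with a configured default as fallback when no history exists:

```java
// Hypothetical sketch: average record size from recent commit metadata.
// CommitStats stands in for the totals real commit metadata would report;
// it is not a real Hudi class.
import java.util.List;

public class RecordSizeEstimator {

    /** Totals a commit reports for the files it wrote. */
    static final class CommitStats {
        final long totalBytesWritten;
        final long totalRecordsWritten;
        CommitStats(long bytes, long records) {
            this.totalBytesWritten = bytes;
            this.totalRecordsWritten = records;
        }
    }

    /**
     * Average on-disk bytes per record over the given commits, falling back
     * to a configured default when there is no history yet.
     */
    static long averageRecordSize(List<CommitStats> recentCommits, long defaultSize) {
        long bytes = 0, records = 0;
        for (CommitStats c : recentCommits) {
            bytes += c.totalBytesWritten;
            records += c.totalRecordsWritten;
        }
        return records > 0 ? bytes / records : defaultSize;
    }

    public static void main(String[] args) {
        List<CommitStats> commits = List.of(
                new CommitStats(120_000_000L, 1_000_000L),
                new CommitStats(80_000_000L, 1_000_000L));
        // (120 MB + 80 MB) / 2 M records = 100 bytes per record
        System.out.println(averageRecordSize(commits, 1024)); // prints 100
    }
}
```

Since the numerator is post-compression bytes, this estimate would also subsume the static compression-ratio config once enough commits have accumulated.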
--
This message was sent by Atlassian Jira
(v8.3.4#803005)