[ 
https://issues.apache.org/jira/browse/FLINK-31125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated FLINK-31125:
-----------------------------
    Summary: Flink ML benchmark framework should minimize the source operator 
overhead  (was: Flink ML benchmark result should not include data generation 
overhead)

> Flink ML benchmark framework should minimize the source operator overhead
> -------------------------------------------------------------------------
>
>                 Key: FLINK-31125
>                 URL: https://issues.apache.org/jira/browse/FLINK-31125
>             Project: Flink
>          Issue Type: Improvement
>          Components: Library / Machine Learning
>            Reporter: Dong Lin
>            Assignee: Dong Lin
>            Priority: Major
>             Fix For: ml-2.2.0
>
>
> The Flink ML benchmark framework estimates throughput by having a source 
> operator generate a given number (e.g. 10^7) of input records with random 
> values, letting the given AlgoOperator process these input records, and 
> dividing the number of records by the total execution time. 
> The overhead of generating random values for all input records has an 
> observable impact on the estimated throughput. We would like to minimize the 
> overhead of the source operator so that the benchmark result reflects the 
> throughput of the AlgoOperator as closely as possible.
> Note that [spark-sql-perf|https://github.com/databricks/spark-sql-perf] 
> generates all input records in advance into memory before running the 
> benchmark. This allows the Spark ML benchmark to read records from memory 
> instead of generating values for those records during the benchmark.
> We can generate a value once and reuse it for all input records. This 
> approach minimizes the overhead of the source operator and allows us to 
> compare the Flink ML benchmark result fairly with the Spark ML benchmark 
> result (obtained using spark-sql-perf).
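
The idea in the description can be illustrated with a minimal, standalone Java sketch. It does not use the Flink ML benchmark classes: the class name SourceOverheadSketch, its methods, and the trivial process() stand-in for an AlgoOperator are all hypothetical and exist only for illustration. The sketch computes throughput as the number of records divided by the elapsed time, once with a source that fills a fresh random vector per record and once with a source that generates a single vector and returns it for every record.

```java
import java.util.Random;
import java.util.function.Supplier;

public class SourceOverheadSketch {

    /** Records per second, given a record count and the elapsed time in nanoseconds. */
    static double throughput(long numRecords, long elapsedNanos) {
        return numRecords / (elapsedNanos / 1e9);
    }

    /** Hypothetical source that fills a fresh random vector for every record. */
    static Supplier<double[]> perRecordSource(int dim, long seed) {
        Random random = new Random(seed);
        return () -> {
            double[] values = new double[dim];
            for (int i = 0; i < dim; i++) {
                values[i] = random.nextDouble();
            }
            return values;
        };
    }

    /** Hypothetical source that generates one vector and returns it for every record. */
    static Supplier<double[]> reusingSource(int dim, long seed) {
        double[] cached = perRecordSource(dim, seed).get();
        return () -> cached;
    }

    /** Stand-in for the operator under test (an assumption): sums the vector. */
    static double process(double[] record) {
        double sum = 0.0;
        for (double v : record) {
            sum += v;
        }
        return sum;
    }

    /** Pulls numRecords records from the source, processes them, and returns records/second. */
    static double run(Supplier<double[]> source, long numRecords) {
        double sink = 0.0; // accumulated so the processing work is not optimized away
        long start = System.nanoTime();
        for (long i = 0; i < numRecords; i++) {
            sink += process(source.get());
        }
        long elapsed = Math.max(1L, System.nanoTime() - start);
        if (Double.isNaN(sink)) {
            throw new IllegalStateException("unexpected NaN");
        }
        return throughput(numRecords, elapsed);
    }

    public static void main(String[] args) {
        long numRecords = 1_000_000L;
        int dim = 100;
        System.out.println("per-record generation: "
                + run(perRecordSource(dim, 42L), numRecords) + " records/s");
        System.out.println("reused value:          "
                + run(reusingSource(dim, 42L), numRecords) + " records/s");
    }
}
```

Returning the same array for every record is only safe if the operator under test does not mutate its input; a real source would also need to account for that. The numbers printed by this sketch depend on the machine and JVM warm-up and say nothing about actual Flink ML results.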



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
