[ https://issues.apache.org/jira/browse/FLINK-31125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694972#comment-17694972 ]

Dong Lin commented on FLINK-31125:
----------------------------------

Merged to flink-ml master branch 5bcbd01169a3bfce0c92e0aae56dc55cf2489d37

> Flink ML benchmark framework should minimize the source operator overhead
> -------------------------------------------------------------------------
>
>                 Key: FLINK-31125
>                 URL: https://issues.apache.org/jira/browse/FLINK-31125
>             Project: Flink
>          Issue Type: Improvement
>          Components: Library / Machine Learning
>            Reporter: Dong Lin
>            Assignee: Dong Lin
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: ml-2.2.0
>
>
> The Flink ML benchmark framework estimates the throughput by having a source 
> operator generate a given number (e.g. 10^7) of input records with random 
> values, letting the given AlgoOperator process these input records, and 
> dividing the number of records by the total execution time.
>
> The overhead of generating random values for all input records has an 
> observable impact on the estimated throughput. We would like to minimize the 
> overhead of the source operator so that the benchmark result can focus on the 
> throughput of the AlgoOperator as much as possible.
>
> Note that [spark-sql-perf|https://github.com/databricks/spark-sql-perf] 
> generates all input records in advance into memory before running the 
> benchmark. This allows the Spark ML benchmark to read records from memory 
> instead of generating values for those records during the benchmark.
>
> We can generate the random values once and re-use them for all input records, 
> as sketched below. This approach minimizes the source operator overhead and 
> allows us to compare Flink ML benchmark results fairly with Spark ML 
> benchmark results (from spark-sql-perf).
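>
> For illustration, a minimal sketch of the reuse-one-record idea. The class and 
> field names (e.g. ReusedRowSource) are hypothetical and do not reflect the 
> actual implementation in the merged commit:
>
> {code:java}
> import java.util.Random;
> import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;
> import org.apache.flink.types.Row;
>
> // Hypothetical source that generates one random row up front and emits it
> // repeatedly, removing per-record random-value generation from the hot path.
> public class ReusedRowSource extends RichParallelSourceFunction<Row> {
>     private final long numRecords; // records to emit from this subtask
>     private final int numFields;   // double-valued fields per row
>     private volatile boolean running = true;
>
>     public ReusedRowSource(long numRecords, int numFields) {
>         this.numRecords = numRecords;
>         this.numFields = numFields;
>     }
>
>     @Override
>     public void run(SourceContext<Row> ctx) {
>         // Generate random values exactly once, before the emission loop.
>         Random random = new Random();
>         Row reusedRow = new Row(numFields);
>         for (int i = 0; i < numFields; i++) {
>             reusedRow.setField(i, random.nextDouble());
>         }
>         // Emit the same pre-generated row for every input record.
>         for (long i = 0; i < numRecords && running; i++) {
>             ctx.collect(reusedRow);
>         }
>     }
>
>     @Override
>     public void cancel() {
>         running = false;
>     }
> }
> {code}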



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
