[ https://issues.apache.org/jira/browse/FLINK-31125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dong Lin updated FLINK-31125:
-----------------------------
    Summary: Flink ML benchmark framework should minimize the source operator overhead  (was: Flink ML benchmark result should not include data generation overhead)

> Flink ML benchmark framework should minimize the source operator overhead
> -------------------------------------------------------------------------
>
>                 Key: FLINK-31125
>                 URL: https://issues.apache.org/jira/browse/FLINK-31125
>             Project: Flink
>          Issue Type: Improvement
>          Components: Library / Machine Learning
>            Reporter: Dong Lin
>            Assignee: Dong Lin
>            Priority: Major
>             Fix For: ml-2.2.0
>
>
> The Flink ML benchmark framework estimates throughput by having a source
> operator generate a given number (e.g. 10^7) of input records with random
> values, letting the given AlgoOperator process these input records, and
> dividing the number of records by the total execution time.
>
> The overhead of generating random values for all input records has an
> observable impact on the estimated throughput. We would like to minimize the
> overhead of the source operator so that the benchmark result reflects the
> throughput of the AlgoOperator as closely as possible.
>
> Note that [spark-sql-perf|https://github.com/databricks/spark-sql-perf]
> generates all input records in advance into memory before running the
> benchmark. This allows the Spark ML benchmark to read records from memory
> instead of generating values for those records during the benchmark.
>
> We can generate a value once and reuse it for all input records. This
> approach minimizes the overhead of the source operator and allows us to
> compare the Flink ML benchmark result fairly with the Spark ML benchmark
> result (obtained using spark-sql-perf).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
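
The throughput estimate and the generate-once idea described in this issue can be illustrated with a minimal, framework-agnostic sketch. This is not Flink ML code: `fresh_source`, `reused_source`, and `estimate_throughput` are hypothetical names introduced here for illustration, and plain Python generators stand in for the source operator and the AlgoOperator.

```python
import random
import time
from typing import Callable, Iterable, Iterator, List


def fresh_source(num_records: int, dim: int, seed: int = 0) -> Iterator[List[float]]:
    """Hypothetical source: draws new random values for every record."""
    rng = random.Random(seed)
    for _ in range(num_records):
        yield [rng.random() for _ in range(dim)]


def reused_source(num_records: int, dim: int, seed: int = 0) -> Iterator[List[float]]:
    """Hypothetical source: draws random values once and emits the same record."""
    rng = random.Random(seed)
    value = [rng.random() for _ in range(dim)]  # generated a single time
    for _ in range(num_records):
        yield value  # same object every time; the consumer must not mutate it


def estimate_throughput(
    source: Iterable[List[float]], process: Callable[[List[float]], object]
) -> float:
    """Records per second: number of records / total execution time."""
    count = 0
    start = time.perf_counter()
    for record in source:
        process(record)
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very small inputs.
    return count / max(elapsed, 1e-9)


if __name__ == "__main__":
    n, dim = 100_000, 10
    # `sum` stands in for the operator under test.
    print("fresh :", estimate_throughput(fresh_source(n, dim), sum))
    print("reused:", estimate_throughput(reused_source(n, dim), sum))
```

In this sketch the measured time still includes the source loop, so the estimate covers both source and operator; reusing one value only removes the per-record random generation cost. One likely trade-off is that an operator whose cost depends on the data would see identical input for every record.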