lindong28 commented on code in PR #111: URL: https://github.com/apache/flink-ml/pull/111#discussion_r900198528
########## flink-ml-benchmark/src/main/java/org/apache/flink/ml/benchmark/BenchmarkUtils.java: ########## @@ -99,6 +99,7 @@ private static BenchmarkResult runBenchmark( DataGenerator<?> modelDataGenerator) throws Exception { StreamExecutionEnvironment env = TableUtils.getExecutionEnvironment(tEnv); + env.getConfig().enableObjectReuse(); Review Comment: According to [1], `enableObjectReuse` can lead to bugs when the user-code function of an operation is not aware of this behavior. So we can avoid bugs if users take care when writing UDF. If we don't enable object re-use here, users won't be able to reproduce the best possible performance by running our benchmarks, which weakens the purpose of adding these benchmarks. I am not sure the two points mentioned above are valid concerns: - We should be able to make sure that all Flink ML algorithms can run with object-reuse enabled. And it is useful to catch and fix issues by running benchmarks with this config. - User's production code typically won't use BenchmarkUtils. Thus their production code won't be affected by the `enableObjectReuse()` added here. - While Flink users can change configuration parameters, it is generally preferred to provide the best set of configuration parameters in our benchmark setup with our best effort. [1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/execution_configuration/ ########## flink-ml-benchmark/src/main/java/org/apache/flink/ml/benchmark/BenchmarkUtils.java: ########## @@ -99,6 +99,7 @@ private static BenchmarkResult runBenchmark( DataGenerator<?> modelDataGenerator) throws Exception { StreamExecutionEnvironment env = TableUtils.getExecutionEnvironment(tEnv); + env.getConfig().enableObjectReuse(); Review Comment: According to [1], `enableObjectReuse` can lead to bugs when the user-code function of an operation is not aware of this behavior. So we can avoid bugs if users take care when writing UDF. If we don't enable object re-use here, users won't be able to reproduce the best possible performance by running our benchmarks, which weakens the purpose of adding these benchmarks. I am not sure the two points mentioned above are valid concerns: - We should be able to make sure that all Flink ML algorithms can run with object-reuse enabled. And it is useful to catch and fix issues by running benchmarks with this config. - User's production code typically won't use BenchmarkUtils. Thus their production code won't be affected by the `enableObjectReuse()` added here. - While Flink users can change configuration parameters, it is still generally preferred to provide the best set of configuration parameters in our benchmark setup with our best effort. [1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/execution_configuration/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org