[GitHub] [flink-ml] lindong28 commented on a diff in pull request #111: [FLINK-27877] Improve StringIndexer performance and optimize flink-ml-bench

GitBox Fri, 17 Jun 2022 07:53:33 -0700


lindong28 commented on code in PR #111:
URL: https://github.com/apache/flink-ml/pull/111#discussion_r900198528



##########
flink-ml-benchmark/src/main/java/org/apache/flink/ml/benchmark/BenchmarkUtils.java:
##########
@@ -99,6 +99,7 @@ private static BenchmarkResult runBenchmark(
             DataGenerator<?> modelDataGenerator)
             throws Exception {
         StreamExecutionEnvironment env = 
TableUtils.getExecutionEnvironment(tEnv);
+        env.getConfig().enableObjectReuse();

Review Comment:
   According to [1], `enableObjectReuse` can lead to bugs when the user-code 
function of an operation is not aware of this behavior. So we can avoid bugs if 
users take care when writing UDF.
   
   If we don't enable object re-use here, users won't be able to reproduce the 
best possible performance by running our benchmarks, which weakens the purpose 
of adding these benchmarks.
   
   I am not sure the two points mentioned above are valid concerns:
   - We should be able to make sure that all Flink ML algorithms can run with 
object-reuse enabled. And it is useful to catch and fix issues by running 
benchmarks with this config.
   - User's production code typically won't use BenchmarkUtils. Thus their 
production code won't be affected by the `enableObjectReuse()` added here.
   - While Flink users can change configuration parameters, it is generally 
preferred to provide the best set of configuration parameters in our benchmark 
setup with our best effort.
   
   [1] 
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/execution_configuration/



##########
flink-ml-benchmark/src/main/java/org/apache/flink/ml/benchmark/BenchmarkUtils.java:
##########
@@ -99,6 +99,7 @@ private static BenchmarkResult runBenchmark(
             DataGenerator<?> modelDataGenerator)
             throws Exception {
         StreamExecutionEnvironment env = 
TableUtils.getExecutionEnvironment(tEnv);
+        env.getConfig().enableObjectReuse();

Review Comment:
   According to [1], `enableObjectReuse` can lead to bugs when the user-code 
function of an operation is not aware of this behavior. So we can avoid bugs if 
users take care when writing UDF.
   
   If we don't enable object re-use here, users won't be able to reproduce the 
best possible performance by running our benchmarks, which weakens the purpose 
of adding these benchmarks.
   
   I am not sure the two points mentioned above are valid concerns:
   - We should be able to make sure that all Flink ML algorithms can run with 
object-reuse enabled. And it is useful to catch and fix issues by running 
benchmarks with this config.
   - User's production code typically won't use BenchmarkUtils. Thus their 
production code won't be affected by the `enableObjectReuse()` added here.
   - While Flink users can change configuration parameters, it is still 
generally preferred to provide the best set of configuration parameters in our 
benchmark setup with our best effort.
   
   [1] 
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution/execution_configuration/



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [flink-ml] lindong28 commented on a diff in pull request #111: [FLINK-27877] Improve StringIndexer performance and optimize flink-ml-bench

Reply via email to