[ https://issues.apache.org/jira/browse/FLINK-18433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147739#comment-17147739 ]
Piotr Nowojski commented on FLINK-18433:
----------------------------------------

[~AHeise], this metric comes from {{org.apache.flink.runtime.io.network.api.writer.RecordWriter#numBuffersOut}}. A text search through the code might not find it easily because of the way its name is constructed: {{MetricNames#IO_NUM_BUFFERS_OUT_RATE = IO_NUM_BUFFERS_OUT + SUFFIX_RATE;}}.

Meta note, [~Aihua]: using the number of buffers per second as a measure of throughput might be misleading. The actual record throughput might be the same or even better if, for whatever reason, we utilise each buffer more (that is, if the buffers we are sending are fuller). Could you and/or [~AHeise] verify the actual throughput and double-check whether the number of buffers per second has indeed decreased? As [~AHeise] suggested, it could theoretically also be caused by some other measurement fault.

If that's the case, it might be interesting in its own right to investigate why the buffers are fuller. Maybe we have increased the overhead per buffer, but thanks to the low-latency changes we are now sending more data per buffer (increasing latency a bit).

> From the end-to-end performance test results, 1.11 has a regression
> -------------------------------------------------------------------
>
>                 Key: FLINK-18433
>                 URL: https://issues.apache.org/jira/browse/FLINK-18433
>             Project: Flink
>          Issue Type: Bug
>          Components: API / Core, API / DataStream
>    Affects Versions: 1.11.0
>         Environment: 3 machines
> [|https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations_1.11/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java]
>            Reporter: Aihua Li
>            Priority: Major
>         Attachments: flink_11.log.gz
>
>
>
> I ran end-to-end performance tests on Release-1.10 and Release-1.11.
> The results were as follows:
> |scenarioName|release-1.10|release-1.11|change|
> |OneInput_Broadcast_LazyFromSource_ExactlyOnce_10_rocksdb|46.175|43.81333333|-5.11%|
> |OneInput_Rescale_LazyFromSource_ExactlyOnce_100_heap|211.835|200.355|-5.42%|
> |OneInput_Rebalance_LazyFromSource_ExactlyOnce_1024_rocksdb|1721.041667|1618.323333|-5.97%|
> |OneInput_KeyBy_LazyFromSource_ExactlyOnce_10_heap|46|43.615|-5.18%|
> |OneInput_Broadcast_Eager_ExactlyOnce_100_rocksdb|212.105|199.6883333|-5.85%|
> |OneInput_Rescale_Eager_ExactlyOnce_1024_heap|1754.64|1600.123333|-8.81%|
> |OneInput_Rebalance_Eager_ExactlyOnce_10_rocksdb|45.91666667|43.09833333|-6.14%|
> |OneInput_KeyBy_Eager_ExactlyOnce_100_heap|212.0816667|200.7266667|-5.35%|
> |OneInput_Broadcast_LazyFromSource_AtLeastOnce_1024_rocksdb|1718.245|1614.381667|-6.04%|
> |OneInput_Rescale_LazyFromSource_AtLeastOnce_10_heap|46.12|43.55166667|-5.57%|
> |OneInput_Rebalance_LazyFromSource_AtLeastOnce_100_rocksdb|212.0383333|200.3883333|-5.49%|
> |OneInput_KeyBy_LazyFromSource_AtLeastOnce_1024_heap|1762.048333|1606.408333|-8.83%|
> |OneInput_Broadcast_Eager_AtLeastOnce_10_rocksdb|46.05833333|43.49666667|-5.56%|
> |OneInput_Rescale_Eager_AtLeastOnce_100_heap|212.2333333|201.1883333|-5.20%|
> |OneInput_Rebalance_Eager_AtLeastOnce_1024_rocksdb|1720.663333|1616.85|-6.03%|
> |OneInput_KeyBy_Eager_AtLeastOnce_10_heap|46.14|43.62333333|-5.45%|
> |TwoInputs_Broadcast_LazyFromSource_ExactlyOnce_100_rocksdb|156.9183333|152.9566667|-2.52%|
> |TwoInputs_Rescale_LazyFromSource_ExactlyOnce_1024_heap|1415.511667|1300.1|-8.15%|
> |TwoInputs_Rebalance_LazyFromSource_ExactlyOnce_10_rocksdb|34.29666667|34.16666667|-0.38%|
> |TwoInputs_KeyBy_LazyFromSource_ExactlyOnce_100_heap|158.3533333|151.8483333|-4.11%|
> |TwoInputs_Broadcast_Eager_ExactlyOnce_1024_rocksdb|1373.406667|1300.056667|-5.34%|
> |TwoInputs_Rescale_Eager_ExactlyOnce_10_heap|34.57166667|32.09666667|-7.16%|
> |TwoInputs_Rebalance_Eager_ExactlyOnce_100_rocksdb|158.655|147.44|-7.07%|
> |TwoInputs_KeyBy_Eager_ExactlyOnce_1024_heap|1356.611667|1292.386667|-4.73%|
> |TwoInputs_Broadcast_LazyFromSource_AtLeastOnce_10_rocksdb|34.01|33.205|-2.37%|
> |TwoInputs_Rescale_LazyFromSource_AtLeastOnce_100_heap|149.5883333|145.9966667|-2.40%|
> |TwoInputs_Rebalance_LazyFromSource_AtLeastOnce_1024_rocksdb|1359.74|1299.156667|-4.46%|
> |TwoInputs_KeyBy_LazyFromSource_AtLeastOnce_10_heap|34.025|29.68333333|-12.76%|
> |TwoInputs_Broadcast_Eager_AtLeastOnce_100_rocksdb|157.3033333|151.4616667|-3.71%|
> |TwoInputs_Rescale_Eager_AtLeastOnce_1024_heap|1368.74|1293.238333|-5.52%|
> |TwoInputs_Rebalance_Eager_AtLeastOnce_10_rocksdb|34.325|33.285|-3.03%|
> |TwoInputs_KeyBy_Eager_AtLeastOnce_100_heap|162.5116667|134.375|-17.31%|
> The results show a performance regression in 1.11: around 5% in most
> scenarios, with a maximum of about 17%. This needs to be checked.
> The test code:
> flink-1.10.0:
> [https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java]
> flink-1.11.0:
> [https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations_1.11/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java]
> The jobs were submitted with a command like this:
> bin/flink run -d -m 192.168.39.246:8081 -c
> org.apache.flink.basic.operations.PerformanceTestJob
> /home/admin/flink-basic-operations_2.11-1.10-SNAPSHOT.jar --topologyName
> OneInput --LogicalAttributesofEdges Broadcast --ScheduleMode LazyFromSource
> --CheckpointMode ExactlyOnce --recordSize 10 --stateBackend rocksdb
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
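
To make the point in the comment above concrete: record throughput is roughly buffers per second multiplied by the average number of records per buffer, so the buffer rate alone does not determine it. The following is a minimal sketch with hypothetical numbers (none of them are measurements from this issue). The class and method names are made up for illustration and are not part of Flink.

```java
public class BufferRateSketch {

    /**
     * Records per second implied by a buffer rate and an average buffer fill.
     * Both inputs are illustrative; nothing here reads real Flink metrics.
     */
    public static double recordsPerSecond(double buffersPerSecond, double recordsPerBuffer) {
        return buffersPerSecond * recordsPerBuffer;
    }

    /** Relative change from an old value to a new value, in percent. */
    public static double changePercent(double oldValue, double newValue) {
        return (newValue - oldValue) / oldValue * 100.0;
    }

    public static void main(String[] args) {
        // Hypothetical "before" run: more buffers, each less full.
        double buffersBefore = 1000.0;
        double recordsBefore = recordsPerSecond(buffersBefore, 100.0);

        // Hypothetical "after" run: fewer buffers, each fuller.
        double buffersAfter = 800.0;
        double recordsAfter = recordsPerSecond(buffersAfter, 125.0);

        // The buffer rate drops while the record rate stays the same.
        System.out.println("buffers/s change (%): " + changePercent(buffersBefore, buffersAfter));
        System.out.println("records/s change (%): " + changePercent(recordsBefore, recordsAfter));
    }
}
```

In this made-up scenario the buffer rate falls by a fifth while the record rate is unchanged, which is why comparing a records-per-second measurement between the two releases would be a useful cross-check.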