Hi Rui and RMs of Flink 1.20,

Thanks for driving this!

The available information indicates that this issue is environment- and
JDK-specific, and I also failed to reproduce it on my Mac. So my guess is
that it is caused by JIT behavior, which is unpredictable and sensitive to
small changes in the codebase. Considering the historical context of this
test provided by Piotr, I vote for "Won't fix" for this problem.

If anyone wants to investigate the benchmark environment, I can offer some
help; please reach out to me. JDK version info:

> openjdk version "11.0.19" 2023-04-18 LTS
> OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-2) (build 11.0.19+7-LTS)
> OpenJDK 64-Bit Server VM (Red_Hat-11.0.19.0.7-2) (build 11.0.19+7-LTS,
> mixed mode, sharing)

The OS version is Alibaba Cloud Linux 3.2104 LTS 64-bit[1]. The Linux
kernel version is 5.10.134-15.al8.x86_64.


Best,
Zakelly

[1]
https://www.alibabacloud.com/help/en/alinux/product-overview/release-notes-for-alibaba-cloud-linux
(See: Alibaba Cloud Linux 3.2104 U8, image id:
aliyun_3_x64_20G_alibase_20230727.vhd)

On Tue, May 21, 2024 at 8:15 PM Piotr Nowojski <pnowoj...@apache.org> wrote:

> Hi,
>
> Given what you wrote, that you have investigated the issue and couldn't
> find any easy explanation, I would suggest closing this ticket as "Won't
> do" or "Can not reproduce" and ignoring the problem.
>
> In the past there have been quite a few cases where some benchmark
> detected a performance regression. Sometimes those cannot be reproduced;
> other times (as is the case here), some seemingly unrelated change is
> causing the regression. The same thing has happened in this benchmark many
> times in the past [1], [2], [3], [4]. Generally speaking, this benchmark
> has been in the spotlight a couple of times [5].
>
> Note that there have been cases where this benchmark did detect a
> performance regression :)
>
> My personal suspicion is that after the commons-io version bump,
> something poked the JVM/JIT into compiling the string serialization code
> a bit differently, causing this regression. We have a couple of benchmarks
> that seem to be prone to such semi-intermittent issues. For example, the
> same benchmark was subject to this annoying pattern, which I've spotted in
> quite a few benchmarks over the years [6]:
>
> [image: https://imgur.com/a/AoygmWS]
>
> Here the benchmark results are very stable within a single JVM fork, but
> between two forks they can reach two different "stable" levels. In this
> case there seems to be a 50% chance of getting a stable "200 records/ms"
> and a 50% chance of "250 records/ms".
>
> A small interlude. Each of our benchmarks runs in 3 different JVM forks,
> each with 10 warm-up iterations and 10 measurement iterations. Each
> iteration invokes the benchmarked method for at least one second. So by
> "very stable" results, I mean that, for example, after the 2nd or 3rd
> warm-up iteration the results stabilize to within +/-1% and stay on that
> level for the whole duration of the fork.
>
> Given that we are repeating the same benchmark in 3 different forks, we
> can have by pure chance:
> - 3 slow forks - average 200 records/ms
> - 2 slow forks, 1 fast fork - average ~216.7 records/ms
> - 1 slow fork, 2 fast forks - average ~233.3 records/ms
> - 3 fast forks - average 250 records/ms
>
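
To make the arithmetic above concrete, here is a minimal Java sketch. It
assumes exactly two stable levels (200 and 250 records/ms) and that each
fork lands on either one independently with probability 0.5, which is only
an eyeballed reading of the chart. The class and method names are made up
for illustration and are not part of flink-benchmarks.

```java
import java.util.Locale;

public class ForkAverages {

    /** Mean score over all forks when fastForks of them land on the fast level. */
    static double mean(int fastForks, int forks, double slow, double fast) {
        return (fastForks * fast + (forks - fastForks) * slow) / forks;
    }

    /** Binomial coefficient "n choose k"; every intermediate value is an exact integer. */
    static long binomial(int n, int k) {
        long result = 1;
        for (int i = 1; i <= k; i++) {
            result = result * (n - k + i) / i;
        }
        return result;
    }

    /** Probability that exactly k of n independent forks are fast (assumption: independence). */
    static double probability(int k, int n, double pFast) {
        return binomial(n, k) * Math.pow(pFast, k) * Math.pow(1 - pFast, n - k);
    }

    public static void main(String[] args) {
        int forks = 3;          // from the thread: 3 JVM forks per benchmark
        double slow = 200.0;    // records/ms, "slow" stable level
        double fast = 250.0;    // records/ms, "fast" stable level
        double pFast = 0.5;     // assumption: 50/50 chance per fork

        for (int k = 0; k <= forks; k++) {
            System.out.printf(Locale.ROOT,
                    "%d fast fork(s): average %.1f records/ms, probability %.3f%n",
                    k, mean(k, forks, slow, fast), probability(k, forks, pFast));
        }
    }
}
```

Under these assumptions the reported average can land on any of four values
(200, about 216.7, about 233.3 or 250 records/ms), and the two extremes each
occur in only 1 out of 8 runs, so a run-to-run shift of several percent is
possible without any real change in the code.
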
> So this benchmark is susceptible to settling into different semi-stable
> states. As I wrote above, I guess something about the commons-io version
> bump just swayed it to a different semi-stable state :( I have never
> gotten desperate enough to actually dig further into what exactly causes
> this kind of issue.
>
> Best,
> Piotrek
>
> [1] https://issues.apache.org/jira/browse/FLINK-18684
> [2] https://issues.apache.org/jira/browse/FLINK-27133
> [3] https://issues.apache.org/jira/browse/FLINK-27165
> [4] https://issues.apache.org/jira/browse/FLINK-31745
> [5]
> https://issues.apache.org/jira/browse/FLINK-35040?jql=project%20%3D%20FLINK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20Resolved%2C%20Closed)%20AND%20text%20~%20%22serializerHeavyString%22
> [6]
> http://flink-speed.xyz/timeline/#/?exe=1&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=1000
>
> wt., 21 maj 2024 o 12:50 Rui Fan <1996fan...@gmail.com> napisał(a):
>
>> Hi devs:
>>
>> We (the release managers of Flink 1.20) would like to report a
>> performance regression to the Flink dev mailing list.
>>
>> # Background:
>>
>> The performance of serializerHeavyString has regressed since April 3,
>> and we created FLINK-35040[1] to track it.
>>
>> In brief:
>> - The performance only regresses on JDK 11; Java 8 and Java 17 are fine.
>> - The cause of the regression is the commons-io upgrade from 2.11.0 to
>>   2.15.1.
>>   - This upgrade was done in FLINK-34955[2].
>>   - The performance recovers after reverting commons-io to 2.11.0.
>>
>> You can get more details from FLINK-35040[1].
>>
>> # Problem
>>
>> We generated flame graphs (wall mode) to analyze why upgrading the
>> commons-io version affects the performance. These flame graphs can be
>> found in FLINK-35040[1]. (Many thanks to Zakelly for generating them on
>> the benchmark server.)
>>
>> Unfortunately, we could not find any commons-io code being called. We
>> also checked whether any other dependencies changed as part of the
>> commons-io upgrade. They did not: all other dependencies are exactly the
>> same.
>>
>> # Request
>>
>> After the above analysis, we still cannot explain why the performance of
>> serializerHeavyString started to regress on JDK 11.
>>
>> We are looking forward to hearing valuable suggestions from the Flink
>> community. Thanks, everyone, in advance.
>>
>> Note:
>> 1. I cannot reproduce the regression on my Mac with JDK 11, and we
>>   suspect this regression may be caused by the benchmark environment.
>> 2. We will accept this regression if the issue still cannot be solved.
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-35040
>> [2] https://issues.apache.org/jira/browse/FLINK-34955
>>
>> Best,
>> Weijie, Ufuk, Robert and Rui
>>
>
