Re: [DISCUSS] The performance of serializerHeavyString starts regress since April 3
It looks like the most recent run of JDK 11 saw a big improvement[1] in the
performance of the test. That improvement seems related to [2], which is a fix
for FLINK-35215 [3]. That suggests to me that the test isn't as well isolated
to the performance of the code it's trying to test as would be ideal. However,
I've only just started looking at the test suite and trying to run it locally,
so I'm not very well placed to judge.

It does, however, suggest that this shouldn't be a blocker for the release.

[1] http://flink-speed.xyz/changes/?rev=c1baf07d76&exe=6&env=3
[2] https://github.com/apache/flink/commit/c1baf07d7601a683f42997dc35dfaef4e41bc928
[3] https://issues.apache.org/jira/browse/FLINK-35215

On Wed, 22 May 2024 at 00:15, Piotr Nowojski wrote:

> Hi,
>
> Given what you wrote, that you have investigated the issue and couldn't
> find any easy explanation, I would suggest closing this ticket as "Won't
> do" or "Can not reproduce" and ignoring the problem.
>
> In the past there have been quite a few cases where some benchmark
> detected a performance regression. Sometimes those can not be reproduced;
> other times (as is the case here), some seemingly unrelated change is
> causing the regression. The same thing happened in this benchmark many
> times in the past [1], [2], [3], [4]. Generally speaking, this benchmark
> has been in the spotlight a couple of times [5].
>
> Note that there have been cases where this benchmark did detect a
> performance regression :)
>
> My personal suspicion is that after that commons-io version bump,
> something poked the JVM/JIT to compile the code a bit differently for
> string serialization, causing this regression. We have a couple of
> benchmarks that seem to be prone to such semi-intermittent issues. For
> example, the same benchmark was subject to this annoying pattern [6],
> which I've spotted in quite a few benchmarks over the years:
>
> [image: image.png]
> (https://imgur.com/a/AoygmWS)
>
> Benchmark results are very stable within a single JVM fork, but between
> two forks they can reach two different "stable" levels. Here it looks
> like there is a 50% chance of getting a stable "200 records/ms" and a 50%
> chance of "250 records/ms".
>
> A small interlude: each of our benchmarks runs in 3 different JVM forks,
> with 10 warm-up iterations and 10 measurement iterations. Each iteration
> lasts/invokes the benchmarking method for at least one second. So by
> "very stable" results, I mean that, for example, after the 2nd or 3rd
> warm-up iteration, the results stabilize to within +/-1% and stay at that
> level for the whole duration of the fork.
>
> Given that we are repeating the same benchmark in 3 different forks, we
> can have, by pure chance:
> - 3 slow forks - total average 200 records/ms
> - 2 slow forks, 1 fast fork - average 216 r/ms
> - 1 slow fork, 2 fast forks - average 233 r/ms
> - 3 fast forks - average 250 r/ms
>
> So this benchmark is susceptible to entering different semi-stable
> states. As I wrote above, I guess something about the commons-io version
> bump just swayed it to a different semi-stable state :( I have never
> gotten desperate enough to actually dig further into what exactly causes
> this kind of issue.
>
> Best,
> Piotrek
>
> [1] https://issues.apache.org/jira/browse/FLINK-18684
> [2] https://issues.apache.org/jira/browse/FLINK-27133
> [3] https://issues.apache.org/jira/browse/FLINK-27165
> [4] https://issues.apache.org/jira/browse/FLINK-31745
> [5] https://issues.apache.org/jira/browse/FLINK-35040?jql=project%20%3D%20FLINK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20Resolved%2C%20Closed)%20AND%20text%20~%20%22serializerHeavyString%22
> [6] http://flink-speed.xyz/timeline/#/?exe=1&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=1000
>
> On Tue, 21 May 2024 at 12:50, Rui Fan <1996fan...@gmail.com> wrote:
>
>> Hi devs:
>>
>> We (the release managers of Flink 1.20) want to report one performance
>> regression to the Flink dev mailing list.
>>
>> # Background:
>>
>> The performance of serializerHeavyString started to regress on April 3,
>> and we created FLINK-35040[1] to track it.
>>
>> In brief:
>> - The performance only regresses on JDK 11; Java 8 and Java 17 are fine.
>> - The regression is caused by upgrading the commons-io version from
>> 2.11.0 to 2.15.1.
>> - The upgrade was done in FLINK-34955[2].
>> - The performance recovers after reverting the commons-io version
>> to 2.11.0.
>>
>> You can get more details from FLINK-35040[1].
>>
>> # Problem
>>
>> We tried generating flame graphs (wall mode) to analyze why upgrading
>> the commons-io version affects the performance. These flame graphs can
>> be found in FLINK-35040[1]. (Many thanks to Zakelly for generating them
>> on the benchmark server.)
>>
>> Unfortunately, we cannot find any commons-io code being called. We also
>> tried to analyze whether any other dependencies changed during the
>> commons-io upgrade. The r
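Piotrek's four fork combinations above can be reproduced with a few lines of arithmetic. This is a minimal sketch (not Flink or flink-benchmarks code), assuming the two semi-stable per-fork levels of 200 and 250 records/ms from his example:

```java
// Average throughput reported over 3 JMH forks when a benchmark has two
// semi-stable per-fork levels (assumed here: 200 and 250 records/ms).
public class ForkAverages {
    static final double SLOW = 200.0;
    static final double FAST = 250.0;

    // Run-level average when `fastForks` of the 3 forks land on the fast level.
    static double average(int fastForks) {
        return ((3 - fastForks) * SLOW + fastForks * FAST) / 3.0;
    }

    public static void main(String[] args) {
        for (int k = 0; k <= 3; k++) {
            System.out.printf("%d fast fork(s) -> average %.0f records/ms%n",
                    k, average(k));
        }
    }
}
```

With a fair 50/50 coin per fork, the two mixed outcomes (one or two fast forks) are each three times as likely as the all-slow or all-fast runs, which matches the spread of run-to-run averages described above.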
Re: [DISCUSS] The performance of serializerHeavyString starts regress since April 3
> I guess that improvement is a fluctuation. You can double check the
> performance results[1] of the last few days. The performance isn't
> recovered.

Hmm, yeah, the improvement was a fluctuation, and smaller than I remembered
seeing (maybe I had zoomed in on the timeline too much).

> I fixed an issue related to kryo serialization in FLINK-35215. IIUC,
> serializerHeavyString doesn't use the kryo serialization. I try to
> run serializerHeavyString demo locally, and didn't see the
> kryo serialization related code is called.

I don't see it either, but then again I don't see commons-io in the call
stacks either, despite the regression...

I'm continuing to investigate the regression.

On Mon, 27 May 2024 at 20:15, Rui Fan <1996fan...@gmail.com> wrote:

> Thanks Sam for the comment!
>
> > It looks like the most recent run of JDK 11 saw a big improvement of the
> > performance of the test.
>
> I guess that improvement is a fluctuation. You can double check the
> performance results[1] of the last few days. The performance isn't
> recovered.
>
> > That improvement seems related to which is a fix for FLINK-35215.
>
> I fixed an issue related to kryo serialization in FLINK-35215. IIUC,
> serializerHeavyString doesn't use the kryo serialization. I try to
> run serializerHeavyString demo locally, and didn't see the
> kryo serialization related code is called.
>
> Please correct me if I'm wrong, thanks~
>
> [1] http://flink-speed.xyz/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=3&revs=200
>
> Best,
> Rui
Re: [DISCUSS] The performance of serializerHeavyString starts regress since April 3
After completing the side quest of enabling the async profiler when running
the JMH benchmarks[1], I've been unable to reproduce the performance change
between the last known good run and the first run highlighted as a
regression.

Results from my Fedora 40 workstation using:

# JMH version: 1.37
# VM version: JDK 11.0.23, OpenJDK 64-Bit Server VM, 11.0.23+9
# VM invoker: /home/sam/.sdkman/candidates/java/11.0.23-tem/bin/java
# VM options: -Djava.rmi.server.hostname=127.0.0.1 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.ssl
# Blackhole mode: full + dont-inline hint (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)

File: /tmp/profile-results/163b9cca6d2/jmh-result.csv
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SerializationFrameworkMiniBenchmarks.serializerHeavyString","thrpt",1,30,179.453066,5.725733,"ops/ms"
"org.apache.flink.benchmark.SerializationFrameworkMiniBenchmarks.serializerHeavyString:async","thrpt",1,1,NaN,NaN,"---"

File: /tmp/profile-results/f38d8ca43f6/jmh-result.csv
"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SerializationFrameworkMiniBenchmarks.serializerHeavyString","thrpt",1,30,178.861842,6.711582,"ops/ms"
"org.apache.flink.benchmark.SerializationFrameworkMiniBenchmarks.serializerHeavyString:async","thrpt",1,1,NaN,NaN,"---"

where f38d8ca43f6 is the last known good run and 163b9cca6d2 is the first
regression.

One question I have from comparing my local results to those on
flink-speed[2]: is it possible the JDK version changed between the runs?
(I don't see the actual JDK build listed anywhere, so I can't check versions
or distributions.)

I've also tried comparing building Flink with the java11-target profile vs
the default JDK 8 build, and that does not change the performance.

Sam

[1] https://github.com/apache/flink-benchmarks/pull/90
[2] https://flink-speed.xyz/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=3&revs=200
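For what it's worth, the two Score values above differ by well under their 99.9% error bars. A throwaway sketch (not part of flink-benchmarks) that pulls the Score column out of the two CSV rows; it assumes a plain comma split is safe, which holds here because none of the fields contain embedded commas:

```java
// Compare the "Score" column of two JMH CSV result rows.
public class JmhScoreDiff {
    // Extract the Score column (index 4) from a JMH CSV data row. JMH quotes
    // string columns but leaves numeric ones bare, and no value here contains
    // an embedded comma, so a plain split is sufficient for this sketch.
    static double score(String csvRow) {
        return Double.parseDouble(csvRow.split(",")[4]);
    }

    public static void main(String[] args) {
        // Rows abbreviated from the two jmh-result.csv files above.
        double firstRegression =
                score("\"...serializerHeavyString\",\"thrpt\",1,30,179.453066,5.725733,\"ops/ms\"");
        double lastKnownGood =
                score("\"...serializerHeavyString\",\"thrpt\",1,30,178.861842,6.711582,\"ops/ms\"");
        double deltaPct = 100.0 * (firstRegression - lastKnownGood) / lastKnownGood;
        System.out.printf("delta: %.2f%%%n", deltaPct);
    }
}
```

The delta comes out around a third of a percent, far inside the +/-5.7 and +/-6.7 ops/ms error bounds, consistent with the regression not reproducing locally.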
Re: [VOTE] Apache Flink Kubernetes Operator Release 1.10.0, release candidate #1
+1 (non binding)

I've verified:
- The src tarball builds and passes mvn verify (with Java 17)
- The sha512 signatures for both srcs and helm
- The GPG signatures
- Checked all poms are version 1.10.0
- Checked all poms do not contain -SNAPSHOT dependencies
- Verified that the chart and appVersions from the helm tarball are 1.10.0
- Verified the chart points at the appropriate image
- Verified that the RC repo works as a helm repo

On Wed, 16 Oct 2024 at 05:18, Őrhidi Mátyás wrote:

> Hi everyone,
>
> Please review and vote on the release candidate #1 for the version 1.10.0
> of Apache Flink Kubernetes Operator, as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> **Release Overview**
>
> As an overview, the release consists of the following:
> a) Kubernetes Operator canonical source distribution (including the
> Dockerfile), to be deployed to the release repository at dist.apache.org
> b) Kubernetes Operator Helm Chart to be deployed to the release repository
> at dist.apache.org
> c) Maven artifacts to be deployed to the Maven Central Repository
> d) Docker image to be pushed to dockerhub
>
> **Staging Areas to Review**
>
> The staging areas containing the above mentioned artifacts are as follows,
> for your review:
> * All artifacts for a) and b) can be found in the corresponding dev
> repository at dist.apache.org [1]
> * All artifacts for c) can be found at the Apache Nexus Repository [2]
> * The docker image for d) is staged on github [3]
>
> All artifacts are signed with the key 48E78F054AA33CB5 [4]
>
> Other links for your review:
> * JIRA release notes [5]
> * source code tag "release-1.10.0-rc1" [6]
> * PR to update the website Downloads page to include Kubernetes Operator
> links [7]
>
> **Vote Duration**
>
> The voting time will run for at least 72 hours.
> It is adopted by majority approval, with at least 3 PMC affirmative votes.
>
> **Note on Verification**
>
> You can follow the basic verification guide here[8].
> Note that you don't need to verify everything yourself, but please make
> note of what you have tested together with your +/- vote.
>
> Thanks,
> Matyas
>
> [1] https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.10.0-rc1/
> [2] https://repository.apache.org/content/repositories/orgapacheflink-1762
> [3] ghcr.io/apache/flink-kubernetes-operator:c703255
> [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> [5] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12354833
> [6] https://github.com/apache/flink-kubernetes-operator/releases/tag/release-1.10.0-rc1
> [7] https://github.com/apache/flink-web/pull/758
> [8] https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
[jira] [Created] (FLINK-36031) Migrate kubernetes operator metrics from OKHttp Interceptor API to Fabric8 Interceptor API
Sam Barker created FLINK-36031:
----------------------------------

             Summary: Migrate kubernetes operator metrics from OKHttp Interceptor API to Fabric8 Interceptor API
                 Key: FLINK-36031
                 URL: https://issues.apache.org/jira/browse/FLINK-36031
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.9.0
            Reporter: Sam Barker
             Fix For: kubernetes-operator-1.10.0


I noticed [this comment|https://github.com/apache/flink-kubernetes-operator/blob/8479f2267620cf6cb7a5cda981a769b8a0262184/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/KubernetesClientUtils.java#L56-L58] while investigating something in the operator code base, and followed it up to check whether the Fabric8 interceptor API was ready. It wasn't quite ready at the time, but this [kubernetes-client PR|https://github.com/fabric8io/kubernetes-client/pull/6144] has now been merged and released as part of Fabric8 kubernetes-client [6.13.2|https://github.com/fabric8io/kubernetes-client/issues/6097]. The change can now be made in the operator, and I believe it is a pure refactoring, as it has no externally visible effects.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
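As a rough illustration of the target shape (not operator code): Fabric8's `io.fabric8.kubernetes.client.http.Interceptor` exposes before/after hooks around each HTTP call, which is enough to keep the operator's request and response counters. The interface below is a locally-defined stand-in with assumed, simplified signatures; the real method signatures should be checked against the kubernetes-client 6.13.2 javadoc.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: a locally-defined stand-in for an HTTP interceptor.
// Fabric8's real io.fabric8.kubernetes.client.http.Interceptor has a similar
// before/after shape, but with richer request/response types.
public class MetricsInterceptorSketch {
    interface Interceptor {
        default void before(String method, String uri) {}
        default void after(String method, String uri, int statusCode) {}
    }

    // Counts requests and 5xx responses, mirroring the kind of metrics the
    // operator currently collects via the OkHttp interceptor API.
    static final class CountingInterceptor implements Interceptor {
        final AtomicLong requests = new AtomicLong();
        final AtomicLong serverErrors = new AtomicLong();

        @Override
        public void before(String method, String uri) {
            requests.incrementAndGet();
        }

        @Override
        public void after(String method, String uri, int statusCode) {
            if (statusCode >= 500) {
                serverErrors.incrementAndGet();
            }
        }
    }

    public static void main(String[] args) {
        CountingInterceptor interceptor = new CountingInterceptor();
        interceptor.before("GET", "/apis/flink.apache.org/v1beta1/flinkdeployments");
        interceptor.after("GET", "/apis/flink.apache.org/v1beta1/flinkdeployments", 200);
        System.out.println("requests=" + interceptor.requests.get()
                + " serverErrors=" + interceptor.serverErrors.get());
    }
}
```

Because the counting logic only needs these two hooks, porting it from the OkHttp `Interceptor` to the Fabric8 one should indeed be externally invisible.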
[jira] [Created] (FLINK-36471) Flink Operator uses out of date Docker actions.
Sam Barker created FLINK-36471:
----------------------------------

             Summary: Flink Operator uses out of date Docker actions.
                 Key: FLINK-36471
                 URL: https://issues.apache.org/jira/browse/FLINK-36471
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
            Reporter: Sam Barker
            Assignee: Sam Barker
             Fix For: kubernetes-operator-1.10.0


The operator test suite is full of warnings about using out-of-date GitHub actions based on Node 12; an example [picked at random|https://github.com/apache/flink-kubernetes-operator/actions/runs/10915107524].

The problematic actions are:
* {{actions/checkout@v2}}
* {{actions/setup-java@v2}}
* {{actions/cache@v3}}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (FLINK-36473) Operator metrics tests fail intermittently
Sam Barker created FLINK-36473:
----------------------------------

             Summary: Operator metrics tests fail intermittently
                 Key: FLINK-36473
                 URL: https://issues.apache.org/jira/browse/FLINK-36473
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
            Reporter: Sam Barker


{{RestApiMetricsCollectorTest}} & {{KubernetesClientMetricsTest}} fail intermittently: because the metrics are lazily initialised, the assertions can execute before the metric initialisation has been propagated.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
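One common fix for assertions racing lazy initialisation is to poll the condition until it holds or a timeout expires, instead of asserting immediately. This is a hand-rolled sketch of that idea, not the operator's actual test utilities; test frameworks such as Awaitility package up the same pattern:

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Minimal polling helper: re-check a condition until it holds or a timeout
// expires, absorbing the propagation delay of lazily-initialised metrics.
public final class AwaitAssert {
    private AwaitAssert() {}

    public static boolean await(BooleanSupplier condition, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(10); // brief back-off before re-checking
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // give up if interrupted
                return false;
            }
        }
        return condition.getAsBoolean(); // one final check at the deadline
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // Condition becomes true ~20ms in, simulating delayed metric registration.
        boolean ok = await(() -> System.nanoTime() - start > 20_000_000L,
                Duration.ofSeconds(1));
        System.out.println("condition eventually held: " + ok);
    }
}
```

A test would then assert `await(() -> metricExists(...), timeout)` rather than checking the metric registry in the very next statement after the request.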
[jira] [Created] (FLINK-36460) Refactor Kubernetes Operator Edge to Edge test setup
Sam Barker created FLINK-36460:
----------------------------------

             Summary: Refactor Kubernetes Operator Edge to Edge test setup
                 Key: FLINK-36460
                 URL: https://issues.apache.org/jira/browse/FLINK-36460
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
            Reporter: Sam Barker


The edge to edge operator tests are getting difficult to control and evolve, as was noticed in FLINK-36332. We should try to find clear seams along which to break the test suite up, to avoid running lots of extra tests or creating mutually exclusive include and exclude conditions.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (FLINK-36392) Varying the Java version in the Operator CI has no effect
Sam Barker created FLINK-36392:
----------------------------------

             Summary: Varying the Java version in the Operator CI has no effect
                 Key: FLINK-36392
                 URL: https://issues.apache.org/jira/browse/FLINK-36392
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
            Reporter: Sam Barker


[FLINK-33471|https://issues.apache.org/jira/browse/FLINK-33471] & [FLINK-33359|https://issues.apache.org/jira/browse/FLINK-33359] both modified the Kubernetes Operator edge to edge test suite to support a variety of Java versions in GitHub CI. However, as far as I can tell, this well-intentioned move hasn't actually had the desired effect, as the JVM that actually runs the Maven build of the operator executes within the context of a Docker container based on [temurin-11|https://github.com/apache/flink-kubernetes-operator/blob/d946f3f9f3a7f12098cd82db2545de7c89e220ff/Dockerfile#L19]. Therefore all the tests are actually executed against an operator built and running under JDK/JRE 11.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (FLINK-36393) Flink Operator uses out of date GitHub actions.
Sam Barker created FLINK-36393:
----------------------------------

             Summary: Flink Operator uses out of date GitHub actions.
                 Key: FLINK-36393
                 URL: https://issues.apache.org/jira/browse/FLINK-36393
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
            Reporter: Sam Barker


The operator test suite is full of warnings about using out-of-date GitHub actions based on Node 12; an example [picked at random|https://github.com/apache/flink-kubernetes-operator/actions/runs/10915107524].

The problematic actions are:
* {{actions/checkout@v2}}
* {{actions/setup-java@v2}}
* {{actions/cache@v3}}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (FLINK-36332) Allow the Operator http client to be customised
Sam Barker created FLINK-36332:
----------------------------------

             Summary: Allow the Operator http client to be customised
                 Key: FLINK-36332
                 URL: https://issues.apache.org/jira/browse/FLINK-36332
             Project: Flink
          Issue Type: Improvement
            Reporter: Sam Barker


We are looking to produce a build of the Flink Kubernetes operator; however, for internal policy reasons we need to exclude the Kotlin dependencies. Kotlin is a transitive dependency of OkHttp, and now that https://issues.apache.org/jira/browse/FLINK-36031 has been merged, OkHttp is entirely optional (but a sensible default). The Fabric8 project explicitly supports supplying alternative http clients (see https://github.com/fabric8io/kubernetes-client/blob/main/doc/FAQ.md#what-artifacts-should-my-project-depend-on), and the common pattern, as demonstrated by the [java-operator-sdk|https://github.com/operator-framework/java-operator-sdk/blob/24494cb6342a5c75dff9a6962156ff488ad0c818/pom.xml#L44], is to define a property with the name of the client implementation.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)