Re: [DISCUSS] The performance of serializerHeavyString starts regress since April 3

2024-05-22 Thread Sam Barker
It looks like the most recent run of JDK 11 saw a big improvement [1] in the
performance of the test. That improvement seems related to [2], which is a
fix for FLINK-35215 [3]. That suggests to me that the test isn't as well
isolated to the performance of the code it's trying to test as would be
ideal. However, I've only just started looking at the test suite and trying
to run it locally, so I'm not well placed to judge.

It does, however, suggest that this shouldn't be a blocker for the release.



[1] http://flink-speed.xyz/changes/?rev=c1baf07d76&exe=6&env=3
[2]
https://github.com/apache/flink/commit/c1baf07d7601a683f42997dc35dfaef4e41bc928
[3] https://issues.apache.org/jira/browse/FLINK-35215

On Wed, 22 May 2024 at 00:15, Piotr Nowojski  wrote:

> Hi,
>
> Given that, as you wrote, you have investigated the issue and couldn't find
> any easy explanation, I would suggest closing this ticket as "Won't Do" or
> "Cannot Reproduce" and ignoring the problem.
>
> In the past there have been quite a few cases where some benchmark detected
> a performance regression. Sometimes those can not be reproduced; other
> times (as is the case here) some seemingly unrelated change is causing the
> regression. The same thing has happened in this benchmark many times in the
> past [1], [2], [3], [4]. Generally speaking, this benchmark has been in the
> spotlight a couple of times [5].
>
> Note that there have been cases where this benchmark did detect a
> performance regression :)
>
> My personal suspicion is that after that commons-io version bump, something
> poked the JVM/JIT into compiling the string serialization code a bit
> differently, causing this regression. We have a couple of benchmarks that
> seem to be prone to such semi-intermittent issues. For example, the same
> benchmark was subject to this annoying pattern, which I've spotted in quite
> a few benchmarks over the years [6]:
>
> [image: image.png]
> (https://imgur.com/a/AoygmWS)
>
> In that pattern, benchmark results are very stable within a single JVM
> fork, but between two forks they can reach two different "stable" levels.
> Here it looks like there is a 50% chance of getting a stable "200
> records/ms" and a 50% chance of getting "250 records/ms".
>
> A small interlude. Each of our benchmarks runs in 3 different JVM forks,
> with 10 warm up iterations and 10 measurement iterations. Each iteration
> invokes the benchmarking method for at least one second. So by "very
> stable" results, I mean that, for example, after the 2nd or 3rd warm up
> iteration the results stabilize to within +/-1% and stay at that level for
> the whole duration of the fork.
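> In JMH terms that setup is roughly the following (just a sketch based on
> the numbers above, not the actual flink-benchmarks configuration):
>
>     import java.util.concurrent.TimeUnit;
>
>     import org.openjdk.jmh.annotations.*;
>     import org.openjdk.jmh.infra.Blackhole;
>
>     @Fork(3)                                // 3 separate JVM forks
>     @Warmup(iterations = 10, time = 1)      // 10 warm up iterations, >= 1s each
>     @Measurement(iterations = 10, time = 1) // 10 measurement iterations, >= 1s each
>     @BenchmarkMode(Mode.Throughput)
>     @OutputTimeUnit(TimeUnit.MILLISECONDS)  // scores reported per millisecond
>     public class SerializerHeavyStringSketch {
>         @Benchmark
>         public void serializerHeavyString(Blackhole bh) {
>             // the real benchmark serializes a record with heavy String fields here
>         }
>     }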
>
> Given that we are repeating the same benchmark in 3 different forks, we
> can have by pure chance:
> - 3 slow forks - average 200 records/ms
> - 2 slow forks, 1 fast fork - average 216 r/ms
> - 1 slow fork, 2 fast forks - average 233 r/ms
> - 3 fast forks - average 250 r/ms
>
> So this benchmark is susceptible to entering different semi-stable states.
> As I wrote above, I guess something about the commons-io version bump just
> swayed it into a different semi-stable state :( I have never gotten
> desperate enough to dig further into what exactly is causing this kind of
> issue.
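> Just to spell out the arithmetic behind those four averages (a quick
> sketch, assuming each fork independently settles on either the slow 200
> r/ms or the fast 250 r/ms level):
>
>     public class ForkAverages {
>         public static void main(String[] args) {
>             double slow = 200.0, fast = 250.0;
>             int forks = 3;
>             for (int fastForks = 0; fastForks <= forks; fastForks++) {
>                 // 0 fast forks -> 200.0, 1 -> ~216.7, 2 -> ~233.3, 3 -> 250.0
>                 double avg = (fastForks * fast + (forks - fastForks) * slow) / forks;
>                 System.out.printf("%d fast fork(s): %.1f records/ms%n", fastForks, avg);
>             }
>         }
>     }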
>
> Best,
> Piotrek
>
> [1] https://issues.apache.org/jira/browse/FLINK-18684
> [2] https://issues.apache.org/jira/browse/FLINK-27133
> [3] https://issues.apache.org/jira/browse/FLINK-27165
> [4] https://issues.apache.org/jira/browse/FLINK-31745
> [5]
> https://issues.apache.org/jira/browse/FLINK-35040?jql=project%20%3D%20FLINK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened%2C%20Resolved%2C%20Closed)%20AND%20text%20~%20%22serializerHeavyString%22
> [6]
> http://flink-speed.xyz/timeline/#/?exe=1&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=2&revs=1000
>
On Tue, 21 May 2024 at 12:50, Rui Fan <1996fan...@gmail.com> wrote:
>
>> Hi devs:
>>
>> We (the release managers of Flink 1.20) want to report one performance
>> regression to the Flink dev mailing list.
>>
>> # Background:
>>
>> The performance of serializerHeavyString has regressed since April 3,
>> and we created FLINK-35040 [1] to track it.
>>
>> In brief:
>> - The performance only regresses on JDK 11; Java 8 and Java 17 are fine.
>> - The regression is caused by upgrading the commons-io version from 2.11.0
>> to 2.15.1.
>>   - The upgrade was done in FLINK-34955 [2].
>>   - The performance recovers after reverting the commons-io version to
>> 2.11.0.
>>
>> You can find more details in FLINK-35040 [1].
>>
>> # Problem
>>
>> We tried to generate flame graphs (wall mode) to analyze why upgrading
>> the commons-io version affects the performance. These flame graphs can
>> be found in FLINK-35040 [1]. (Many thanks to Zakelly for generating them
>> on the benchmark server.)
>>
>> Unfortunately, we cannot find any commons-io code being called. We also
>> tried to analyze whether any other dependencies changed during the
>> commons-io version upgrade. The r

Re: [DISCUSS] The performance of serializerHeavyString starts regress since April 3

2024-05-28 Thread Sam Barker
> I guess that improvement is a fluctuation. You can double check the
performance results[1] of the last few days. The performance isn't
recovered.

Hmm, yeah, the improvement was a fluctuation and smaller than I remembered
seeing (maybe I had zoomed in on the timeline too much).

> I fixed an issue related to kryo serialization in FLINK-35215. IIUC,
serializerHeavyString doesn't use the kryo serialization. I try to
run serializerHeavyString demo locally, and didn't see the
kryo serialization related code is called.

I don't see it either, but then again I don't see commons-io in the call
stacks either despite the regression...

I'm continuing to investigate the regression.

On Mon, 27 May 2024 at 20:15, Rui Fan <1996fan...@gmail.com> wrote:

> Thanks Sam for the comment!
>
> > It looks like the most recent run of JDK 11 saw a big improvement of the
> > performance of the test.
>
> I guess that improvement is a fluctuation. You can double-check the
> performance results [1] from the last few days. The performance hasn't
> recovered.
>



>
> > That improvement seems related to the fix for FLINK-35215.
>
> I fixed an issue related to Kryo serialization in FLINK-35215. IIUC,
> serializerHeavyString doesn't use Kryo serialization. I tried to run the
> serializerHeavyString demo locally, and didn't see any Kryo-related
> serialization code being called.
>
> Please correct me if I'm wrong, thanks~
>
> [1]
>
> http://flink-speed.xyz/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=3&revs=200
>
> Best,
> Rui
>
> On Thu, May 23, 2024 at 1:27 PM Sam Barker  wrote:
>
> > It looks like the most recent run of JDK 11 saw a big improvement[1] of
> the
> > performance of the test. That improvement seems related to [2] which is a
> > fix for FLINK-35215 [3]. That suggests to me that the test isn't as
> > isolated to the performance of the code its trying to test as would be
> > ideal. However I've only just started looking at the test suite and
> trying
> > to run locally so I'm not very well placed to judge.
> >
> > It does however suggest that this shouldn't be a blocker for the release.
> >
> >
> >
> > [1] http://flink-speed.xyz/changes/?rev=c1baf07d76&exe=6&env=3
> > [2]
> >
> >
> https://github.com/apache/flink/commit/c1baf07d7601a683f42997dc35dfaef4e41bc928
> > [3] https://issues.apache.org/jira/browse/FLINK-35215
> >
> > On Wed, 22 May 2024 at 00:15, Piotr Nowojski 
> wrote:
> >
> > > Hi,
> > >
> > > Given what you wrote, that you have investigated the issue and couldn't
> > > find any easy explanation, I would suggest closing this ticket as
> "Won't
> > > do" or "Can not reproduce" and ignoring the problem.
> > >
> > > In the past there have been quite a bit of cases where some benchmark
> > > detected a performance regression. Sometimes those can not be
> reproduced,
> > > other times (as it's the case here), some seemingly unrelated change is
> > > causing the regression. The same thing happened in this benchmark many
> > > times in the past [1], [2], [3], [4]. Generally speaking this benchmark
> > has
> > > been in the spotlight a couple of times [5].
> > >
> > > Note that there have been cases where this benchmark did detect a
> > > performance regression :)
> > >
> > > My personal suspicion is that after that commons-io version bump,
> > > something poked JVM/JIT to compile the code a bit differently for
> string
> > > serialization causing this regression. We have a couple of benchmarks
> > that
> > > seem to be prone to such semi intermittent issues. For example the same
> > > benchmark was subject to this annoying pattern [6], that I've spotted
> in
> > > quite a bit of benchmarks over the years [6]:
> > >
> > > [image: image.png]
> > > (https://imgur.com/a/AoygmWS)
> > >
> > > Where benchmark results are very stable within a single JVM fork. But
> > > between two forks, they can reach two different "stable" levels. Here
> it
> > > looks like 50% of the chance of getting stable "200 records/ms" and 50%
> > > chances of "250 records/ms".
> > >
> > > A small interlude. Each of our benchmarks run in 3 different JVM forks,
> > 10
> > > warm up iterations and 10 measurement iterations. Each iteration
> > > lasts/invokes the benchmarking method at least for

Re: [DISCUSS] The performance of serializerHeavyString starts regress since April 3

2024-06-09 Thread Sam Barker
After completing the side quest [1] of enabling async-profiler when running
the JMH benchmarks, I've been unable to reproduce the performance change
between the last known good run and the first run highlighted as a
regression.
Results from my Fedora 40 workstation, using:

# JMH version: 1.37
# VM version: JDK 11.0.23, OpenJDK 64-Bit Server VM, 11.0.23+9
# VM invoker: /home/sam/.sdkman/candidates/java/11.0.23-tem/bin/java
# VM options: -Djava.rmi.server.hostname=127.0.0.1
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.ssl
# Blackhole mode: full + dont-inline hint (auto-detected, use
-Djmh.blackhole.autoDetect=false to disable)
File: /tmp/profile-results/163b9cca6d2/jmh-result.csv

"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SerializationFrameworkMiniBenchmarks.serializerHeavyString","thrpt",1,30,179.453066,5.725733,"ops/ms"
"org.apache.flink.benchmark.SerializationFrameworkMiniBenchmarks.serializerHeavyString:async","thrpt",1,1,NaN,NaN,"---"

File: /tmp/profile-results/f38d8ca43f6/jmh-result.csv

"Benchmark","Mode","Threads","Samples","Score","Score Error (99.9%)","Unit"
"org.apache.flink.benchmark.SerializationFrameworkMiniBenchmarks.serializerHeavyString","thrpt",1,30,178.861842,6.711582,"ops/ms"
"org.apache.flink.benchmark.SerializationFrameworkMiniBenchmarks.serializerHeavyString:async","thrpt",1,1,NaN,NaN,"---"

Where f38d8ca43f6 is the last known good run and 163b9cca6d2 is the first
regression.
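
For anyone wanting to reproduce this, a rough sketch of one way to drive the
benchmark with JMH's built-in async-profiler integration (assuming JMH >= 1.33
with libasyncProfiler on the library path; the actual wiring in [1] may
differ):

    import org.openjdk.jmh.profile.AsyncProfiler;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    public class ProfiledSerializerBenchmark {
        public static void main(String[] args) throws RunnerException {
            Options opts = new OptionsBuilder()
                    .include("SerializationFrameworkMiniBenchmarks.serializerHeavyString")
                    // wall-clock flame graphs, matching those attached to FLINK-35040
                    .addProfiler(AsyncProfiler.class, "event=wall;output=flamegraph")
                    .build();
            new Runner(opts).run();
        }
    }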

One question I have from comparing my local results to those on flink-speed
[2]: is it possible the JDK version changed between the runs? (I don't see
the actual JDK build listed anywhere, so I can't check versions or
distributions.)

I've also tried comparing a Flink build using the java11-target profile
against the default JDK 8 build, and that does not change the performance.

Sam

[1] https://github.com/apache/flink-benchmarks/pull/90
[2]
https://flink-speed.xyz/timeline/#/?exe=6&ben=serializerHeavyString&extr=on&quarts=on&equid=off&env=3&revs=200

On Wed, 29 May 2024 at 16:53, Sam Barker  wrote:

> > I guess that improvement is a fluctuation. You can double check the
> performance results[1] of the last few days. The performance isn't
> recovered.
>
> Hmm yeah the improvement was a fluctuation and smaller than I remembered
> seeing (maybe I had zoomed into the timeline too much).
>
> > I fixed an issue related to kryo serialization in FLINK-35215. IIUC,
> serializerHeavyString doesn't use the kryo serialization. I try to
> run serializerHeavyString demo locally, and didn't see the
> kryo serialization related code is called.
>
> I don't see it either, but then again I don't see commons-io in the call
> stacks either despite the regression...
>
> I'm continuing to investigate the regression.
>
> On Mon, 27 May 2024 at 20:15, Rui Fan <1996fan...@gmail.com> wrote:
>
>> Thanks Sam for the comment!
>>
>> > It looks like the most recent run of JDK 11 saw a big improvement of the
>> > performance of the test.
>>
>> I guess that improvement is a fluctuation. You can double check the
>> performance results[1] of the last few days. The performance isn't
>> recovered.
>>
>
>
>
>>
>> > That improvement seems related

Re: [VOTE] Apache Flink Kubernetes Operator Release 1.10.0, release candidate #1

2024-10-17 Thread Sam Barker
+1 (non-binding)

I've verified:
- The src tarball builds and passes mvn verify (with Java 17)
- The sha512 checksums for both the src and helm artifacts
- The GPG signatures
- All poms are at version 1.10.0
- No poms contain -SNAPSHOT dependencies
- The chart and appVersions from the helm tarball are 1.10.0
- The chart points at the appropriate image
- The RC repo works as a helm repo

On Wed, 16 Oct 2024 at 05:18, Őrhidi Mátyás  wrote:

> Hi everyone,
>
> Please review and vote on the release candidate #1 for the version 1.10.0
> of Apache Flink Kubernetes Operator,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> **Release Overview**
>
> As an overview, the release consists of the following:
> a) Kubernetes Operator canonical source distribution (including the
> Dockerfile), to be deployed to the release repository at dist.apache.org
> b) Kubernetes Operator Helm Chart to be deployed to the release repository
> at dist.apache.org
> c) Maven artifacts to be deployed to the Maven Central Repository
> d) Docker image to be pushed to dockerhub
>
> **Staging Areas to Review**
>
> The staging areas containing the above mentioned artifacts are as follows,
> for your review:
> * All artifacts for a,b) can be found in the corresponding dev repository
> at dist.apache.org [1]
> * All artifacts for c) can be found at the Apache Nexus Repository [2]
> * The docker image for d) is staged on github [3]
>
> All artifacts are signed with the key 48E78F054AA33CB5 [4]
>
> Other links for your review:
> * JIRA release notes [5]
> * source code tag "release-1.10.0-rc1" [6]
> * PR to update the website Downloads page to include Kubernetes Operator
> links [7]
>
> **Vote Duration**
>
> The voting time will run for at least 72 hours.
> It is adopted by majority approval, with at least 3 PMC affirmative votes.
>
> **Note on Verification**
>
> You can follow the basic verification guide here[8].
> Note that you don't need to verify everything yourself, but please make
> note of what you have tested together with your +- vote.
>
> Thanks,
> Matyas
>
> [1]
>
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.10.0-rc1/
> [2] https://repository.apache.org/content/repositories/orgapacheflink-1762
>  [3] ghcr.io/apache/flink-kubernetes-operator:c703255
>  [4]https://dist.apache.org/repos/dist/release/flink/KEYS
> [5]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12354833
> [6]
>
> https://github.com/apache/flink-kubernetes-operator/releases/tag/release-1.10.0-rc1
> [7] https://github.com/apache/flink-web/pull/758
>  [8]
>
> https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
>


[jira] [Created] (FLINK-36031) Migrate kubernetes operator metrics from OKHttp Interceptor API to Fabric8 Interceptor API

2024-08-11 Thread Sam Barker (Jira)
Sam Barker created FLINK-36031:
--

 Summary: Migrate kubernetes operator metrics from OKHttp 
Interceptor API to Fabric8 Interceptor API
 Key: FLINK-36031
 URL: https://issues.apache.org/jira/browse/FLINK-36031
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.9.0
Reporter: Sam Barker
 Fix For: kubernetes-operator-1.10.0


I noticed [this 
comment|https://github.com/apache/flink-kubernetes-operator/blob/8479f2267620cf6cb7a5cda981a769b8a0262184/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/KubernetesClientUtils.java#L56-L58]
 while investigating something in the operator code base, and followed through 
to check whether the Fabric8 interceptor API was ready. It wasn't quite ready 
at the time, but this [kubernetes-client 
PR|https://github.com/fabric8io/kubernetes-client/pull/6144] has now been 
merged and released as part of Fabric8 kubernetes-client 
[6.13.2|https://github.com/fabric8io/kubernetes-client/issues/6097].

This change can now be made in the operator, and I believe it is a pure 
refactoring as it has no externally visible effects.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-36471) Flink Operator uses out of date Docker actions.

2024-10-10 Thread Sam Barker (Jira)
Sam Barker created FLINK-36471:
--

 Summary: Flink Operator uses out of date Docker actions.
 Key: FLINK-36471
 URL: https://issues.apache.org/jira/browse/FLINK-36471
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Sam Barker
Assignee: Sam Barker
 Fix For: kubernetes-operator-1.10.0


The operator test suite is full of warnings about using out-of-date GitHub 
actions based on Node 12; here is an example [picked at 
random|https://github.com/apache/flink-kubernetes-operator/actions/runs/10915107524].

The problematic actions are:
* {{actions/checkout@v2}}
* {{actions/setup-java@v2}} 
* {{actions/cache@v3}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-36473) Operator metrics tests fail intermittently

2024-10-10 Thread Sam Barker (Jira)
Sam Barker created FLINK-36473:
--

 Summary: Operator metrics tests fail intermittently
 Key: FLINK-36473
 URL: https://issues.apache.org/jira/browse/FLINK-36473
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Reporter: Sam Barker


{{RestApiMetricsCollectorTest}} and {{KubernetesClientMetricsTest}} fail 
intermittently: because the metrics are lazily initialised, the assertions can 
be executed before the metric initialisation has been propagated.
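
One possible fix (a sketch only; Awaitility and the accessor below are 
assumptions, not the actual test code) is to poll the assertion until the 
lazily initialised metric becomes visible:

{code:java}
import static org.assertj.core.api.Assertions.assertThat;
import static org.awaitility.Awaitility.await;

import java.time.Duration;
import org.junit.jupiter.api.Test;

class KubernetesClientMetricsPollingExample {

    @Test
    void requestCounterEventuallyIncremented() {
        // ... issue a request through the instrumented client here ...

        // Poll instead of asserting immediately, so the lazily initialised
        // metric has time to be registered and propagated.
        await().atMost(Duration.ofSeconds(5))
               .untilAsserted(() -> assertThat(requestCount()).isEqualTo(1L));
    }

    // Hypothetical accessor standing in for however the real test reads the metric.
    private long requestCount() {
        return 1L;
    }
}
{code}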



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-36460) Refactor Kubernetes Operator Edge to Edge test setup

2024-10-09 Thread Sam Barker (Jira)
Sam Barker created FLINK-36460:
--

 Summary: Refactor Kubernetes Operator Edge to Edge test setup
 Key: FLINK-36460
 URL: https://issues.apache.org/jira/browse/FLINK-36460
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Sam Barker


The edge to edge operator tests are getting difficult to control and evolve, 
as was noticed in FLINK-36332.

We should try to find clear seams along which to break the test suite up, to 
avoid running lots of extra tests or creating mutually exclusive include and 
exclude conditions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-36392) Varying the Java version in the Operator CI has no effect

2024-09-26 Thread Sam Barker (Jira)
Sam Barker created FLINK-36392:
--

 Summary: Varying the Java version in the Operator CI has no effect
 Key: FLINK-36392
 URL: https://issues.apache.org/jira/browse/FLINK-36392
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Reporter: Sam Barker


[FLINK-33471|https://issues.apache.org/jira/browse/FLINK-33471] and 
[FLINK-33359|https://issues.apache.org/jira/browse/FLINK-33359] both modified 
the Kubernetes Operator edge to edge test suite to support a variety of Java 
versions in GitHub CI. However, as far as I can tell, this well-intentioned 
move hasn't actually had the desired effect, because the JVM which actually 
runs the Maven build of the operator is executed within a Docker container 
based on 
[temurin-11|https://github.com/apache/flink-kubernetes-operator/blob/d946f3f9f3a7f12098cd82db2545de7c89e220ff/Dockerfile#L19].

Therefore all the tests are actually executed by a Java operator built and 
running under JDK/JRE 11.
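
A quick way to confirm which JVM actually executes the operator under test (a 
sketch; where best to log this is open for discussion) is to log the runtime 
version at startup, e.g.:

{code:java}
// Logs the JVM that is actually executing the code, making the effective Java
// version visible in CI regardless of what the workflow matrix requests.
public final class JvmVersionLogger {
    public static void main(String[] args) {
        System.out.printf(
                "Running on %s %s (Runtime.version(): %s)%n",
                System.getProperty("java.vm.name"),
                System.getProperty("java.version"),
                Runtime.version());
    }
}
{code}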



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-36393) Flink Operator uses out of date GitHub actions.

2024-09-26 Thread Sam Barker (Jira)
Sam Barker created FLINK-36393:
--

 Summary: Flink Operator uses out of date GitHub actions.
 Key: FLINK-36393
 URL: https://issues.apache.org/jira/browse/FLINK-36393
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Sam Barker


The operator test suite is full of warnings about using out-of-date GitHub 
actions based on Node 12; here is an example [picked at 
random|https://github.com/apache/flink-kubernetes-operator/actions/runs/10915107524].

The problematic actions are:
* {{actions/checkout@v2}}
* {{actions/setup-java@v2}} 
* {{actions/cache@v3}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-36332) Allow the Operator http client to be customised

2024-09-19 Thread Sam Barker (Jira)
Sam Barker created FLINK-36332:
--

 Summary: Allow the Operator http client to be customised
 Key: FLINK-36332
 URL: https://issues.apache.org/jira/browse/FLINK-36332
 Project: Flink
  Issue Type: Improvement
Reporter: Sam Barker


We are looking to produce a build of the Flink Kubernetes operator; however, 
for internal policy reasons we need to exclude the Kotlin dependencies.

Kotlin is a transitive dependency of OkHttp, and now that 
https://issues.apache.org/jira/browse/FLINK-36031 has been merged, OkHttp is 
entirely optional (but a sensible default). The Fabric8 project explicitly 
supports supplying alternative HTTP clients (see 
https://github.com/fabric8io/kubernetes-client/blob/main/doc/FAQ.md#what-artifacts-should-my-project-depend-on), 
and the common pattern, as demonstrated by the 
[java-operator-sdk|https://github.com/operator-framework/java-operator-sdk/blob/24494cb6342a5c75dff9a6962156ff488ad0c818/pom.xml#L44], 
is to define a property with the name of the client implementation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)