Re: Slow Tests in Flink 1.15

Maciek Próchniak Fri, 09 Sep 2022 04:34:07 -0700

Hi,

we also had similar problems in Nussknacker recently (tests on fake sources).
My colleague found out it's due to the ENABLE_CHECKPOINTS_AFTER_FINISH flag
(https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/#waiting-for-the-final-checkpoint-before-task-exit),
which is set to true by default in 1.15. After the fake source ends, the job
waits for the next checkpoint to be triggered before finishing. In the end we
reduced the checkpoint interval in some places and disabled checkpoints
altogether in some other tests.
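
A minimal sketch of two related workarounds (switching the flag off, or shortening the checkpoint interval), written as plain configuration key/value pairs so it does not need Flink on the classpath. The two keys are, to my knowledge, the documented names for the final-checkpoint option and the checkpoint interval; the class name, method names and the idea of collecting the overrides in a map are hypothetical and only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: configuration overrides for tests with finite (fake) sources. */
final class CheckpointTestOverrides {

    // Key behind the "checkpoints after tasks finish" feature (assumed from the Flink docs).
    static final String AFTER_FINISH_KEY =
            "execution.checkpointing.checkpoints-after-tasks-finish.enabled";

    // Key for the periodic checkpoint interval (assumed from the Flink docs).
    static final String INTERVAL_KEY = "execution.checkpointing.interval";

    private CheckpointTestOverrides() {}

    /** Workaround 1: do not wait for a final checkpoint once the sources finish. */
    static Map<String, String> disableFinalCheckpointWait() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put(AFTER_FINISH_KEY, "false");
        return conf;
    }

    /** Workaround 2: keep the feature, but checkpoint often so the wait stays short. */
    static Map<String, String> shortCheckpointInterval(long millis) {
        if (millis <= 0) {
            throw new IllegalArgumentException("interval must be positive: " + millis);
        }
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put(INTERVAL_KEY, millis + "ms");
        return conf;
    }
}
```

In a test, such a map could be turned into a Flink Configuration (for example via Configuration.fromMap) and handed to the execution environment or the MiniCluster configuration; how exactly it is wired in depends on the test setup.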


maciek

On 09.09.2022 10:44, David Jost wrote:

Hey, sorry for not coming back to this earlier, but I was hoping to better 
isolate the problem for analysis.

Maybe for comparison to Alexey's case: We have four different pipelines, which
are all built similarly. Though we use Kafka in the actual jobs, the tests use
fake sources and sinks. Only the tests which use the
MiniClusterWithClientResource are affected (we have one badly written test,
which runs the pipeline without a MiniCluster, and it is not affected).

I analysed it a bit and found that the job hangs for about 30s after the last
event has been pushed out of the sink. Summing up these 30s for each (job) test
accounts for the additional time used by all tests in the end.
So I would assume that there is some kind of wind-down timer or similar that
holds everything up.

I hope, this is helpful somehow. I would love to find the source of this issue. 
I was hoping to isolate the issue in an MWE, but have been unsuccessful for now.

NB: I ran the tests with both MiniClusterWithClientResource (with adjustments
for JUnit 5) and MiniClusterExtension, but there was no noticeable difference.

On 7. Sep 2022, at 14:41, Alexey Trenikhun <yen...@msn.com> wrote:

The class contains a single test method, which runs a single job (the job is
quite complex), then waits for the job to be running, and after that waits for
data to be populated in the output topic; this doesn't happen within 5 minutes
(test timeout). I tried it under a debugger and set a breakpoint in the Kafka
record deserializer: it is hit, but very slowly, roughly 3 records per 5
minutes (the topic was pre-populated).

No table/sql API, only stream API
From: Chesnay Schepler <ches...@apache.org>
Sent: Wednesday, September 7, 2022 5:20 AM
To: Alexey Trenikhun <yen...@msn.com>; David Jost <david.j...@uniberg.com>; Matthias 
Pohl <matthias.p...@aiven.io>
Cc: user@flink.apache.org <user@flink.apache.org>
Subject: Re: Slow Tests in Flink 1.15

The test that has gotten slow: how many test cases does it actually contain / how many jobs does it actually run?

Are these tests using the table/sql API?

On 07/09/2022 14:15, Alexey Trenikhun wrote:

We are also observing an extreme slowdown (5+ minutes vs 15 seconds) in 1 of 2
integration tests. Both tests use Kafka. The slow test uses
org.apache.flink.runtime.minicluster.TestingMiniCluster; it tests a complete
job, which consumes and produces Kafka messages. The unaffected test extends
org.apache.flink.test.util.AbstractTestBase, which uses
MiniClusterWithClientResource; this test is simpler and only produces Kafka
messages.

Thanks,
Alexey
From: Matthias Pohl via user <user@flink.apache.org>
Sent: Tuesday, September 6, 2022 6:36 AM
To: David Jost <david.j...@uniberg.com>
Cc: user@flink.apache.org <user@flink.apache.org>
Subject: Re: Slow Tests in Flink 1.15

Hi David,

I guess you're referring to [1]. But as Chesnay already pointed out in the
previous thread: It would be helpful to get more insights into what exactly
your tests are executing (logs, code, ...). That would help identify the
cause.

Can you give us a more complete stacktrace so we can see what call in
Flink is waiting for something?

Does this happen to all of your tests?
Can you provide us with an example that we can try ourselves? If not,
can you describe the test structure (e.g., is it using a
MiniClusterResource).

Matthias

[1] https://lists.apache.org/thread/yhhprwyf29kgypzzqdmjgft4qs25yyhk

On Mon, Sep 5, 2022 at 4:59 PM David Jost <david.j...@uniberg.com> wrote:
Hi,

we were going to upgrade our application from Flink 1.14.4 to Flink 1.15.2
when we noticed that all our job tests using a MiniClusterWithClientResource
are multiple times slower in 1.15 than before in 1.14. I, unfortunately, have
not found any mention of this in the changelog or documentation. The slowdown
is rather extreme, and I hope to find a solution to this. I saw it mentioned
once on the mailing list, but there was no (public) outcome to it.

I would appreciate any help on this. Thank you in advance.

Best
  David
