Hi,
we recently had similar problems in Nussknacker (tests on fake
sources). My colleague found out that it is due to the
ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH flag
(https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/#waiting-for-the-final-checkpoint-before-task-exit)
being set to true by default in 1.15. After the fake source ends, the job
waits for the next checkpoint to be triggered before finishing. In the end
we reduced the checkpoint interval in some places and disabled checkpoints
altogether in some other tests.
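A minimal sketch of how such workarounds might look in a test, assuming Flink 1.15 on the classpath. The class name, the method names and the 100 ms interval are made up for illustration. The second method switches off the feature flag itself, as shown in the linked documentation, which is a variant of what we did rather than exactly it. Disabling checkpoints altogether simply means not calling enableCheckpointing at all.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical helper class for tests; not part of Flink or Nussknacker.
public class TestEnvironments {

    // Keep checkpointing, but trigger it often, so a finished job
    // does not wait long for its final checkpoint.
    static StreamExecutionEnvironment withShortCheckpointInterval() {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(100L); // interval in milliseconds (assumed value)
        return env;
    }

    // Alternative: do not wait for a final checkpoint after tasks finish.
    static StreamExecutionEnvironment withoutFinalCheckpoint() {
        Configuration config = new Configuration();
        config.set(
                ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH,
                false);
        return StreamExecutionEnvironment.getExecutionEnvironment(config);
    }
}
```

Which variant fits depends on the test: a short interval keeps the 1.15 default behaviour (including final commits in sinks), while switching the flag off restores the pre-1.15 behaviour.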
maciek
On 09.09.2022 10:44, David Jost wrote:
Hey, sorry for not coming back to this earlier, but I was hoping to better
isolate the problem for analysis.
Maybe for comparison with Alexey's case: we have four different pipelines,
which are all built similarly. Though we use Kafka in the actual jobs,
the tests use fake sources and sinks. Only the tests which use the
MiniClusterWithClientResource are affected (we have one badly written test,
which runs the pipeline without a MiniCluster, and it is not affected).
I analysed it a bit and found that the job hangs for about 30s after
the last event has been pushed out of the sink. Summing up these 30s for each
(job) test accounts for the additional time used by all tests in the end.
So I would assume that there is some kind of wind-down timer or similar, which
holds up everything.
I hope this is helpful somehow. I would love to find the source of this issue.
I was hoping to isolate the issue in an MWE, but have been unsuccessful so far.
NB: I ran the tests with both MiniClusterWithClientResource (with
adjustments for JUnit 5) and MiniClusterExtension, but there was no noticeable
difference.
On 7. Sep 2022, at 14:41, Alexey Trenikhun <yen...@msn.com> wrote:
The class contains a single test method, which runs a single job (the job is quite
complex), then waits for the job to be running, and after that waits for data to
appear in the output topic, which doesn't happen within 5 minutes (the test
timeout). I tried it under a debugger and set a breakpoint in the Kafka record
deserializer: it is hit, but very slowly, roughly 3 records per 5 minutes (the
topic was pre-populated).
No table/sql API, only the stream API.
From: Chesnay Schepler <ches...@apache.org>
Sent: Wednesday, September 7, 2022 5:20 AM
To: Alexey Trenikhun <yen...@msn.com>; David Jost <david.j...@uniberg.com>; Matthias
Pohl <matthias.p...@aiven.io>
Cc: user@flink.apache.org <user@flink.apache.org>
Subject: Re: Slow Tests in Flink 1.15
The test that has gotten slow: how many test cases does it actually contain, and how many jobs does it actually run?
Are these tests using the table/sql API?
On 07/09/2022 14:15, Alexey Trenikhun wrote:
We are also observing an extreme slowdown (5+ minutes vs 15 seconds) in 1 of 2
integration tests. Both tests use Kafka. The slow test uses
org.apache.flink.runtime.minicluster.TestingMiniCluster; it tests the
complete job, which consumes and produces Kafka messages. The unaffected test
extends org.apache.flink.test.util.AbstractTestBase, which uses
MiniClusterWithClientResource; this test is simpler and only produces Kafka
messages.
Thanks,
Alexey
From: Matthias Pohl via user <user@flink.apache.org>
Sent: Tuesday, September 6, 2022 6:36 AM
To: David Jost <david.j...@uniberg.com>
Cc: user@flink.apache.org <user@flink.apache.org>
Subject: Re: Slow Tests in Flink 1.15
Hi David,
I guess you're referring to [1]. But as Chesnay already pointed out in the
previous thread: it would be helpful to get more insight into what exactly
your tests are executing (logs, code, ...). That would help identify the
cause.
Can you give us a more complete stacktrace so we can see which call in
Flink is waiting for something?
Does this happen to all of your tests?
Can you provide us with an example that we can try ourselves? If not,
can you describe the test structure (e.g., is it using a
MiniClusterResource)?
Matthias
[1] https://lists.apache.org/thread/yhhprwyf29kgypzzqdmjgft4qs25yyhk
On Mon, Sep 5, 2022 at 4:59 PM David Jost <david.j...@uniberg.com> wrote:
Hi,
we were going to upgrade our application from Flink 1.14.4 to Flink 1.15.2
when we noticed that all our job tests using a MiniClusterWithClientResource
are several times slower in 1.15 than before in 1.14. Unfortunately, I have
not found any mention of this in the changelog or documentation. The
slowdown is rather extreme, and I hope to find a solution to it. I saw it
mentioned once on the mailing list, but there was no (public) outcome.
I would appreciate any help on this. Thank you in advance.
Best
David