Thanks for the response, Dawid. In this case I don't see a reason for timeouts in JUnit. I agree with the previously mentioned points and I think it doesn't make sense to use them to limit test duration.
Piotrek

On Tue, 4 May 2021 at 17:44, Dawid Wysakowicz <dwysakow...@apache.org> wrote:

> Hey all,
>
> Sorry I've not been active in the thread. I think the conclusion is rather positive and we do want to depend more on the Azure watchdog and discourage timeouts in JUnit tests from now on. I'll update our coding guidelines accordingly.
>
> @Piotr Yes, this was a problem in the past, but that was solved quite recently in FLINK-21346, which I linked. The thread dumps will be taken and the logs uploaded if the job reaches the global limit, similarly to what happens for the e2e tests.
>
> Best,
>
> Dawid
>
> [1] https://issues.apache.org/jira/browse/FLINK-21346
>
> On 04/05/2021 17:24, Piotr Nowojski wrote:
> > Hi,
> >
> > I'm ok with removing most or even all timeouts. Just one thing.
> >
> > The reason behind using JUnit timeouts that I've heard (and was adhering to) was that the Maven watchdog was taking the thread dump and killing the test process using a timeout based on log inactivity. Some tests were by nature prone to livelock (for example if a job had an infinite source), which was preventing the watchdog from kicking in. And if I remember correctly we didn't have a global timeout for all tests; Travis was just killing our jobs without even uploading any logs to S3. So JUnit-level timeouts were very valuable in those kinds of cases, to at least get some logs, even if without a stack trace.
> >
> > I have no idea what the state of our watchdogs is right now, after all of the changes in the past years, so I don't know how relevant this reasoning is.
> >
> > Piotrek
> >
> > On Fri, 30 Apr 2021 at 11:52, Till Rohrmann <trohrm...@apache.org> wrote:
> >
> >> Yes, you can click on the test job where a test failed. Then you can click on "1 artifact produced" (or, on the overview page, on "X published (artifacts)").
> >> This brings you to the published artifacts page, where we upload the logs for every job.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Fri, Apr 30, 2021 at 9:22 AM Dong Lin <lindon...@gmail.com> wrote:
> >>
> >>> Thanks Till. Yes, you are right. The INFO logging is enabled; it is just dumped to a file (the FileAppender) rather than to the console.
> >>>
> >>> There is probably a way to retrieve that log file from AZP. I will ask other colleagues how to get this later.
> >>>
> >>> On Thu, Apr 29, 2021 at 4:51 PM Till Rohrmann <trohrm...@apache.org> wrote:
> >>>
> >>>> I think for the Maven tests we use this log4j.properties file [1].
> >>>>
> >>>> [1] https://github.com/apache/flink/blob/master/tools/ci/log4j.properties
> >>>>
> >>>> Cheers,
> >>>> Till
> >>>>
> >>>> On Wed, Apr 28, 2021 at 4:47 AM Dong Lin <lindon...@gmail.com> wrote:
> >>>>
> >>>>> Thanks for the detailed explanations! Regarding the usage of timeouts, I now agree that it is better to remove per-test timeouts because it helps make our testing results more reliable and consistent.
> >>>>>
> >>>>> My previous concern was that it might not be a good idea to intentionally let the test hang in AZP in order to get the thread dump. Now I get that there are a few practical concerns around the usage of timeouts which make testing results unreliable (e.g. flakiness in the presence of VM migration).
> >>>>>
> >>>>> Regarding the log level on AZP, it appears that we actually set "rootLogger.level = OFF" in most log4j2-test.properties files, which means that no INFO logs would be printed on AZP.
> >>>>> For example, I tried to increase the log level in this PR <https://github.com/apache/flink/pull/15617> and was suggested in this comment <https://issues.apache.org/jira/browse/FLINK-22085?focusedCommentId=17321055&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17321055> to avoid increasing the log level. Did I miss something here?
> >>>>>
> >>>>> On Wed, Apr 28, 2021 at 2:22 AM Arvid Heise <ar...@apache.org> wrote:
> >>>>>
> >>>>>> Just to add to Dong Lin's list of cons of allowing timeouts:
> >>>>>>
> >>>>>> - Any timeout value that you manually set is arbitrary. If it's set too low, you get test instabilities. What "too low" means depends on numerous factors, such as hardware and current utilization (especially I/O). If you run in VMs and the VM is migrated while running a test, any reasonable timeout will probably fail. While you could make a similar case for the overall timeout of the tests, any smaller hiccup in the range of minutes will not impact the overall runtime much, and the probability of a VM constantly migrating during the same stage is abysmally low.
> >>>>>> - A timeout is more maintenance-intensive. It's one more knob with which you can tweak a build. If you change the test a bit, you also need to double-check the timeout. Hence, there have been quite a few commits that just increase timeouts.
> >>>>>> - Whether a test uses a timeout or not is arbitrary: why do some ITs have a timeout and others don't? All IT tests are prone to time out if there are issues with resource allocation. Similarly, there are quite a few unit tests with timeouts while others don't have them, with no obvious pattern.
> >>>>>> - An ill-set timeout reduces build reproducibility. Imagine having a release with such a timeout and users being unable to build Flink reliably.
> >>>>>>
> >>>>>> I'd also like to point out that we should not cater to unstable tests if our overall goal is to have as many green builds as possible. If we assume that our builds fail more often than not, we should also look in the other direction and continue the builds on error. I'm not a big fan of that.
> >>>>>>
> >>>>>> One argument that I also heard is that timeouts ease local debugging in case of refactorings, as you can see multiple failures at the same time. But no one is keeping you from temporarily adding a timeout on your branch. Then we can be sure that the timeout is plausible for your hardware, and we avoid all of the above-mentioned drawbacks.
> >>>>>>
> >>>>>> @Robert Metzger <rmetz...@apache.org>
> >>>>>>
> >>>>>>> If we had a global limit of 1 minute per test, we would have caught this case (and we would encourage people to be careful with CI time).
> >>>>>>
> >>>>>> There are quite a few tests that run longer, especially on a well-utilized build machine. A global limit is even worse than individual limits, as there is no value that fits all cases. If you screwed up and 200 tests hang, you'd run into the global timeout anyway. I'm also not sure what these additional hangs bring you except a huge log.
> >>>>>>
> >>>>>> I'm also not sure if it's really better in terms of CI time. For example, for UnalignedCheckpointRescaleITCase, we test all known partitioners in one pipeline for correctness. For higher parallelism, that means the test regularly runs over 1 minute.
> >>>>>> If there is a global limit, I'd need to split the test into smaller chunks, and I'm positive that the sum of the chunks would take longer than before.
> >>>>>>
> >>>>>> PS: all tests on AZP print INFO in the artifacts. There you can also retrieve the stack traces.
> >>>>>>
> >>>>>> PPS: I also said that we should revalidate the current timeout on AZP. So the argument that we have >2h of precious CI time wasted is kind of constructed and is just due to some random defaults.
> >>>>>>
> >>>>>> On Tue, Apr 27, 2021 at 6:42 PM Till Rohrmann <trohrm...@apache.org> wrote:
> >>>>>>
> >>>>>>> I think we do capture the INFO logs of the test runs on AZP.
> >>>>>>>
> >>>>>>> I am also not sure whether we really caught slow tests with JUnit's timeout rule before. I think the default is usually to increase the timeout to make the test pass. One way to find slow tests is to measure the time and look at the outliers.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Till
> >>>>>>>
> >>>>>>> On Tue, Apr 27, 2021 at 3:49 PM Dong Lin <lindon...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> There is one more point that may be useful to consider here.
> >>>>>>>>
> >>>>>>>> In order to debug a deadlock that is not easily reproducible, it is likely not sufficient to see only the thread dump to figure out the root cause. We likely need to enable INFO-level logging. Since AZP does not provide INFO-level logging by default, we either need to reproduce the bug locally or change the AZP log4j configuration temporarily. This further reduces the benefit of logging the thread dump (which comes at the cost of letting the AZP job hang).
> >>>>>>>>
> >>>>>>>> On Tue, Apr 27, 2021 at 9:34 PM Dong Lin <lindon...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Just to make sure I understand the proposal correctly: is the proposal to disallow the usage of @Test(timeout=...) for Flink JUnit tests?
> >>>>>>>>>
> >>>>>>>>> Here is my understanding of the pros and cons according to the discussion so far.
> >>>>>>>>>
> >>>>>>>>> Pros of allowing timeouts:
> >>>>>>>>> 1) When there are tests that are unreasonably slow, it helps us catch those tests and thus increases the quality of the unit tests.
> >>>>>>>>> 2) When there are tests that cause a deadlock, it helps the AZP job fail fast instead of being blocked for 4 hours. This saves resources and also allows developers to get their PR tested again earlier (useful when the test failure is not relevant to their PR).
> >>>>>>>>>
> >>>>>>>>> Cons of allowing timeouts:
> >>>>>>>>> 1) When there are tests that cause a deadlock, we cannot see the thread dump of all threads, which makes debugging the issue harder.
> >>>>>>>>>
> >>>>>>>>> I would suggest that we still allow timeouts because the pros outweigh the cons.
> >>>>>>>>>
> >>>>>>>>> As far as I can tell, if we allow timeouts and encounter a deadlock bug in AZP, we still know which test (or test suite) fails. There is a good chance we can reproduce the deadlock locally (by running it 100 times) and get the debug information we need. In the rare case where the deadlock happens only on AZP, we can just disable the timeout for that particular test. So the lack of a thread dump is not really a concern.
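For context on what @Test(timeout=...) actually does: JUnit 4 essentially runs the annotated test body on a separate thread and fails the test when the deadline passes, abandoning the stuck thread. A JDK-only sketch of that mechanism (the helper names are made up; this is not JUnit's actual implementation):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of what a per-test timeout boils down to: run the body on another
// thread and give up after the deadline. The hung thread is simply abandoned,
// which is why an external watchdog's full thread dump tells you more than a
// per-test timeout failure does.
public class TimeoutRunner {
    public static boolean runWithTimeout(Runnable testBody, long millis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "test-body");
            t.setDaemon(true); // don't keep the JVM alive if the body hangs
            return t;
        });
        Future<?> result = executor.submit(testBody);
        try {
            result.get(millis, TimeUnit.MILLISECONDS);
            return true; // the test finished within the deadline
        } catch (TimeoutException e) {
            return false; // JUnit would report this as a test failure
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```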
> >>>>>>>>> On the other hand, if we disallow timeouts, it will be very hard for us to catch low-quality tests. I don't know if there is a good alternative way to catch those tests.
> >>>>>>>>>
> >>>>>>>>> On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi devs!
> >>>>>>>>>>
> >>>>>>>>>> I wanted to bring up something that was discussed in a few independent groups of people in the past days. I'd like to revise using timeouts in our JUnit tests. The suggestion would be not to use them anymore. The problem with timeouts is that we get no thread dump and no stack traces of the system as it hangs. If we were not using a timeout, the CI runner would have caught the hang and created a thread dump, which is often a great starting point for debugging.
> >>>>>>>>>>
> >>>>>>>>>> This problem was spotted e.g. during the debugging of FLINK-22416 [1]. In the past, thread dumps were not always taken for hanging tests, but this was changed quite recently in FLINK-21346 [2]. I am happy to hear your opinions on it.
> >>>>>>>>>> If there are no objections I would like to add the suggestion to the Coding Guidelines [3].
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>>
> >>>>>>>>>> Dawid
> >>>>>>>>>>
> >>>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-22416
> >>>>>>>>>>
> >>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-21346
> >>>>>>>>>>
> >>>>>>>>>> [3] https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries