Thanks for the response, Dawid. In this case I don't see a reason for timeouts in JUnit. I agree with the previously mentioned points and I think it doesn't make sense to use them to limit test duration.
Piotrek

On Tue, 4 May 2021 at 17:44, Dawid Wysakowicz <dwysakow...@apache.org> wrote:

> Hey all,
>
> Sorry I've not been active in the thread. I think the conclusion is rather positive and we do want to depend more on the Azure watchdog and discourage timeouts in JUnit tests from now on. I'll update our coding guidelines accordingly.
>
> @Piotr Yes, this was a problem in the past, but that was solved quite recently in FLINK-21346, which I linked. The thread dumps will be taken and the logs uploaded if the job reaches the global limit, similarly to what happens for the e2e tests.
>
> Best,
>
> Dawid
>
> [1] https://issues.apache.org/jira/browse/FLINK-21346
>
> On 04/05/2021 17:24, Piotr Nowojski wrote:
> > Hi,
> >
> > I'm ok with removing most or even all timeouts. Just one thing.
> >
> > The reason behind using JUnit timeouts that I've heard (and was adhering to) was that the Maven watchdog was taking the thread dump and killing the test process using a timeout based on log inactivity. Some tests were by nature prone to livelock (for example if a job had an infinite source), which was preventing the watchdog from kicking in. And if I remember correctly we didn't have a global timeout for all tests; Travis was just killing our jobs without even uploading any logs to S3. So JUnit-level timeouts were very valuable in those kinds of cases, to at least get some logs, even if without a stack trace.
> >
> > I have no idea what the state of our watchdogs is right now, after all of the changes in the past years, so I don't know how relevant this reasoning is.
> >
> > Piotrek
> >
> > On Fri, 30 Apr 2021 at 11:52, Till Rohrmann <trohrm...@apache.org> wrote:
> >
> >> Yes, you can click on the test job where a test failed. Then you can click on "1 artifact produced" (or, on the overview page, on "X published (artifacts)").
> >> This brings you to the published artifacts page, where we upload the logs for every job.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Fri, Apr 30, 2021 at 9:22 AM Dong Lin <lindon...@gmail.com> wrote:
> >>
> >>> Thanks Till. Yes, you are right. The INFO logging is enabled; it is just dumped to a file (the FileAppender) rather than to the console.
> >>>
> >>> There is probably a way to retrieve that log file from AZP. I will ask other colleagues how to get this later.
> >>>
> >>> On Thu, Apr 29, 2021 at 4:51 PM Till Rohrmann <trohrm...@apache.org> wrote:
> >>>
> >>>> I think for the Maven tests we use this log4j.properties file [1].
> >>>>
> >>>> [1] https://github.com/apache/flink/blob/master/tools/ci/log4j.properties
> >>>>
> >>>> Cheers,
> >>>> Till
> >>>>
> >>>> On Wed, Apr 28, 2021 at 4:47 AM Dong Lin <lindon...@gmail.com> wrote:
> >>>>
> >>>>> Thanks for the detailed explanations! Regarding the usage of timeouts, I now agree that it is better to remove per-test timeouts because it helps make our testing results more reliable and consistent.
> >>>>>
> >>>>> My previous concern was that it might not be a good idea to intentionally let the test hang in AZP in order to get the thread dump. Now I get that there are a few practical concerns around the usage of timeouts which make testing results unreliable (e.g. flakiness in the presence of VM migration).
> >>>>>
> >>>>> Regarding the log level on AZP, it appears that we actually set "rootLogger.level = OFF" in most log4j2-test.properties files, which means that no INFO logs would be printed on AZP.
> >>>>> For example, I tried to increase the log level in this PR <https://github.com/apache/flink/pull/15617> and was suggested in this comment <https://issues.apache.org/jira/browse/FLINK-22085?focusedCommentId=17321055&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17321055> to avoid increasing the log level. Did I miss something here?
> >>>>>
> >>>>> On Wed, Apr 28, 2021 at 2:22 AM Arvid Heise <ar...@apache.org> wrote:
> >>>>>
> >>>>>> Just to add to Dong Lin's list of cons of allowing timeouts:
> >>>>>>
> >>>>>> - Any timeout value that you manually set is arbitrary. If it's set too low, you get test instabilities. What "too low" means depends on numerous factors, such as hardware and current utilization (especially I/O). If you run in VMs and the VM is migrated while running a test, any reasonable timeout will probably fail. While you could make a similar case for the overall timeout of the tests, any smaller hiccup in the range of minutes will not impact the overall runtime much, and the probability of a VM constantly migrating during the same stage is abysmally low.
> >>>>>> - A timeout is more maintenance-intensive. It's one more knob with which you can tweak a build. If you change the test a bit, you also need to double-check the timeout. Hence, there have been quite a few commits that just increase timeouts.
> >>>>>> - Whether a test uses a timeout or not is arbitrary: why do some ITs have a timeout and others don't? All IT tests are prone to time out if there are issues with resource allocation. Similarly, there are quite a few unit tests with timeouts while others don't have them, with no obvious pattern.
> >>>>>> - An ill-set timeout reduces build reproducibility. Imagine having a release with such a timeout and users being unable to build Flink reliably.
> >>>>>>
> >>>>>> I'd also like to point out that we should not cater to unstable tests if our overall goal is to have as many green builds as possible. If we assume that our builds fail more often than not, we should also look in the other direction and continue the builds on error. I'm not a big fan of that.
> >>>>>>
> >>>>>> One argument that I also heard is that timeouts ease local debugging in case of refactorings, as you can see multiple failures at the same time. But no one is keeping you from temporarily adding a timeout on your branch. Then we can be sure that the timeout is plausible for your hardware, and we avoid all of the above-mentioned drawbacks.
> >>>>>>
> >>>>>> @Robert Metzger <rmetz...@apache.org>
> >>>>>>
> >>>>>>> If we had a global limit of 1 minute per test, we would have caught this case (and we would encourage people to be careful with CI time).
> >>>>>>
> >>>>>> There are quite a few tests that run longer, especially on a well-utilized build machine. A global limit is even worse than individual limits, as there is no value that fits all cases. If you screwed up and 200 tests hang, you'd run into the global timeout anyway. I'm also not sure what these additional hangs bring you except a huge log.
> >>>>>>
> >>>>>> I'm also not sure if it's really better in terms of CI time. For example, for UnalignedCheckpointRescaleITCase, we test all known partitioners in one pipeline for correctness. For higher parallelism, that means the test regularly runs over 1 minute.
> >>>>>> If there is a global limit, I'd need to split the test into smaller chunks, and I'm positive that the sum of the chunks would take longer than before.
> >>>>>>
> >>>>>> PS: all tests on AZP print INFO in the artifacts. There you can also retrieve the stack traces.
> >>>>>>
> >>>>>> PPS: I also said that we should revalidate the current timeout on AZP. So the argument that we have >2h of precious CI time wasted is kind of constructed and is just due to some random defaults.
> >>>>>>
> >>>>>> On Tue, Apr 27, 2021 at 6:42 PM Till Rohrmann <trohrm...@apache.org> wrote:
> >>>>>>
> >>>>>>> I think we do capture the INFO logs of the test runs on AZP.
> >>>>>>>
> >>>>>>> I am also not sure whether we really caught slow tests with JUnit's timeout rule before. I think the default is usually to increase the timeout to make the test pass. One way to find slow tests is to measure the time and look at the outliers.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Till
> >>>>>>>
> >>>>>>> On Tue, Apr 27, 2021 at 3:49 PM Dong Lin <lindon...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> There is one more point that may be useful to consider here.
> >>>>>>>>
> >>>>>>>> In order to debug a deadlock that is not easily reproducible, it is likely not sufficient to see only the thread dump to figure out the root cause. We likely need to enable INFO-level logging. Since AZP does not provide INFO-level logging by default, we either need to reproduce the bug locally or change the AZP log4j configuration temporarily. This further reduces the benefit of logging the thread dump (which comes at the cost of letting the AZP job hang).
> >>>>>>>>
> >>>>>>>> On Tue, Apr 27, 2021 at 9:34 PM Dong Lin <lindon...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Just to make sure I understand the proposal correctly: is the proposal to disallow the usage of @Test(timeout=...) for Flink JUnit tests?
> >>>>>>>>>
> >>>>>>>>> Here is my understanding of the pros and cons according to the discussion so far.
> >>>>>>>>>
> >>>>>>>>> Pros of allowing timeouts:
> >>>>>>>>> 1) When there are tests that are unreasonably slow, it helps us catch those tests and thus increases the quality of the unit tests.
> >>>>>>>>> 2) When there are tests that cause a deadlock, it helps the AZP job fail fast instead of being blocked for 4 hours. This saves resources and also allows developers to get their PR tested again earlier (useful when the test failure is not relevant to their PR).
> >>>>>>>>>
> >>>>>>>>> Cons of allowing timeouts:
> >>>>>>>>> 1) When there are tests that cause a deadlock, we cannot see the thread dump of all threads, which makes debugging the issue harder.
> >>>>>>>>>
> >>>>>>>>> I would suggest that we still allow timeouts because the pros outweigh the cons.
> >>>>>>>>>
> >>>>>>>>> As far as I can tell, if we allow timeouts and encounter a deadlock bug in AZP, we still know which test (or test suite) fails. There is a good chance we can reproduce the deadlock locally (by running it 100 times) and get the debug information we need. In the rare case where the deadlock happens only on AZP, we can just disable the timeout for that particular test. So the lack of a thread dump is not really a concern.
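For context on what @Test(timeout=...) actually does: JUnit 4 essentially runs the annotated test body on a separate thread and fails the test when the deadline passes, abandoning the stuck thread. A JDK-only sketch of that mechanism (the helper names are made up; this is not JUnit's actual implementation):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of what a per-test timeout boils down to: run the body on another
// thread and give up after the deadline. The hung thread is simply abandoned,
// which is why an external watchdog's full thread dump tells you more than a
// per-test timeout failure does.
public class TimeoutRunner {
    public static boolean runWithTimeout(Runnable testBody, long millis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "test-body");
            t.setDaemon(true); // don't keep the JVM alive if the body hangs
            return t;
        });
        Future<?> result = executor.submit(testBody);
        try {
            result.get(millis, TimeUnit.MILLISECONDS);
            return true; // the test finished within the deadline
        } catch (TimeoutException e) {
            return false; // JUnit would report this as a test failure
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```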
> >>>>>>>>> On the other hand, if we disallow timeouts, it will be very hard for us to catch low-quality tests. I don't know if there is a good alternative way to catch those tests.
> >>>>>>>>>
> >>>>>>>>> On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi devs!
> >>>>>>>>>>
> >>>>>>>>>> I wanted to bring up something that was discussed in a few independent groups of people in the past days. I'd like to revise using timeouts in our JUnit tests. The suggestion would be not to use them anymore. The problem with timeouts is that we get no thread dump and no stack traces of the system as it hangs. If we were not using a timeout, the CI runner would have caught the hang and created a thread dump, which is often a great starting point for debugging.
> >>>>>>>>>>
> >>>>>>>>>> This problem was spotted e.g. during the debugging of FLINK-22416 [1]. In the past, thread dumps were not always taken for hanging tests, but this was changed quite recently in FLINK-21346 [2]. I am happy to hear your opinions on it.
> >>>>>>>>>> If there are no objections I would like to add the suggestion to the Coding Guidelines [3].
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>>
> >>>>>>>>>> Dawid
> >>>>>>>>>>
> >>>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-22416
> >>>>>>>>>>
> >>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-21346
> >>>>>>>>>>
> >>>>>>>>>> [3] https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries