Hey all,

Sorry for not being active in the thread. I think the conclusion is rather positive: we do want to depend more on the Azure watchdog and discourage timeouts in JUnit tests from now on. I'll update our coding guidelines accordingly.
@Piotr Yes, this was a problem in the past, but it was solved quite recently in FLINK-21346, which I linked. Thread dumps will be taken and logs uploaded if the job reaches the global limit, similarly to what happens for the e2e tests.

Best,
Dawid

[1] https://issues.apache.org/jira/browse/FLINK-21346

On 04/05/2021 17:24, Piotr Nowojski wrote:
> Hi,
>
> I'm ok with removing most or even all timeouts. Just one thing.
>
> The reason behind using JUnit timeouts that I've heard (and I was adhering to it) was that the maven watchdog was taking the thread dump and killing the test process using a timeout based on log inactivity. Some tests were by nature prone to livelock (for example if a job had an infinite source), which was preventing the watchdog from kicking in. And if I remember correctly, we didn't have a global timeout for all tests; Travis was just killing our jobs without even uploading any logs to S3. So JUnit-level timeouts were very valuable in those kinds of cases, to at least get some logs, even if without a stack trace.
>
> I have no idea what the state of our watchdogs is right now, after all of the changes in the past years, so I don't know how relevant this reasoning still is.
>
> Piotrek
>
> On Fri, Apr 30, 2021 at 11:52 AM Till Rohrmann <trohrm...@apache.org> wrote:
>> Yes, you can click on the test job where a test failed. Then you can click on "1 artifact produced" (or, on the overview page, on the "X published (artifacts)" link). This brings you to the published artifacts page where we upload the logs for every job.
>>
>> Cheers,
>> Till
>>
>> On Fri, Apr 30, 2021 at 9:22 AM Dong Lin <lindon...@gmail.com> wrote:
>>> Thanks Till. Yes, you are right. The INFO logging is enabled; it is just dumped to a file (via the FileAppender) rather than the console.
>>>
>>> There is probably a way to retrieve that log file from AZP. I will ask other colleagues how to get this later.
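[Editor's note] Purely as illustration of the FileAppender setup discussed above (this is not Flink's actual CI configuration; the real file is the tools/ci/log4j.properties linked later in the thread, and the file name and pattern below are made up), a log4j2 test configuration that keeps the console quiet but still writes INFO logs to a file that CI can upload could look roughly like:

```properties
# Hypothetical sketch of a log4j2 test configuration: INFO logs go to a
# file (which CI uploads as an artifact) instead of the console.
rootLogger.level = INFO
rootLogger.appenderRef.file.ref = FileAppender

appender.file.name = FileAppender
appender.file.type = File
appender.file.fileName = ${sys:log.dir:-target}/mvn-tests.log
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{HH:mm:ss,SSS} %-5p %-60c %x - %m%n
```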
>>> On Thu, Apr 29, 2021 at 4:51 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>> I think for the maven tests we use this log4j.properties file [1].
>>>>
>>>> [1] https://github.com/apache/flink/blob/master/tools/ci/log4j.properties
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Wed, Apr 28, 2021 at 4:47 AM Dong Lin <lindon...@gmail.com> wrote:
>>>>> Thanks for the detailed explanations! Regarding the usage of timeouts, I now agree that it is better to remove per-test timeouts because it helps make our testing results more reliable and consistent.
>>>>>
>>>>> My previous concern was that it might not be a good idea to intentionally let a test hang in AZP in order to get the thread dump. Now I get that there are a few practical concerns around the usage of timeouts which make testing results unreliable (e.g. flakiness in the presence of VM migration).
>>>>>
>>>>> Regarding the logging level on AZP, it appears that we actually set "rootLogger.level = OFF" in most log4j2-test.properties files, which means that no INFO logs would be printed on AZP. For example, I tried to increase the log level in this <https://github.com/apache/flink/pull/15617> PR and was suggested in this <https://issues.apache.org/jira/browse/FLINK-22085?focusedCommentId=17321055&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17321055> comment to avoid increasing the log level. Did I miss something here?
>>>>>
>>>>> On Wed, Apr 28, 2021 at 2:22 AM Arvid Heise <ar...@apache.org> wrote:
>>>>>> Just to add to Dong Lin's list of cons of allowing timeouts:
>>>>>> - Any timeout value that you manually set is arbitrary. If it's set too low, you get test instabilities. What "too low" means depends on numerous factors, such as hardware and current utilization (especially I/O).
>>>>>> If you run in VMs and the VM is migrated while running a test, any reasonable timeout will probably be exceeded. While you could make a similar case for the overall timeout of tests, any smaller hiccup in the range of minutes will not impact the overall runtime much. The probability of a VM constantly migrating during the same stage is abysmally low.
>>>>>> - A timeout is more maintenance-intensive. It's one more knob with which you can tweak a build. If you change the test a bit, you also need to double-check the timeout. Hence, there have been quite a few commits that just increase timeouts.
>>>>>> - Whether a test uses a timeout or not is arbitrary: why do some ITs have a timeout and others don't? All IT tests are prone to time out if there are issues with resource allocation. Similarly, quite a few unit tests have timeouts while others don't, with no obvious pattern.
>>>>>> - An ill-set timeout reduces build reproducibility. Imagine having a release with such a timeout and users being unable to build Flink reliably.
>>>>>>
>>>>>> I'd also like to point out that we should not design around unstable tests if our overall goal is to have as many green builds as possible. If we assume that our builds fail more often than not, we should also look in the other direction and continue the builds on error. I'm not a big fan of that.
>>>>>>
>>>>>> One argument that I also heard is that timeouts ease local debugging during refactorings, as you can see multiple failures at the same time. But no one is keeping you from temporarily adding a timeout on your branch. Then we can be sure that the timeout is plausible for your hardware, and we avoid all the drawbacks mentioned above.
>>>>>>
>>>>>> @Robert Metzger <rmetz...@apache.org>
>>>>>>> If we had a global limit of 1 minute per test, we would have caught this case (and we would encourage people to be careful with CI time).
>>>>>>
>>>>>> There are quite a few tests that run longer, especially on a well-utilized build machine. A global limit is even worse than individual limits, as there is no value that fits all tests. If you screwed up and 200 tests hang, you'd run into the global timeout anyway. I'm also not sure what these additional hangs bring you except a huge log.
>>>>>>
>>>>>> I'm also not sure if it's really better in terms of CI time. For example, in UnalignedCheckpointRescaleITCase we test all known partitioners in one pipeline for correctness. For higher parallelism, that means the test regularly runs over 1 minute. If there were a global limit, I'd need to split the test into smaller chunks, and I'm positive that the sum of the chunks would run longer than before.
>>>>>>
>>>>>> PS: all tests on AZP will print INFO in the artifacts. There you can also retrieve the stack traces.
>>>>>> PPS: I also said that we should revalidate the current timeout on AZP. So the argument that we have >2h of precious CI time wasted is kind of contrived and is just due to some random defaults.
>>>>>>
>>>>>> On Tue, Apr 27, 2021 at 6:42 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>> I think we do capture the INFO logs of the test runs on AZP.
>>>>>>>
>>>>>>> I am also not sure whether we really caught slow tests with JUnit's timeout rule before. I think the default reaction is usually to increase the timeout to make the test pass. One way to find slow tests is to measure the time and look at the outliers.
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>>
>>>>>>> On Tue, Apr 27, 2021 at 3:49 PM Dong Lin <lindon...@gmail.com> wrote:
>>>>>>>> There is one more point that may be useful to consider here.
>>>>>>>>
>>>>>>>> In order to debug a deadlock that is not easily reproducible, seeing only the thread dump is likely not sufficient to figure out the root cause. We likely also need to enable INFO-level logging. Since AZP does not provide INFO-level logging by default, we either need to reproduce the bug locally or change the AZP log4j configuration temporarily. This further reduces the benefit of logging the thread dump (which comes at the cost of letting the AZP job hang).
>>>>>>>>
>>>>>>>> On Tue, Apr 27, 2021 at 9:34 PM Dong Lin <lindon...@gmail.com> wrote:
>>>>>>>>> Just to make sure I understand the proposal correctly: is the proposal to disallow the usage of @Test(timeout=...) in Flink JUnit tests?
>>>>>>>>>
>>>>>>>>> Here is my understanding of the pros/cons according to the discussion so far.
>>>>>>>>>
>>>>>>>>> Pros of allowing timeouts:
>>>>>>>>> 1) When there are tests that are unreasonably slow, it helps us catch those tests and thus improve the quality of our unit tests.
>>>>>>>>> 2) When there are tests that cause deadlock, it helps the AZP job fail fast instead of being blocked for 4 hours. This saves resources and also allows developers to get their PR tested again earlier (useful when the test failure is not relevant to their PR).
>>>>>>>>>
>>>>>>>>> Cons of allowing timeouts:
>>>>>>>>> 1) When there are tests that cause deadlock, we cannot see the thread dump of all threads, which makes debugging the issue harder.
>>>>>>>>>
>>>>>>>>> I would suggest that we still allow timeouts because the pros outweigh the cons.
>>>>>>>>>
>>>>>>>>> As far as I can tell, if we allow timeouts and encounter a deadlock bug in AZP, we still know which test (or test suite) fails. There is a good chance we can reproduce the deadlock locally (by running it 100 times) and get the debug information we need. In the rare case where the deadlock happens only on AZP, we can just disable the timeout for that particular test. So the lack of a thread dump is not really a concern.
>>>>>>>>>
>>>>>>>>> On the other hand, if we disallow timeouts, it will be very hard for us to catch low-quality tests. I don't know if there is a good alternative way to catch those tests.
>>>>>>>>>
>>>>>>>>> On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>>>>>>>>>> Hi devs!
>>>>>>>>>>
>>>>>>>>>> I wanted to bring up something that was discussed in a few independent groups of people in the past days. I'd like to revisit using timeouts in our JUnit tests. The suggestion would be to not use them anymore. The problem with timeouts is that we get no thread dump and no stack traces of the system as it hangs. If we were not using a timeout, the CI runner would catch the hanging test and create a thread dump, which is often a great starting point for debugging.
>>>>>>>>>>
>>>>>>>>>> This problem was spotted e.g. while debugging FLINK-22416[1].
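[Editor's note] The "reproduce the deadlock locally (by running it 100 times)" approach mentioned above can be sketched as a small shell loop. The commented-out mvn invocation is hypothetical (module and test class are placeholders); the `true` line stands in for the real test command:

```shell
# Stress a suspected flaky/deadlocking test locally by running it
# repeatedly, stopping at the first failure.
runs=${RUNS:-100}
i=1
while [ "$i" -le "$runs" ]; do
  # Replace the stand-in with the real invocation, e.g. (hypothetical):
  # mvn -pl flink-runtime test -Dtest=SomeSuspectTest \
  #   || { echo "failed on run $i"; exit 1; }
  true || { echo "failed on run $i"; exit 1; }  # stand-in for the test command
  i=$((i + 1))
done
echo "all $runs runs passed"
```

Combined with a local, temporary timeout on the branch (as suggested earlier in the thread), this usually surfaces the hang without relying on CI.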
>>>>>>>>>> In the past, thread dumps were not always taken for hanging tests, but this was changed quite recently in FLINK-21346[2]. I am happy to hear your opinions on it. If there are no objections, I would like to add the suggestion to the Coding Guidelines[3].
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>>
>>>>>>>>>> Dawid
>>>>>>>>>>
>>>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-22416
>>>>>>>>>>
>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-21346
>>>>>>>>>>
>>>>>>>>>> [3] https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries
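[Editor's note] For context on what a watchdog-produced thread dump contains, here is a minimal, self-contained Java sketch (not the actual FLINK-21346 watchdog code, and not a Flink API) that captures all live threads and their stack traces using only the standard library:

```java
import java.util.Map;

// Sketch of the kind of thread dump a CI watchdog can take for a hanging
// test: every live thread, its state, and its current stack frames.
public class ThreadDumpSketch {
    static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                // Each frame prints as "pkg.Class.method(File.java:line)".
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

Externally, `jstack <pid>` produces a similar (richer) dump, including lock-ownership information that is particularly useful for the deadlock cases discussed in this thread.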