I think we do capture the INFO logs of the test runs on AZP. I am also not sure whether we really caught slow tests with Junit's timeout rule before. I think the default is usually to increase the timeout to make the test pass. One way to find slow tests is to measure the time and look at the outliers.
Cheers, Till On Tue, Apr 27, 2021 at 3:49 PM Dong Lin <lindon...@gmail.com> wrote: > There is one more point that may be useful to consider here. > > In order to debug deadlock that is not easily reproducible, it is likely > not sufficient to see only the thread dump to figure out the root cause. We > likely need to enable the INFO level logging. Since AZP does not provide > INFO level logging by default, we either need to reproduce the bug locally > or change the AZP log4j temporarily. This further reduces the benefit of > logging the thread dump (which comes at the cost of letting the AZP job > hang). > > > On Tue, Apr 27, 2021 at 9:34 PM Dong Lin <lindon...@gmail.com> wrote: > > > Just to make sure I understand the proposal correctly: is the proposal to > > disallow the usage of @Test(timeout=...) for Flink Junit tests? > > > > Here is my understanding of the pros/cons according to the discussion so > > far. > > > > Pros of allowing timeout: > > 1) When there are tests that are unreasonably slow, it helps us > > catch those tests and thus increase the quality of unit tests. > > 2) When there are tests that cause deadlock, it helps the AZP job fail > > fast instead of being blocked for 4 hours. This saves resources and also > > allows developers to get their PR tested again earlier (useful when the > > test failure is not relevant to their PR). > > > > Cons of allowing timeout: > > 1) When there are tests that cause deadlock, we could not see the thread > > dump of all threads, which makes debugging the issue harder. > > > > I would suggest that we should still allow timeout because the pros > > outweigh the cons. > > > > As far as I can tell, if we allow timeout and encounter a deadlock bug in > > AZP, we still know which test (or test suite) fails. There is a good > chance > > we can reproduce the deadlock locally (by running it 100 times) and get > the > > debug information we need. In the rare case where the deadlock happens > only > > on AZP, we can just disable the timeout for that particular test. So the > > lack of thread dump is not really a concern. > > > > On the other hand, if we disallow timeout, it will be very hard for us to > > catch low-quality tests. I don't know if there is a good alternative way > to > > catch those tests. > > > > > > > > On Mon, Apr 26, 2021 at 3:54 PM Dawid Wysakowicz <dwysakow...@apache.org > > > > wrote: > > > >> Hi devs! > >> > >> I wanted to bring up something that was discussed in a few independent > >> groups of people in the past days. I'd like to revise using timeouts in > >> our JUnit tests. The suggestion would be not to use them anymore. The > >> problem with timeouts is that we have no thread dump and stack traces of > >> the system as it hangs. If we were not using a timeout, the CI runner > >> would have caught the timeout and created a thread dump which often is a > >> great starting point for debugging. > >> > >> This problem has been spotted e.g. during debugging FLINK-22416[1]. In > >> the past thread dumps were not always taken for hanging tests, but it > >> was changed quite recently in FLINK-21346[2]. I am happy to hear your > >> opinions on it. If there are no objections I would like to add the > >> suggestion to the Coding Guidelines[3] > >> > >> Best, > >> > >> Dawid > >> > >> > >> [1] https://issues.apache.org/jira/browse/FLINK-22416 > >> > >> [2] https://issues.apache.org/jira/browse/FLINK-21346 > >> > >> [3] > >> > >> > https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries > >> > >> > >> >