Awesome work guys! And even more thanks for the detailed report. This troubleshooting summary will undoubtedly be useful for all our Maven projects!
Best,
Flavio

On 30 May 2016 23:47, "Ufuk Celebi" <u...@apache.org> wrote:

> Thanks for the effort, Max and Stephan! Happy to see the green light again.
>
> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <se...@apache.org> wrote:
> > Hi all!
> >
> > After a few weeks of terrible build issues, I am happy to announce that
> > the build works again properly, and we actually get meaningful CI results.
> >
> > Here is a story in many acts, from builds deep red to bright green joy.
> > Kudos to Max, who did most of this troubleshooting. This evening, Max and
> > I debugged the final issue and got the build back on track.
> >
> > ------------------
> > The Journey
> > ------------------
> >
> > (1) Failsafe Plugin
> >
> > The Maven Failsafe Build Plugin had a critical bug due to which failed
> > tests did not result in a failed build.
> >
> > That is a pretty bad bug for a plugin whose only task is to run tests and
> > fail the build if a test fails.
> >
> > After we recognized that, we upgraded the Failsafe Plugin.
> >
> >
> > (2) Failsafe Plugin Dependency Issues
> >
> > After the upgrade, the Failsafe Plugin behaved differently and did not
> > interoperate with Dependency Shading any more.
> >
> > Because of that, we switched to the Surefire Plugin.
> >
> >
> > (3) Fixing all the issues introduced in the meantime
> >
> > Naturally, a number of test instabilities had been introduced, which
> > needed to be fixed.
> >
> >
> > (4) Yarn Tests and Test Scope Refactoring
> >
> > In the meantime, a Pull Request was merged that moved the Yarn Tests to
> > the test scope.
> > Because the configuration searched for tests in the "main" scope, no Yarn
> > tests were executed for a while, until the scope was fixed.
> >
> >
> > (5) Yarn Tests and JMX Metrics
> >
> > After the Yarn Tests were re-activated, we saw them fail due to warnings
> > created by the newly introduced metrics code. We could fix that by
> > updating the metrics code and temporarily not registering JMX beans for
> > all metrics.
> >
> >
> > (6) Yarn / Surefire Deadlock
> >
> > Finally, some Yarn tests failed reliably in Maven (though not in the
> > IDE). It turned out that those tests exercise a command line interface
> > that interacts with the standard input stream.
> >
> > The newly deployed Surefire Plugin uses standard input as well, for
> > communication with forked JVMs. Since Surefire internally locks the
> > standard input stream, the Yarn CLI cannot poll the standard input stream
> > without locking up and stalling the tests.
> >
> > We adjusted the tests and now the build happily builds again.
> >
> > -----------------
> > Conclusions:
> > -----------------
> >
> > - CI is terribly crucial. It took us weeks to deal with the fallout of
> > having a period of unreliable CI.
> >
> > - Maven could do a better job. A bug as crucial as the one that started
> > our problem should not occur in a test plugin like Surefire. Also, the
> > constant change of semantics and dependency scopes is annoying. The
> > semantic changes are subtle, but for a build as complex as Flink, they
> > make a difference.
> >
> > - File-based communication is rarely a good idea. The bug in the
> > Failsafe plugin was caused by improper file-based communication, and
> > some of our discovered instabilities as well.
> >
> > Greetings,
> > Stephan
> >
> >
> > PS: Some issues and mysteries remain for us to solve: When we allow our
> > metrics subsystem to register JMX beans, we see some tests failing due to
> > spontaneous JVM process kills. Whoever has a pointer there, please ping
> > us!
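
For other Maven projects that run into the situation described in (6), here is a minimal sketch of one way to keep a CLI test away from the process-wide standard input that Surefire uses for its forked JVMs. It is not necessarily how the Flink tests were adjusted; the report only says the tests were changed. The class `InteractiveCli` and its "stop" command are hypothetical names chosen for illustration. The idea is that the CLI reads from an injected `InputStream`, so a test can pass scripted input instead of `System.in`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an interactive CLI; this is NOT Flink's Yarn CLI.
public class InteractiveCli {

    private final InputStream input;

    // The input stream is injected, so tests never have to touch System.in.
    // Production code would pass System.in here.
    public InteractiveCli(InputStream input) {
        this.input = input;
    }

    // Reads commands line by line until "stop" or end of stream and
    // returns the non-empty commands seen before that point.
    public List<String> run() throws IOException {
        List<String> commands = new ArrayList<>();
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(input, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            String command = line.trim();
            if (command.equals("stop")) {
                break;
            }
            if (!command.isEmpty()) {
                commands.add(command);
            }
        }
        return commands;
    }

    public static void main(String[] args) throws IOException {
        // Scripted input replaces the real standard input stream in a test.
        InputStream scripted = new ByteArrayInputStream(
                "help\nstatus\nstop\nignored\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(new InteractiveCli(scripted).run()); // prints [help, status]
    }
}
```

With this shape, a test running under Surefire never reads from the stream that the plugin itself holds, while the production entry point can still construct the CLI with `System.in`.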