Hi all!

After a few weeks of terrible build issues, I am happy to announce that the build works properly again, and we actually get meaningful CI results.

Here is a story in many acts, from builds deep red to bright green joy. Kudos to Max, who did most of this troubleshooting. This evening, Max and I debugged the final issue and got the build back on track.

------------------
 The Journey
------------------

(1) Failsafe Plugin

The Maven Failsafe Build Plugin had a critical bug due to which failed tests did not result in a failed build. That is a pretty bad bug for a plugin whose only task is to run tests and fail the build if a test fails.

After we recognized that, we upgraded the Failsafe Plugin.

(2) Failsafe Plugin Dependency Issues

After the upgrade, the Failsafe Plugin behaved differently and no longer interoperated with Dependency Shading.

Because of that, we switched to the Surefire Plugin.

(3) Fixing all the issues introduced in the meantime

Naturally, a number of test instabilities had been introduced while failures went unnoticed, and these needed to be fixed.

(4) Yarn Tests and Test Scope Refactoring

In the meantime, a Pull Request was merged that moved the Yarn Tests to the test scope. Because the configuration searched for tests in the "main" scope, no Yarn tests were executed for a while, until the scope was fixed.

(5) Yarn Tests and JMX Metrics

After the Yarn Tests were re-activated, we saw them fail due to warnings created by the newly introduced metrics code. We could fix that by updating the metrics code and temporarily not registering JMX beans for all metrics.

(6) Yarn / Surefire Deadlock

Finally, some Yarn tests failed reliably in Maven (though not in the IDE). It turned out that those tests exercise a command line interface that interacts with the standard input stream.

The newly deployed Surefire Plugin uses standard input as well, for communication with forked JVMs. Since Surefire internally locks the standard input stream, the Yarn CLI cannot poll the standard input stream without locking up and stalling the tests.

We adjusted the tests, and now the build happily runs again.
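
To illustrate the kind of conflict described in (6), here is a minimal sketch of one way a command line interface can be decoupled from the process-wide standard input so that a test never touches the stream Surefire uses. This is not the actual change made to the Yarn tests; the class name PollingCli and its method pollCommand are hypothetical and exist only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical CLI loop that reads commands from an injected stream instead of System.in. */
public class PollingCli {
    private final InputStream in;

    public PollingCli(InputStream in) {
        this.in = in;
    }

    /**
     * Polls the stream without blocking.
     * Returns the next line as a command, or null if no input is currently available.
     */
    public String pollCommand() throws IOException {
        StringBuilder sb = new StringBuilder();
        // available() lets us check for input without waiting on the stream.
        while (in.available() > 0) {
            int c = in.read();
            if (c == '\n' || c < 0) {
                break;
            }
            sb.append((char) c);
        }
        return sb.length() == 0 ? null : sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // In a test, commands come from memory, so the forked JVM's System.in stays untouched.
        InputStream fake = new ByteArrayInputStream("stop\n".getBytes(StandardCharsets.UTF_8));
        PollingCli cli = new PollingCli(fake);
        System.out.println(cli.pollCommand());
        System.out.println(cli.pollCommand());
    }
}
```

In production the same class would be constructed with System.in; only the tests swap in an in-memory stream. Whether this fits a given CLI depends on how deeply the standard input stream is wired into it.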

-----------------
 Conclusions:
-----------------

- CI is terribly crucial.
  It took us weeks to deal with the fallout of a period of unreliable CI.

- Maven could do a better job.
  A bug as crucial as the one that started our problem should not occur in a test plugin like Surefire. Also, the constant change of semantics and dependency scopes is annoying. The semantic changes are subtle, but for a build as complex as Flink, they make a difference.

- File-based communication is rarely a good idea.
  The bug in the Failsafe Plugin was caused by improper file-based communication, as were some of the instabilities we discovered.

Greetings,
Stephan

PS: Some issues and mysteries remain for us to solve: When we allow our metrics subsystem to register JMX beans, we see some tests failing due to spontaneous JVM process kills. Whoever has a pointer there, please ping us!