Thanks for the effort, Max and Stephan! Happy to see the green light again.

On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <se...@apache.org> wrote:
> Hi all!
>
> After a few weeks of terrible build issues, I am happy to announce that the
> build works again properly, and we actually get meaningful CI results.
>
> Here is a story in many acts, from builds deep red to bright green joy.
> Kudos to Max, who did most of this troubleshooting. This evening, Max and
> I debugged the final issue and got the build back on track.
>
> ------------------
> The Journey
> ------------------
>
> (1) Failsafe Plugin
>
> The Maven Failsafe Build Plugin had a critical bug due to which failed
> tests did not result in a failed build.
>
> That is a pretty bad bug for a plugin whose only task is to run tests and
> fail the build if a test fails.
>
> After we recognized that, we upgraded the Failsafe Plugin.
>
>
> (2) Failsafe Plugin Dependency Issues
>
> After the upgrade, the Failsafe Plugin behaved differently and did not
> interoperate with Dependency Shading any more.
>
> Because of that, we switched to the Surefire Plugin.
>
>
> (3) Fixing all the issues introduced in the meantime
>
> Naturally, a number of test instabilities had been introduced, which needed
> to be fixed.
>
>
> (4) Yarn Tests and Test Scope Refactoring
>
> In the meantime, a Pull Request was merged that moved the Yarn Tests to the
> test scope.
> Because the configuration searched for tests in the "main" scope, no Yarn
> tests were executed for a while, until the scope was fixed.
>
>
> (5) Yarn Tests and JMX Metrics
>
> After the Yarn Tests were re-activated, we saw them fail due to warnings
> created by the newly introduced metrics code. We fixed that by updating
> the metrics code and temporarily not registering JMX beans for all metrics.
>
>
> (6) Yarn / Surefire Deadlock
>
> Finally, some Yarn tests failed reliably in Maven (though not in the IDE).
> It turned out that those tests exercise a command line interface that
> interacts with the standard input stream.
>
> The newly deployed Surefire Plugin uses standard input as well, for
> communication with forked JVMs. Since Surefire internally locks the
> standard input stream, the Yarn CLI cannot poll the standard input stream
> without locking up and stalling the tests.
>
> We adjusted the tests and now the build happily builds again.
>
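
To illustrate the stdin contention described in (6), here is a minimal Java sketch, assuming a CLI loop that polls an input stream for commands. `StdinPollingSketch` and `pollForCommand` are hypothetical names, not Flink's actual Yarn CLI code, and the thread does not say how the tests were adjusted. The sketch only shows one common way to keep such a loop off `System.in` under test: pass the stream in as a parameter, so a test can supply an in-memory stream instead of competing with Surefire for the real standard input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StdinPollingSketch {

    /**
     * Hypothetical stand-in for a CLI loop that polls a stream for one
     * newline-terminated command. Returns null if no complete command
     * arrives within maxPolls polling rounds.
     */
    public static String pollForCommand(InputStream in, int maxPolls, long sleepMillis)
            throws IOException, InterruptedException {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < maxPolls; i++) {
            // available() reports how many bytes can be read without blocking,
            // so the loop only reads when data is already there.
            while (in.available() > 0) {
                int b = in.read();
                if (b == '\n') {
                    return line.toString().trim();
                }
                line.append((char) b);
            }
            Thread.sleep(sleepMillis);
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // In a test, an in-memory stream replaces System.in entirely.
        InputStream fake =
                new ByteArrayInputStream("stop\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(pollForCommand(fake, 10, 1L)); // prints "stop"
    }
}
```

In production code the caller would pass `System.in`. The point of the parameter is that the test JVM never has to touch the stream that Surefire uses to talk to its forked processes.
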
> -----------------
> Conclusions:
> -----------------
>
>   - CI is terribly crucial. It took us weeks to deal with the fallout of
> having a period of unreliable CI.
>
>   - Maven could do a better job. A bug as crucial as the one that started
> our problem should not occur in a test plugin like Failsafe. Also, the
> constant change of semantics and dependency scopes is annoying. The
> semantic changes are subtle, but for a build as complex as Flink, they make
> a difference.
>
>   - File-based communication is rarely a good idea. The bug in the failsafe
> plugin was caused by improper file-based communication, and some of our
> discovered instabilities as well.
>
> Greetings,
> Stephan
>
>
> PS: Some issues and mysteries remain for us to solve: When we allow our
> metrics subsystem to register JMX beans, we see some tests failing due to
> spontaneous JVM process kills. Whoever has a pointer there, please ping us!
