Hi all!

After a few weeks of terrible build issues, I am happy to announce that the
build works properly again, and we actually get meaningful CI results.

Here is a story in many acts, from builds deep red to bright green joy.
Kudos to Max, who did most of this troubleshooting. This evening, Max and
I debugged the final issue and got the build back on track.

------------------
The Journey
------------------

(1) Failsafe Plugin

The Maven Failsafe Build Plugin had a critical bug due to which failed
tests did not result in a failed build.

That is a pretty bad bug for a plugin whose only task is to run tests and
fail the build if a test fails.

After we recognized that, we upgraded the Failsafe Plugin.


(2) Failsafe Plugin Dependency Issues

After the upgrade, the Failsafe Plugin behaved differently and did not
interoperate with Dependency Shading any more.

Because of that, we switched to the Surefire Plugin.


(3) Fixing all the issues introduced in the meantime

Naturally, while failed tests were not failing the build, a number of test
instabilities had been introduced unnoticed, which needed to be fixed.


(4) Yarn Tests and Test Scope Refactoring

In the meantime, a Pull Request was merged that moved the Yarn Tests to the
test scope.
Because the configuration searched for tests in the "main" scope, no Yarn
tests were executed for a while, until the scope was fixed.


(5) Yarn Tests and JMX Metrics

After the Yarn Tests were re-activated, we saw them fail due to warnings
created by the newly introduced metrics code. We could fix that by updating
the metrics code and temporarily not registering JMX beans for all metrics.
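
For anyone who wants to experiment with this, here is a minimal sketch of
what such a switch could look like. It is not the actual Flink metrics code:
the class JmxToggleSketch, the CounterView interface, and the
"sketch.metrics" JMX domain are all made up for illustration. It only shows
the idea of keeping the metric objects but skipping the JMX registration
when a flag is off.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

public class JmxToggleSketch {

    /** Hypothetical management interface for a single counter metric. */
    public interface CounterView {
        long getCount();
    }

    /** Hypothetical metric implementation. */
    public static class Counter implements CounterView {
        private long count;

        public void inc() {
            count++;
        }

        @Override
        public long getCount() {
            return count;
        }
    }

    private final boolean jmxEnabled;
    private final MBeanServer server = ManagementFactory.getPlatformMBeanServer();

    public JmxToggleSketch(boolean jmxEnabled) {
        this.jmxEnabled = jmxEnabled;
    }

    /** Registers the counter as a JMX bean only if enabled; returns whether it was registered. */
    public boolean register(String name, Counter counter) throws JMException {
        if (!jmxEnabled) {
            // The metric still exists and can be updated, it is just not exposed via JMX.
            return false;
        }
        ObjectName objectName = new ObjectName("sketch.metrics", "name", name);
        // StandardMBean lets us pass the management interface explicitly.
        server.registerMBean(new StandardMBean(counter, CounterView.class), objectName);
        return true;
    }

    public static void main(String[] args) throws JMException {
        Counter counter = new Counter();
        counter.inc();
        System.out.println("registered with JMX off: "
                + new JmxToggleSketch(false).register("numRecords", counter));
        System.out.println("registered with JMX on:  "
                + new JmxToggleSketch(true).register("numRecords", counter));
    }
}
```

With the flag off, the counter keeps counting and nothing touches the
platform MBean server.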


(6) Yarn / Surefire Deadlock

Finally, some Yarn tests failed reliably in Maven (though not in the IDE).
It turned out that those tests exercise a command line interface that
interacts with the standard input stream.

The newly deployed Surefire Plugin uses standard input as well, for
communication with forked JVMs. Since Surefire internally locks the
standard input stream, the Yarn CLI cannot poll the standard input stream
without locking up and stalling the tests.

We adjusted the tests and now the build happily builds again.
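
For readers who run into the same deadlock: below is a minimal sketch of one
way to keep a CLI test away from the real standard input, assuming the CLI
can take its input stream as a parameter. This is an illustration only and
not necessarily the change we made; the class CliInputSketch, the method
runCli, and the "stop" command are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class CliInputSketch {

    /**
     * Hypothetical CLI loop: reads one command per line from the given stream
     * until it sees "stop" or the stream ends, and returns the number of
     * non-empty commands it read before that.
     */
    public static int runCli(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        int commands = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            String command = line.trim();
            if (command.equals("stop")) {
                break;
            }
            if (!command.isEmpty()) {
                commands++;
            }
        }
        return commands;
    }

    public static void main(String[] args) throws IOException {
        // In production the caller would pass System.in. In a test we pass an
        // in-memory stream, so the stream used by Surefire is never touched.
        InputStream scripted = new ByteArrayInputStream(
                "help\nstatus\nstop\n".getBytes(StandardCharsets.UTF_8));
        System.out.println("commands handled: " + runCli(scripted));
    }
}
```

The point of the design is that the test never polls System.in, so it cannot
contend with the forked-JVM communication described above.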

-----------------
Conclusions:
-----------------

  - CI is terribly crucial. It took us weeks to deal with the fallout of a
period of unreliable CI.

  - Maven could do a better job. A bug as crucial as the one that started
our problem should not occur in a test plugin like Surefire. Also, the
constant change of semantics and dependency scopes is annoying. The
semantic changes are subtle, but for a build as complex as Flink, they make
a difference.

  - File-based communication is rarely a good idea. The bug in the Failsafe
Plugin was caused by improper file-based communication, and so were some of
the instabilities we discovered.

Greetings,
Stephan


PS: Some issues and mysteries remain for us to solve: When we allow our
metrics subsystem to register JMX beans, we see some tests failing due to
spontaneous JVM process kills. Whoever has a pointer there, please ping us!
