Re: [ANNOUNCE] Build Issues Solved

Chiwan Park Tue, 31 May 2016 02:54:01 -0700

I’ve created a JIRA issue [1] related to KNN test cases. I will send a PR for 
it.


From my investigation [2], the cluster for the ML tests has only one taskmanager
with 4 slots. Is 2048 insufficient for the total number of network buffers? I
still think the problem is sharing the ExecutionEnvironment between test cases.
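
As a rough sanity check on the slot and buffer numbers, here is a minimal
sketch. It assumes the rule of thumb from Flink's configuration guide of this
period (slots per taskmanager squared, times the number of taskmanagers, times
4). The function names are hypothetical and not part of Flink, and the real
requirement depends on the shuffles a job actually runs.

```python
# Assumed rule of thumb (not an exact requirement):
#   buffers ~= slots_per_tm^2 * num_tms * 4
DEFAULT_NETWORK_BUFFERS = 2048  # the value discussed in this thread


def estimate_network_buffers(slots_per_tm: int, num_tms: int) -> int:
    """Rough estimate of the network buffers a cluster may need."""
    return slots_per_tm ** 2 * num_tms * 4


def is_sufficient(slots_per_tm: int, num_tms: int,
                  configured: int = DEFAULT_NETWORK_BUFFERS) -> bool:
    """True if the configured buffer count covers the rough estimate."""
    return estimate_network_buffers(slots_per_tm, num_tms) <= configured


if __name__ == "__main__":
    # 4 slots: the ML test cluster; 32 slots: what Travis would imply
    # if the reported processor count were used as the slot count.
    for slots in (4, 32):
        needed = estimate_network_buffers(slots, 1)
        print(slots, needed, is_sufficient(slots, 1))
```

Under this estimate, one taskmanager with 4 slots stays far below 2048 buffers,
while 32 slots would exceed it. That matches the concern about slot counts
derived from the reported number of processors, but it is only a heuristic.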

[1]: https://issues.apache.org/jira/browse/FLINK-3994
[2]: https://github.com/apache/flink/blob/master/flink-test-utils/src/test/scala/org/apache/flink/test/util/FlinkTestBase.scala#L56

Regards,
Chiwan Park

> On May 31, 2016, at 6:05 PM, Maximilian Michels <m...@apache.org> wrote:
> 
> Thanks Stephan for the synopsis of our last weeks' test instability
> madness. It's sad to see the shortcomings of Maven test plugins but
> another lesson learned is that our testing infrastructure should get a
> bit more attention. We have reached a point several times where our
> tests were inherently unstable. Now we saw that even more problems
> were hidden in the dark. I would like to see more maintenance
> dedicated to testing.
> 
> @Chiwan: Please, no hotfix! Please open a JIRA issue and a pull
> request with a systematic fix. Those things are too crucial to be
> fixed on the go. The problem is that Travis reports the number of
> processors to be "32" (which is used for the number of task slots in
> local execution). The network buffers are not adjusted accordingly. We
> should set them correctly in the MiniCluster. Also, we could define an
> upper limit to the number of task slots for tests.
> 
> On Tue, May 31, 2016 at 10:59 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>> I think that the tests fail because of sharing the ExecutionEnvironment between
>> test cases. I’m not sure why it is a problem, but it is the only difference
>> from the other ML tests.
>> 
>> I created a hotfix and pushed it to my repository. When it seems fixed [1], 
>> I’ll merge the hotfix to master branch.
>> 
>> [1]: https://travis-ci.org/chiwanpark/flink/builds/134104491
>> 
>> Regards,
>> Chiwan Park
>> 
>>> On May 31, 2016, at 5:43 PM, Chiwan Park <chiwanp...@apache.org> wrote:
>>> 
>>> Maybe it is about the KNN test case which was merged yesterday. I’ll
>>> look into the ML tests.
>>> 
>>> Regards,
>>> Chiwan Park
>>> 
>>>> On May 31, 2016, at 5:38 PM, Ufuk Celebi <u...@apache.org> wrote:
>>>> 
>>>> Currently, an ML test is reliably failing and occasionally some HA
>>>> tests. Is someone looking into the ML test?
>>>> 
>>>> For HA, I will revert a commit, which might cause the HA
>>>> instabilities. Till is working on a proper fix as far as I know.
>>>> 
>>>> On Tue, May 31, 2016 at 3:50 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>>>>> Thanks for the great work! :-)
>>>>> 
>>>>> Regards,
>>>>> Chiwan Park
>>>>> 
>>>>>> On May 31, 2016, at 7:47 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>>> 
>>>>>> Awesome work guys!
>>>>>> And even more thanks for the detailed report... This troubleshooting
>>>>>> summary will undoubtedly be useful for all our Maven projects!
>>>>>> 
>>>>>> Best,
>>>>>> Flavio
>>>>>> On 30 May 2016 23:47, "Ufuk Celebi" <u...@apache.org> wrote:
>>>>>> 
>>>>>>> Thanks for the effort, Max and Stephan! Happy to see the green light 
>>>>>>> again.
>>>>>>> 
>>>>>>> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <se...@apache.org> wrote:
>>>>>>>> Hi all!
>>>>>>>> 
>>>>>>>> After a few weeks of terrible build issues, I am happy to announce that
>>>>>>>> the build works again properly, and we actually get meaningful CI results.
>>>>>>>> 
>>>>>>>> Here is a story in many acts, from builds deep red to bright green joy.
>>>>>>>> Kudos to Max, who did most of this troubleshooting. This evening, Max
>>>>>>>> and I debugged the final issue and got the build back on track.
>>>>>>>> 
>>>>>>>> ------------------
>>>>>>>> The Journey
>>>>>>>> ------------------
>>>>>>>> 
>>>>>>>> (1) Failsafe Plugin
>>>>>>>> 
>>>>>>>> The Maven Failsafe Build Plugin had a critical bug due to which failed
>>>>>>>> tests did not result in a failed build.
>>>>>>>> 
>>>>>>>> That is a pretty bad bug for a plugin whose only task is to run tests 
>>>>>>>> and
>>>>>>>> fail the build if a test fails.
>>>>>>>> 
>>>>>>>> After we recognized that, we upgraded the Failsafe Plugin.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> (2) Failsafe Plugin Dependency Issues
>>>>>>>> 
>>>>>>>> After the upgrade, the Failsafe Plugin behaved differently and did not
>>>>>>>> interoperate with Dependency Shading any more.
>>>>>>>> 
>>>>>>>> Because of that, we switched to the Surefire Plugin.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> (3) Fixing all the issues introduced in the meantime
>>>>>>>> 
>>>>>>>> Naturally, a number of test instabilities had been introduced, which
>>>>>>>> needed to be fixed.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> (4) Yarn Tests and Test Scope Refactoring
>>>>>>>> 
>>>>>>>> In the meantime, a Pull Request was merged that moved the Yarn Tests to
>>>>>>>> the test scope.
>>>>>>>> Because the configuration searched for tests in the "main" scope, no
>>>>>>>> Yarn tests were executed for a while, until the scope was fixed.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> (5) Yarn Tests and JMX Metrics
>>>>>>>> 
>>>>>>>> After the Yarn Tests were re-activated, we saw them fail due to
>>>>>>>> warnings created by the newly introduced metrics code. We could fix that
>>>>>>>> by updating the metrics code and temporarily not registering JMX beans
>>>>>>>> for all metrics.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> (6) Yarn / Surefire Deadlock
>>>>>>>> 
>>>>>>>> Finally, some Yarn tests failed reliably in Maven (though not in the
>>>>>>>> IDE). It turned out that those tests exercise a command line interface
>>>>>>>> that interacts with the standard input stream.
>>>>>>>> 
>>>>>>>> The newly deployed Surefire Plugin uses standard input as well, for
>>>>>>>> communication with forked JVMs. Since Surefire internally locks the
>>>>>>>> standard input stream, the Yarn CLI cannot poll the standard input 
>>>>>>>> stream
>>>>>>>> without locking up and stalling the tests.
>>>>>>>> 
>>>>>>>> We adjusted the tests and now the build happily builds again.
>>>>>>>> 
>>>>>>>> -----------------
>>>>>>>> Conclusions:
>>>>>>>> -----------------
>>>>>>>> 
>>>>>>>> - CI is terribly crucial. It took us weeks to deal with the fallout of
>>>>>>>> having a period of unreliable CI.
>>>>>>>> 
>>>>>>>> - Maven could do a better job. A bug as crucial as the one that started
>>>>>>>> our problem should not occur in a test plugin like surefire. Also, the
>>>>>>>> constant change of semantics and dependency scopes is annoying. The
>>>>>>>> semantic changes are subtle, but for a build as complex as Flink, they
>>>>>>>> make a difference.
>>>>>>>> 
>>>>>>>> - File-based communication is rarely a good idea. The bug in the
>>>>>>>> failsafe plugin was caused by improper file-based communication, and
>>>>>>>> some of our discovered instabilities as well.
>>>>>>>> 
>>>>>>>> Greetings,
>>>>>>>> Stephan
>>>>>>>> 
>>>>>>>> 
>>>>>>>> PS: Some issues and mysteries remain for us to solve: When we allow our
>>>>>>>> metrics subsystem to register JMX beans, we see some tests failing due
>>>>>>>> to spontaneous JVM process kills. Whoever has a pointer there, please
>>>>>>>> ping us!
>>>>>>> 
>>>>> 
>>> 
>>
