I think the tests fail because an ExecutionEnvironment is shared between 
test cases. I’m not sure why that is a problem, but it is the only difference 
from the other ML tests.
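
To illustrate the kind of sharing I mean, here is a minimal sketch. It is not a diagnosis of the actual failure: FakeEnvironment is a hypothetical stand-in, not Flink's ExecutionEnvironment, and all names below are made up for illustration. It only shows how state registered by one test case can remain visible to the next one when the environment object is reused.

```java
import java.util.ArrayList;
import java.util.List;

public class SharedEnvironmentSketch {

    // Hypothetical stand-in for an execution environment (NOT Flink's class):
    // it simply remembers every sink that has been registered on it.
    static final class FakeEnvironment {
        private final List<String> pendingSinks = new ArrayList<>();

        void addSink(String name) {
            pendingSinks.add(name);
        }

        // Returns a copy of the sinks currently registered on this environment.
        List<String> plannedSinks() {
            return new ArrayList<>(pendingSinks);
        }
    }

    // Pattern A: one environment reused by every test case.
    // Each test case sees the sinks left behind by the previous ones.
    static List<Integer> sinksPerTestShared(int testCases) {
        FakeEnvironment shared = new FakeEnvironment();
        List<Integer> counts = new ArrayList<>();
        for (int i = 0; i < testCases; i++) {
            shared.addSink("test-" + i);
            counts.add(shared.plannedSinks().size());
        }
        return counts;
    }

    // Pattern B: a fresh environment per test case.
    // Each test case only sees the sink it registered itself.
    static List<Integer> sinksPerTestIsolated(int testCases) {
        List<Integer> counts = new ArrayList<>();
        for (int i = 0; i < testCases; i++) {
            FakeEnvironment fresh = new FakeEnvironment();
            fresh.addSink("test-" + i);
            counts.add(fresh.plannedSinks().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("shared:   " + sinksPerTestShared(3));   // [1, 2, 3]
        System.out.println("isolated: " + sinksPerTestIsolated(3)); // [1, 1, 1]
    }
}
```

With the shared object, later test cases run against leftovers from earlier ones. Creating the environment per test case avoids that, which matches how the other ML tests are set up.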

I created a hotfix and pushed it to my repository. Once it looks fixed [1], 
I’ll merge the hotfix into the master branch.

[1]: https://travis-ci.org/chiwanpark/flink/builds/134104491
 
Regards,
Chiwan Park

> On May 31, 2016, at 5:43 PM, Chiwan Park <chiwanp...@apache.org> wrote:
> 
> It may be related to the KNN test case that was merged yesterday. I’ll look 
> into the ML test.
> 
> Regards,
> Chiwan Park
> 
>> On May 31, 2016, at 5:38 PM, Ufuk Celebi <u...@apache.org> wrote:
>> 
>> Currently, an ML test is reliably failing and occasionally some HA
>> tests. Is someone looking into the ML test?
>> 
>> For HA, I will revert a commit, which might cause the HA
>> instabilities. Till is working on a proper fix as far as I know.
>> 
>> On Tue, May 31, 2016 at 3:50 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>>> Thanks for the great work! :-)
>>> 
>>> Regards,
>>> Chiwan Park
>>> 
>>>> On May 31, 2016, at 7:47 AM, Flavio Pompermaier <pomperma...@okkam.it> 
>>>> wrote:
>>>> 
>>>> Awesome work guys!
>>>> And even more thanks for the detailed report... This troubleshooting summary
>>>> will undoubtedly be useful for all our Maven projects!
>>>> 
>>>> Best,
>>>> Flavio
>>>> On 30 May 2016 23:47, "Ufuk Celebi" <u...@apache.org> wrote:
>>>> 
>>>>> Thanks for the effort, Max and Stephan! Happy to see the green light 
>>>>> again.
>>>>> 
>>>>> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <se...@apache.org> wrote:
>>>>>> Hi all!
>>>>>> 
>>>>>> After a few weeks of terrible build issues, I am happy to announce that the
>>>>>> build works again properly, and we actually get meaningful CI results.
>>>>>> 
>>>>>> Here is a story in many acts, from builds deep red to bright green joy.
>>>>>> Kudos to Max, who did most of this troubleshooting. This evening, Max and
>>>>>> I debugged the final issue and got the build back on track.
>>>>>> 
>>>>>> ------------------
>>>>>> The Journey
>>>>>> ------------------
>>>>>> 
>>>>>> (1) Failsafe Plugin
>>>>>> 
>>>>>> The Maven Failsafe Build Plugin had a critical bug due to which failed
>>>>>> tests did not result in a failed build.
>>>>>> 
>>>>>> That is a pretty bad bug for a plugin whose only task is to run tests and
>>>>>> fail the build if a test fails.
>>>>>> 
>>>>>> After we recognized that, we upgraded the Failsafe Plugin.
>>>>>> 
>>>>>> 
>>>>>> (2) Failsafe Plugin Dependency Issues
>>>>>> 
>>>>>> After the upgrade, the Failsafe Plugin behaved differently and did not
>>>>>> interoperate with Dependency Shading any more.
>>>>>> 
>>>>>> Because of that, we switched to the Surefire Plugin.
>>>>>> 
>>>>>> 
>>>>>> (3) Fixing all the issues introduced in the meantime
>>>>>> 
>>>>>> Naturally, a number of test instabilities had been introduced, which needed
>>>>>> to be fixed.
>>>>>> 
>>>>>> 
>>>>>> (4) Yarn Tests and Test Scope Refactoring
>>>>>> 
>>>>>> In the meantime, a Pull Request was merged that moved the Yarn Tests to the
>>>>>> test scope.
>>>>>> Because the configuration searched for tests in the "main" scope, no Yarn
>>>>>> tests were executed for a while, until the scope was fixed.
>>>>>> 
>>>>>> 
>>>>>> (5) Yarn Tests and JMX Metrics
>>>>>> 
>>>>>> After the Yarn Tests were re-activated, we saw them fail due to warnings
>>>>>> created by the newly introduced metrics code. We could fix that by updating
>>>>>> the metrics code and temporarily not registering JMX beans for all metrics.
>>>>>> 
>>>>>> 
>>>>>> (6) Yarn / Surefire Deadlock
>>>>>> 
>>>>>> Finally, some Yarn tests failed reliably in Maven (though not in the IDE).
>>>>>> It turned out that those tests exercise a command line interface that
>>>>>> interacts with the standard input stream.
>>>>>> 
>>>>>> The newly deployed Surefire Plugin uses standard input as well, for
>>>>>> communication with forked JVMs. Since Surefire internally locks the
>>>>>> standard input stream, the Yarn CLI cannot poll the standard input stream
>>>>>> without locking up and stalling the tests.
>>>>>> 
>>>>>> We adjusted the tests and now the build happily builds again.
>>>>>> 
>>>>>> -----------------
>>>>>> Conclusions:
>>>>>> -----------------
>>>>>> 
>>>>>> - CI is terribly crucial. It took us weeks to deal with the fallout of a
>>>>>> period of unreliable CI.
>>>>>> 
>>>>>> - Maven could do a better job. A bug as crucial as the one that started
>>>>>> our problem should not occur in a test plugin like surefire. Also, the
>>>>>> constant change of semantics and dependency scopes is annoying. The
>>>>>> semantic changes are subtle, but for a build as complex as Flink, they make
>>>>>> a difference.
>>>>>> 
>>>>>> - File-based communication is rarely a good idea. The bug in the failsafe
>>>>>> plugin was caused by improper file-based communication, as were some of the
>>>>>> instabilities we discovered.
>>>>>> 
>>>>>> Greetings,
>>>>>> Stephan
>>>>>> 
>>>>>> 
>>>>>> PS: Some issues and mysteries remain for us to solve: When we allow our
>>>>>> metrics subsystem to register JMX beans, we see some tests failing due to
>>>>>> spontaneous JVM process kills. Whoever has a pointer there, please ping us!
>>>>> 
>>> 
> 
