Hi Stephan,

Yes, right. But KNNITSuite calls ExecutionEnvironment.getExecutionEnvironment
only once [1]. I’m testing a change that moves the getExecutionEnvironment
call into each test case.

[1]: 
https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/nn/KNNITSuite.scala#L45
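
In case it helps, here is a rough sketch of the change I’m testing (the real
suite has more setup and several tests; names only roughly follow the existing
KNNITSuite):

    import org.apache.flink.api.scala._
    import org.apache.flink.test.util.FlinkTestBase
    import org.scalatest.{FlatSpec, Matchers}

    class KNNITSuite extends FlatSpec with Matchers with FlinkTestBase {

      behavior of "The KNN join implementation"

      it should "calculate the k nearest neighbors correctly" in {
        // moved out of the class body, so every test case creates its own environment
        val env = ExecutionEnvironment.getExecutionEnvironment

        // ... build the training and test DataSets and run the KNN join here ...
      }
    }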

Regards,
Chiwan Park

> On May 31, 2016, at 7:09 PM, Stephan Ewen <se...@apache.org> wrote:
> 
> Hi Chiwan!
> 
> I think the ExecutionEnvironment is not shared, because what the
> TestEnvironment sets is a context environment factory. Every time you call
> "ExecutionEnvironment.getExecutionEnvironment()", you get a new environment.
> 
> Stephan
> 
> 
> On Tue, May 31, 2016 at 11:53 AM, Chiwan Park <chiwanp...@apache.org> wrote:
> 
>> I’ve created a JIRA issue [1] related to KNN test cases. I will send a PR
>> for it.
>> 
>> From my investigation [2], the cluster for the ML tests has only one TaskManager
>> with 4 slots. Is 2048 insufficient as the total number of network buffers? I
>> still think the problem is sharing the ExecutionEnvironment between test cases.
>> 
>> [1]: https://issues.apache.org/jira/browse/FLINK-3994
>> [2]:
>> https://github.com/apache/flink/blob/master/flink-test-utils/src/test/scala/org/apache/flink/test/util/FlinkTestBase.scala#L56
>> 
>> Regards,
>> Chiwan Park
>> 
>>> On May 31, 2016, at 6:05 PM, Maximilian Michels <m...@apache.org> wrote:
>>> 
>>> Thanks, Stephan, for the synopsis of the last weeks' test instability
>>> madness. It's sad to see the shortcomings of the Maven test plugins, but
>>> another lesson learned is that our testing infrastructure should get a
>>> bit more attention. We have reached a point several times where our
>>> tests were inherently unstable. Now we have seen that even more problems
>>> were hidden in the dark. I would like to see more maintenance
>>> dedicated to testing.
>>> 
>>> @Chiwan: Please, no hotfix! Please open a JIRA issue and a pull
>>> request with a systematic fix. Those things are too crucial to be
>>> fixed on the go. The problem is that Travis reports the number of
>>> processors as "32" (which is used as the number of task slots in
>>> local execution). The network buffers are not adjusted accordingly. We
>>> should set them correctly in the MiniCluster. Also, we could define an
>>> upper limit on the number of task slots for tests.
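>>> 
>>> Roughly what I have in mind, as a sketch (the exact config constants and the
>>> place where we create the testing cluster may differ):
>>> 
>>>     import org.apache.flink.configuration.{ConfigConstants, Configuration}
>>> 
>>>     val config = new Configuration()
>>>     // cap the slots for tests instead of using the core count Travis reports (32)
>>>     config.setInteger(ConfigConstants.TASK_MANAGER_NUM_TASK_SLOTS, 4)
>>>     // scale the network buffers with the slot count instead of keeping the default
>>>     config.setInteger(ConfigConstants.TASK_MANAGER_NETWORK_NUM_BUFFERS_KEY, 4 * 2048)
>>>     // pass this config to the MiniCluster when it is started for the tests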
>>> 
>>> On Tue, May 31, 2016 at 10:59 AM, Chiwan Park <chiwanp...@apache.org>
>> wrote:
>>>> I think that the tests fail because of sharing the ExecutionEnvironment
>> between test cases. I’m not sure why this is a problem, but it is the only
>> difference from the other ML tests.
>>>> 
>>>> I created a hotfix and pushed it to my repository. Once it looks fixed
>> [1], I’ll merge the hotfix into the master branch.
>>>> 
>>>> [1]: https://travis-ci.org/chiwanpark/flink/builds/134104491
>>>> 
>>>> Regards,
>>>> Chiwan Park
>>>> 
>>>>> On May 31, 2016, at 5:43 PM, Chiwan Park <chiwanp...@apache.org>
>> wrote:
>>>>> 
>>>>> It seems to be related to the KNN test case that was merged yesterday.
>> I’ll look into the ML test.
>>>>> 
>>>>> Regards,
>>>>> Chiwan Park
>>>>> 
>>>>>> On May 31, 2016, at 5:38 PM, Ufuk Celebi <u...@apache.org> wrote:
>>>>>> 
>>>>>> Currently, an ML test is reliably failing and occasionally some HA
>>>>>> tests. Is someone looking into the ML test?
>>>>>> 
>>>>>> For HA, I will revert a commit that might be causing the HA
>>>>>> instabilities. Till is working on a proper fix, as far as I know.
>>>>>> 
>>>>>> On Tue, May 31, 2016 at 3:50 AM, Chiwan Park <chiwanp...@apache.org>
>> wrote:
>>>>>>> Thanks for the great work! :-)
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Chiwan Park
>>>>>>> 
>>>>>>>> On May 31, 2016, at 7:47 AM, Flavio Pompermaier <
>> pomperma...@okkam.it> wrote:
>>>>>>>> 
>>>>>>>> Awesome work guys!
>>>>>>>> And even more thanks for the detailed report... This troubleshooting
>> summary
>>>>>>>> will undoubtedly be useful for all our Maven projects!
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Flavio
>>>>>>>> On 30 May 2016 23:47, "Ufuk Celebi" <u...@apache.org> wrote:
>>>>>>>> 
>>>>>>>>> Thanks for the effort, Max and Stephan! Happy to see the green
>> light again.
>>>>>>>>> 
>>>>>>>>> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <se...@apache.org>
>> wrote:
>>>>>>>>>> Hi all!
>>>>>>>>>> 
>>>>>>>>>> After a few weeks of terrible build issues, I am happy to
>> announce that
>>>>>>>>> the
>>>>>>>>>> build works again properly, and we actually get meaningful CI
>> results.
>>>>>>>>>> 
>>>>>>>>>> Here is a story in many acts, from builds deep red to bright
>> green joy.
>>>>>>>>>> Kudos to Max, who did most of this troubleshooting. This evening,
>> Max and
>>>>>>>>>> I debugged the final issue and got the build back on track.
>>>>>>>>>> 
>>>>>>>>>> ------------------
>>>>>>>>>> The Journey
>>>>>>>>>> ------------------
>>>>>>>>>> 
>>>>>>>>>> (1) Failsafe Plugin
>>>>>>>>>> 
>>>>>>>>>> The Maven Failsafe Plugin had a critical bug due to which
>> failed
>>>>>>>>>> tests did not result in a failed build.
>>>>>>>>>> 
>>>>>>>>>> That is a pretty bad bug for a plugin whose only task is to run
>> tests and
>>>>>>>>>> fail the build if a test fails.
>>>>>>>>>> 
>>>>>>>>>> After we recognized that, we upgraded the Failsafe Plugin.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> (2) Failsafe Plugin Dependency Issues
>>>>>>>>>> 
>>>>>>>>>> After the upgrade, the Failsafe Plugin behaved differently and
>> did not
>>>>>>>>>> interoperate with Dependency Shading any more.
>>>>>>>>>> 
>>>>>>>>>> Because of that, we switched to the Surefire Plugin.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> (3) Fixing all the issues introduced in the meantime
>>>>>>>>>> 
>>>>>>>>>> Naturally, a number of test instabilities had been introduced,
>> which
>>>>>>>>> needed
>>>>>>>>>> to be fixed.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> (4) Yarn Tests and Test Scope Refactoring
>>>>>>>>>> 
>>>>>>>>>> In the meantime, a Pull Request was merged that moved the Yarn
>> Tests to
>>>>>>>>> the
>>>>>>>>>> test scope.
>>>>>>>>>> Because the configuration searched for tests in the "main" scope,
>> no Yarn
>>>>>>>>>> tests were executed for a while, until the scope was fixed.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> (5) Yarn Tests and JMX Metrics
>>>>>>>>>> 
>>>>>>>>>> After the Yarn Tests were re-activated, we saw them fail due to
>> warnings
>>>>>>>>>> created by the newly introduced metrics code. We could fix that by
>>>>>>>>> updating
>>>>>>>>>> the metrics code and temporarily not registering JMX beans for all
>>>>>>>>> metrics.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> (6) Yarn / Surefire Deadlock
>>>>>>>>>> 
>>>>>>>>>> Finally, some Yarn tests failed reliably in Maven (though not in
>> the
>>>>>>>>> IDE).
>>>>>>>>>> It turned out that those tests exercise a command line interface that
>> interacts
>>>>>>>>> with
>>>>>>>>>> the standard input stream.
>>>>>>>>>> 
>>>>>>>>>> The newly deployed Surefire Plugin uses standard input as well,
>> for
>>>>>>>>>> communication with forked JVMs. Since Surefire internally locks
>> the
>>>>>>>>>> standard input stream, the Yarn CLI cannot poll the standard
>> input stream
>>>>>>>>>> without locking up and stalling the tests.
>>>>>>>>>> 
>>>>>>>>>> We adjusted the tests and now the build happily builds again.
>>>>>>>>>> 
>>>>>>>>>> -----------------
>>>>>>>>>> Conclusions:
>>>>>>>>>> -----------------
>>>>>>>>>> 
>>>>>>>>>> - CI is terribly crucial. It took us weeks to deal with the fallout of
>> having a
>>>>>>>>>> period of unreliable CI.
>>>>>>>>>> 
>>>>>>>>>> - Maven could do a better job. A bug as crucial as the one that
>> started
>>>>>>>>>> our problems should not occur in a test plugin like Surefire.
>> Also, the
>>>>>>>>>> constant change of semantics and dependency scopes is annoying.
>> The
>>>>>>>>>> semantic changes are subtle, but for a build as complex as Flink,
>> they
>>>>>>>>> make
>>>>>>>>>> a difference.
>>>>>>>>>> 
>>>>>>>>>> - File-based communication is rarely a good idea. The bug in the
>>>>>>>>> Failsafe
>>>>>>>>>> plugin was caused by improper file-based communication, as were some
>> of the
>>>>>>>>>> instabilities we discovered.
>>>>>>>>>> 
>>>>>>>>>> Greetings,
>>>>>>>>>> Stephan
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> PS: Some issues and mysteries remain for us to solve: When we
>> allow our
>>>>>>>>>> metrics subsystem to register JMX beans, we see some tests
>> failing due to
>>>>>>>>>> spontaneous JVM process kills. Whoever has a pointer there,
>> please ping
>>>>>>>>> us!
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> 
>> 
