Hi Stephan,

Yes, right. But KNNITSuite calls ExecutionEnvironment.getExecutionEnvironment only once [1]. I'm testing a change that moves the getExecutionEnvironment call into each test case.

[1]: https://github.com/apache/flink/blob/master/flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/nn/KNNITSuite.scala#L45
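For reference, a minimal sketch of what I mean (the test names and bodies here are illustrative, not the actual suite):

    import org.apache.flink.api.scala._
    import org.scalatest.{FlatSpec, Matchers}

    class KNNITSuite extends FlatSpec with Matchers {

      behavior of "the KNN join"

      // Before: a single environment, created once and shared by every test case.
      // val env = ExecutionEnvironment.getExecutionEnvironment

      it should "return the k nearest neighbors" in {
        // After: each test case asks for its own environment, so the context
        // factory installed by the test base hands out a fresh instance.
        val env = ExecutionEnvironment.getExecutionEnvironment
        // ... build the kNN job on `env` and verify the result ...
      }

      it should "reject an invalid k" in {
        val env = ExecutionEnvironment.getExecutionEnvironment
        // ... build the failing case on `env` and expect an exception ...
      }
    }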
Regards,
Chiwan Park

> On May 31, 2016, at 7:09 PM, Stephan Ewen <se...@apache.org> wrote:
>
> Hi Chiwan!
>
> I think the ExecutionEnvironment is not shared, because what the
> TestEnvironment sets is a context environment factory. Every time you call
> "ExecutionEnvironment.getExecutionEnvironment()", you get a new environment.
>
> Stephan
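A simplified sketch of the factory mechanism Stephan describes (stand-in types, not the actual Flink classes):

    // Stand-in types, not the real Flink classes: the test base installs a
    // factory as the "context", and every call to getExecutionEnvironment
    // asks that factory for a brand-new environment instead of reusing one.
    class Env

    object Env {
      private var contextFactory: Option[() => Env] = None

      // Roughly what the TestEnvironment does when a test suite starts.
      def setAsContext(factory: () => Env): Unit = {
        contextFactory = Some(factory)
      }

      def getExecutionEnvironment: Env = contextFactory match {
        case Some(factory) => factory() // a fresh environment on every call
        case None          => new Env   // default: plain local environment
      }
    }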
> On Tue, May 31, 2016 at 11:53 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>
>> I've created a JIRA issue [1] related to the KNN test cases. I will send
>> a PR for it.
>>
>> From my investigation [2], the cluster for the ML tests has only one
>> TaskManager with 4 slots. Is 2048 insufficient as the total number of
>> network buffers? I still think the problem is sharing the
>> ExecutionEnvironment between test cases.
>>
>> [1]: https://issues.apache.org/jira/browse/FLINK-3994
>> [2]: https://github.com/apache/flink/blob/master/flink-test-utils/src/test/scala/org/apache/flink/test/util/FlinkTestBase.scala#L56
>>
>> Regards,
>> Chiwan Park
>>
>>> On May 31, 2016, at 6:05 PM, Maximilian Michels <m...@apache.org> wrote:
>>>
>>> Thanks Stephan for the synopsis of the last weeks' test instability
>>> madness. It's sad to see the shortcomings of the Maven test plugins,
>>> but another lesson learned is that our testing infrastructure should
>>> get a bit more attention. We have reached a point several times where
>>> our tests were inherently unstable. Now we saw that even more problems
>>> were hidden in the dark. I would like to see more maintenance
>>> dedicated to testing.
>>>
>>> @Chiwan: Please, no hotfix! Please open a JIRA issue and a pull
>>> request with a systematic fix. Those things are too crucial to be
>>> fixed on the go. The problem is that Travis reports the number of
>>> processors to be "32" (which is used for the number of task slots in
>>> local execution). The network buffers are not adjusted accordingly. We
>>> should set them correctly in the MiniCluster. Also, we could define an
>>> upper limit on the number of task slots for tests.
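To make Max's point concrete, here is a sketch of the kind of adjustment he suggests (the constants are the ConfigConstants names; the actual wiring in the MiniCluster and the test base may differ):

    import org.apache.flink.configuration.{ConfigConstants, Configuration}

    object TestClusterConfig {
      // Sketch only: derive the network buffer count from the slot count
      // instead of leaving it at a fixed value such as 2048.
      def create(): Configuration = {
        val numTaskManagers = 1
        // Travis reports 32 processors, which becomes the slot count.
        val slotsPerTaskManager = Runtime.getRuntime.availableProcessors()

        val config = new Configuration()
        config.setInteger(
          ConfigConstants.TASK_MANAGER_NUM_TASK_SLOTS, slotsPerTaskManager)
        // Rule of thumb from the Flink docs: #slots-per-TM^2 * #TMs * 4.
        config.setInteger(
          ConfigConstants.TASK_MANAGER_NETWORK_NUM_BUFFERS_KEY,
          slotsPerTaskManager * slotsPerTaskManager * numTaskManagers * 4)
        config
      }
    }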
>>> On Tue, May 31, 2016 at 10:59 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>>>> I think that the tests fail because of sharing the ExecutionEnvironment
>>>> between test cases. I'm not sure why it is a problem, but it is the
>>>> only difference from the other ML tests.
>>>>
>>>> I created a hotfix and pushed it to my repository. If it fixes the
>>>> problem [1], I'll merge the hotfix into the master branch.
>>>>
>>>> [1]: https://travis-ci.org/chiwanpark/flink/builds/134104491
>>>>
>>>> Regards,
>>>> Chiwan Park
>>>>
>>>>> On May 31, 2016, at 5:43 PM, Chiwan Park <chiwanp...@apache.org> wrote:
>>>>>
>>>>> It is probably the KNN test case which was merged yesterday. I'll
>>>>> look into the ML test.
>>>>>
>>>>> Regards,
>>>>> Chiwan Park
>>>>>
>>>>>> On May 31, 2016, at 5:38 PM, Ufuk Celebi <u...@apache.org> wrote:
>>>>>>
>>>>>> Currently, an ML test is reliably failing and occasionally some HA
>>>>>> tests. Is someone looking into the ML test?
>>>>>>
>>>>>> For HA, I will revert a commit, which might cause the HA
>>>>>> instabilities. Till is working on a proper fix as far as I know.
>>>>>>
>>>>>> On Tue, May 31, 2016 at 3:50 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>>>>>>> Thanks for the great work! :-)
>>>>>>>
>>>>>>> Regards,
>>>>>>> Chiwan Park
>>>>>>>
>>>>>>>> On May 31, 2016, at 7:47 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>>>>>
>>>>>>>> Awesome work guys!
>>>>>>>> And even more thanks for the detailed report... This
>>>>>>>> troubleshooting summary will undoubtedly be useful for all our
>>>>>>>> Maven projects!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Flavio
>>>>>>>>
>>>>>>>> On 30 May 2016 23:47, "Ufuk Celebi" <u...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the effort, Max and Stephan! Happy to see the green
>>>>>>>>> light again.
>>>>>>>>>
>>>>>>>>> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <se...@apache.org> wrote:
>>>>>>>>>> Hi all!
>>>>>>>>>>
>>>>>>>>>> After a few weeks of terrible build issues, I am happy to
>>>>>>>>>> announce that the build works properly again, and we actually
>>>>>>>>>> get meaningful CI results.
>>>>>>>>>>
>>>>>>>>>> Here is a story in many acts, from builds deep red to bright
>>>>>>>>>> green joy. Kudos to Max, who did most of this troubleshooting.
>>>>>>>>>> This evening, Max and I debugged the final issue and got the
>>>>>>>>>> build back on track.
>>>>>>>>>>
>>>>>>>>>> ------------------
>>>>>>>>>> The Journey
>>>>>>>>>> ------------------
>>>>>>>>>>
>>>>>>>>>> (1) Failsafe Plugin
>>>>>>>>>>
>>>>>>>>>> The Maven Failsafe plugin had a critical bug due to which failed
>>>>>>>>>> tests did not result in a failed build.
>>>>>>>>>>
>>>>>>>>>> That is a pretty bad bug for a plugin whose only task is to run
>>>>>>>>>> tests and fail the build if a test fails.
>>>>>>>>>>
>>>>>>>>>> After we recognized that, we upgraded the Failsafe plugin.
>>>>>>>>>>
>>>>>>>>>> (2) Failsafe Plugin Dependency Issues
>>>>>>>>>>
>>>>>>>>>> After the upgrade, the Failsafe plugin behaved differently and
>>>>>>>>>> no longer interoperated with dependency shading.
>>>>>>>>>>
>>>>>>>>>> Because of that, we switched to the Surefire plugin.
>>>>>>>>>>
>>>>>>>>>> (3) Fixing All the Issues Introduced in the Meantime
>>>>>>>>>>
>>>>>>>>>> Naturally, a number of test instabilities had been introduced in
>>>>>>>>>> the meantime, which needed to be fixed.
>>>>>>>>>>
>>>>>>>>>> (4) Yarn Tests and Test Scope Refactoring
>>>>>>>>>>
>>>>>>>>>> In the meantime, a pull request was merged that moved the Yarn
>>>>>>>>>> tests to the test scope. Because the configuration searched for
>>>>>>>>>> tests in the "main" scope, no Yarn tests were executed for a
>>>>>>>>>> while, until the scope was fixed.
>>>>>>>>>>
>>>>>>>>>> (5) Yarn Tests and JMX Metrics
>>>>>>>>>>
>>>>>>>>>> After the Yarn tests were re-activated, we saw them fail due to
>>>>>>>>>> warnings created by the newly introduced metrics code. We could
>>>>>>>>>> fix that by updating the metrics code and temporarily not
>>>>>>>>>> registering JMX beans for all metrics.
>>>>>>>>>>
>>>>>>>>>> (6) Yarn / Surefire Deadlock
>>>>>>>>>>
>>>>>>>>>> Finally, some Yarn tests failed reliably in Maven (though not in
>>>>>>>>>> the IDE). It turned out that those tests exercise a command-line
>>>>>>>>>> interface that interacts with the standard input stream.
>>>>>>>>>>
>>>>>>>>>> The newly deployed Surefire plugin uses standard input as well,
>>>>>>>>>> for communication with forked JVMs. Since Surefire internally
>>>>>>>>>> locks the standard input stream, the Yarn CLI cannot poll the
>>>>>>>>>> standard input stream without locking up and stalling the tests.
>>>>>>>>>>
>>>>>>>>>> We adjusted the tests, and now the build happily builds again.
>>>>>>>>>>
>>>>>>>>>> -----------------
>>>>>>>>>> Conclusions
>>>>>>>>>> -----------------
>>>>>>>>>>
>>>>>>>>>> - CI is terribly crucial. It took us weeks to deal with the
>>>>>>>>>> fallout of a period of unreliable CI.
>>>>>>>>>>
>>>>>>>>>> - Maven could do a better job. A bug as crucial as the one that
>>>>>>>>>> started our problems should not occur in a test plugin like
>>>>>>>>>> Failsafe. Also, the constant change of semantics and dependency
>>>>>>>>>> scopes is annoying. The semantic changes are subtle, but for a
>>>>>>>>>> build as complex as Flink's, they make a difference.
>>>>>>>>>>
>>>>>>>>>> - File-based communication is rarely a good idea. The bug in the
>>>>>>>>>> Failsafe plugin was caused by improper file-based communication,
>>>>>>>>>> and so were some of the instabilities we discovered.
>>>>>>>>>>
>>>>>>>>>> Greetings,
>>>>>>>>>> Stephan
>>>>>>>>>>
>>>>>>>>>> PS: Some issues and mysteries remain for us to solve: When we
>>>>>>>>>> allow our metrics subsystem to register JMX beans, we see some
>>>>>>>>>> tests failing due to spontaneous JVM process kills. Whoever has
>>>>>>>>>> a pointer there, please ping us!
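Regarding act (6) above: a sketch, with hypothetical class names rather than the actual Flink code, of the injection pattern that keeps a CLI under test away from the System.in stream that Surefire locks:

    import java.io.{BufferedReader, ByteArrayInputStream, InputStream, InputStreamReader}

    // The CLI reads from an injected stream instead of touching System.in
    // directly, so tests never compete with Surefire for standard input.
    class InteractiveCli(in: InputStream) {
      private val reader = new BufferedReader(new InputStreamReader(in))

      // Blocks until the user (or the test) supplies a line.
      def awaitCommand(): String = reader.readLine()
    }

    object InteractiveCliExample {
      def main(args: Array[String]): Unit = {
        // Production wires in the real stdin: new InteractiveCli(System.in).
        // Tests hand the CLI their own stream:
        val testCli = new InteractiveCli(new ByteArrayInputStream("stop\n".getBytes))
        println(testCli.awaitCommand()) // prints "stop"
      }
    }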