Hi Chiwan! I think the ExecutionEnvironment is not shared, because what the TestEnvironment sets is a context environment factory. Every time you call "ExecutionEnvironment.getExecutionEnvironment()", you get a new environment.
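As a quick illustration (a sketch only, with hypothetical test methods, not one of our real tests): every call below goes through the installed context environment factory, so the two cases do not operate on the same environment object.

    import org.apache.flink.api.scala._

    // Sketch, not a real Flink test: each call to
    // ExecutionEnvironment.getExecutionEnvironment is answered by the
    // currently installed context environment factory, so testCaseA and
    // testCaseB work on distinct environment instances.
    class EnvironmentPerCallSketch {

      def testCaseA(): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment
        val doubled = env.fromElements(1, 2, 3).map(_ * 2).collect()
        assert(doubled == Seq(2, 4, 6))
      }

      def testCaseB(): Unit = {
        // a second call -> a fresh environment, not the one used in testCaseA
        val env = ExecutionEnvironment.getExecutionEnvironment
        val upper = env.fromElements("a", "b").map(_.toUpperCase).collect()
        assert(upper == Seq("A", "B"))
      }
    }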
Stephan

On Tue, May 31, 2016 at 11:53 AM, Chiwan Park <chiwanp...@apache.org> wrote:
> I’ve created a JIRA issue [1] related to KNN test cases. I will send a PR for it.
>
> From my investigation [2], the cluster for ML tests has only one taskmanager with 4 slots. Is 2048 insufficient for the total number of network buffers? I still think the problem is sharing the ExecutionEnvironment between test cases.
>
> [1]: https://issues.apache.org/jira/browse/FLINK-3994
> [2]: https://github.com/apache/flink/blob/master/flink-test-utils/src/test/scala/org/apache/flink/test/util/FlinkTestBase.scala#L56
>
> Regards,
> Chiwan Park
>
> > On May 31, 2016, at 6:05 PM, Maximilian Michels <m...@apache.org> wrote:
> >
> > Thanks Stephan for the synopsis of our last weeks’ test instability madness. It's sad to see the shortcomings of Maven test plugins, but another lesson learned is that our testing infrastructure should get a bit more attention. We have reached a point several times where our tests were inherently unstable. Now we saw that even more problems were hidden in the dark. I would like to see more maintenance dedicated to testing.
> >
> > @Chiwan: Please, no hotfix! Please open a JIRA issue and a pull request with a systematic fix. Those things are too crucial to be fixed on the go. The problem is that Travis reports the number of processors as "32" (which is used as the number of task slots in local execution). The network buffers are not adjusted accordingly. We should set them correctly in the MiniCluster. Also, we could define an upper limit on the number of task slots for tests.
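A rough sketch of what pinning these values for local test execution could look like — the helper object and the buffer sizing rule are only my assumption here; the two configuration keys ("taskmanager.numberOfTaskSlots" and "taskmanager.network.numberOfBuffers") are the real ones:

    import org.apache.flink.configuration.Configuration

    // Sketch only: build a Configuration for a local test cluster with a fixed
    // slot count and a matching number of network buffers, instead of letting
    // the slot count default to the processor count Travis reports (32).
    object TestClusterConfigSketch {

      // hypothetical helper, not part of Flink
      def forLocalTests(slotsPerTaskManager: Int = 4, taskManagers: Int = 1): Configuration = {
        val conf = new Configuration()
        conf.setInteger("taskmanager.numberOfTaskSlots", slotsPerTaskManager)

        // rule of thumb: #slots-per-TM^2 * #TMs * 4, but never below the 2048 default
        val buffers = slotsPerTaskManager * slotsPerTaskManager * taskManagers * 4
        conf.setInteger("taskmanager.network.numberOfBuffers", math.max(2048, buffers))
        conf
      }
    }

A test base could then start its mini cluster from such a Configuration instead of the implicit defaults.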
> > On Tue, May 31, 2016 at 10:59 AM, Chiwan Park <chiwanp...@apache.org> wrote:
> >> I think that the tests fail because of sharing the ExecutionEnvironment between test cases. I’m not sure why it is a problem, but it is the only difference from the other ML tests.
> >>
> >> I created a hotfix and pushed it to my repository. When it seems fixed [1], I’ll merge the hotfix to the master branch.
> >>
> >> [1]: https://travis-ci.org/chiwanpark/flink/builds/134104491
> >>
> >> Regards,
> >> Chiwan Park
> >>
> >>> On May 31, 2016, at 5:43 PM, Chiwan Park <chiwanp...@apache.org> wrote:
> >>>
> >>> It seems to be about the KNN test case which was merged yesterday. I’ll look into the ML test.
> >>>
> >>> Regards,
> >>> Chiwan Park
> >>>
> >>>> On May 31, 2016, at 5:38 PM, Ufuk Celebi <u...@apache.org> wrote:
> >>>>
> >>>> Currently, an ML test is reliably failing and occasionally some HA tests. Is someone looking into the ML test?
> >>>>
> >>>> For HA, I will revert a commit that might be causing the HA instabilities. Till is working on a proper fix as far as I know.
> >>>>
> >>>> On Tue, May 31, 2016 at 3:50 AM, Chiwan Park <chiwanp...@apache.org> wrote:
> >>>>> Thanks for the great work! :-)
> >>>>>
> >>>>> Regards,
> >>>>> Chiwan Park
> >>>>>
> >>>>>> On May 31, 2016, at 7:47 AM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> >>>>>>
> >>>>>> Awesome work guys! And even more thanks for the detailed report... This troubleshooting summary will undoubtedly be useful for all our Maven projects!
> >>>>>>
> >>>>>> Best,
> >>>>>> Flavio
> >>>>>>
> >>>>>> On 30 May 2016 23:47, "Ufuk Celebi" <u...@apache.org> wrote:
> >>>>>>> Thanks for the effort, Max and Stephan! Happy to see the green light again.
> >>>>>>>
> >>>>>>> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <se...@apache.org> wrote:
> >>>>>>>> Hi all!
> >>>>>>>>
> >>>>>>>> After a few weeks of terrible build issues, I am happy to announce that the build works again properly, and we actually get meaningful CI results.
> >>>>>>>>
> >>>>>>>> Here is a story in many acts, from builds deep red to bright green joy. Kudos to Max, who did most of this troubleshooting. This evening, Max and I debugged the final issue and got the build back on track.
> >>>>>>>>
> >>>>>>>> ------------------
> >>>>>>>> The Journey
> >>>>>>>> ------------------
> >>>>>>>>
> >>>>>>>> (1) Failsafe Plugin
> >>>>>>>>
> >>>>>>>> The Maven Failsafe Plugin had a critical bug due to which failed tests did not result in a failed build.
> >>>>>>>>
> >>>>>>>> That is a pretty bad bug for a plugin whose only task is to run tests and fail the build if a test fails.
> >>>>>>>>
> >>>>>>>> After we recognized that, we upgraded the Failsafe Plugin.
> >>>>>>>>
> >>>>>>>> (2) Failsafe Plugin Dependency Issues
> >>>>>>>>
> >>>>>>>> After the upgrade, the Failsafe Plugin behaved differently and did not interoperate with dependency shading any more.
> >>>>>>>>
> >>>>>>>> Because of that, we switched to the Surefire Plugin.
> >>>>>>>>
> >>>>>>>> (3) Fixing all the issues introduced in the meantime
> >>>>>>>>
> >>>>>>>> Naturally, a number of test instabilities had been introduced, which needed to be fixed.
> >>>>>>>>
> >>>>>>>> (4) Yarn Tests and Test Scope Refactoring
> >>>>>>>>
> >>>>>>>> In the meantime, a pull request was merged that moved the Yarn tests to the test scope. Because the configuration searched for tests in the "main" scope, no Yarn tests were executed for a while, until the scope was fixed.
> >>>>>>>>
> >>>>>>>> (5) Yarn Tests and JMX Metrics
> >>>>>>>>
> >>>>>>>> After the Yarn tests were re-activated, we saw them fail due to warnings created by the newly introduced metrics code. We could fix that by updating the metrics code and temporarily not registering JMX beans for all metrics.
> >>>>>>>>
> >>>>>>>> (6) Yarn / Surefire Deadlock
> >>>>>>>>
> >>>>>>>> Finally, some Yarn tests failed reliably in Maven (though not in the IDE). It turned out that those tests exercise a command line interface that interacts with the standard input stream.
> >>>>>>>>
> >>>>>>>> The newly deployed Surefire Plugin uses standard input as well, for communication with forked JVMs. Since Surefire internally locks the standard input stream, the Yarn CLI cannot poll the standard input stream without locking up and stalling the tests.
> >>>>>>>>
> >>>>>>>> We adjusted the tests and now the build happily builds again.
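The shape of the adjustment, roughly (a sketch only — "InteractiveCli" below is a made-up stand-in, not the real YARN session CLI): have the code under test read from an injected InputStream, so tests feed canned input instead of polling System.in, which Surefire keeps locked for its forked-JVM communication.

    import java.io.{ByteArrayInputStream, InputStream}

    // Sketch only: because the input stream is injected, a test never touches
    // System.in and therefore never competes with Surefire's lock on it.
    class InteractiveCli(in: InputStream) {
      def awaitStopCommand(): Boolean =
        scala.io.Source.fromInputStream(in).getLines().exists(_.trim == "stop")
    }

    object InteractiveCliSketch {
      def main(args: Array[String]): Unit = {
        // in a test, hand the CLI a ByteArrayInputStream with the scripted input
        val cli = new InteractiveCli(new ByteArrayInputStream("status\nstop\n".getBytes("UTF-8")))
        assert(cli.awaitStopCommand())
      }
    }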
> >>>>>>>> -----------------
> >>>>>>>> Conclusions:
> >>>>>>>> -----------------
> >>>>>>>>
> >>>>>>>> - CI is terribly crucial. It took us weeks to work through the fallout of having a period of unreliable CI.
> >>>>>>>>
> >>>>>>>> - Maven could do a better job. A bug as crucial as the one that started our problem should not occur in a test plugin like Surefire. Also, the constant change of semantics and dependency scopes is annoying. The semantic changes are subtle, but for a build as complex as Flink, they make a difference.
> >>>>>>>>
> >>>>>>>> - File-based communication is rarely a good idea. The bug in the Failsafe plugin was caused by improper file-based communication, as were some of the instabilities we discovered.
> >>>>>>>>
> >>>>>>>> Greetings,
> >>>>>>>> Stephan
> >>>>>>>>
> >>>>>>>> PS: Some issues and mysteries remain for us to solve: When we allow our metrics subsystem to register JMX beans, we see some tests failing due to spontaneous JVM process kills. Whoever has a pointer there, please ping us!
> >>>>>>>
> >>>>>
> >>>
> >>
> >