I’ve created a JIRA issue [1] related to KNN test cases. I will send a PR for it.
From my investigation [2], the cluster for ML tests has only one TaskManager with 4 slots. Is 2048 insufficient for the total number of network buffers? I still think the problem is sharing the ExecutionEnvironment between test cases.

[1]: https://issues.apache.org/jira/browse/FLINK-3994
[2]: https://github.com/apache/flink/blob/master/flink-test-utils/src/test/scala/org/apache/flink/test/util/FlinkTestBase.scala#L56

Regards,
Chiwan Park

> On May 31, 2016, at 6:05 PM, Maximilian Michels <m...@apache.org> wrote:
>
> Thanks Stephan for the synopsis of the last weeks' test instability
> madness. It's sad to see the shortcomings of Maven test plugins, but
> another lesson learned is that our testing infrastructure should get a
> bit more attention. We have reached a point several times where our
> tests were inherently unstable. Now we saw that even more problems
> were hidden in the dark. I would like to see more maintenance
> dedicated to testing.
>
> @Chiwan: Please, no hotfix! Please open a JIRA issue and a pull
> request with a systematic fix. Those things are too crucial to be
> fixed on the go. The problem is that Travis reports the number of
> processors to be "32" (which is used for the number of task slots in
> local execution). The network buffers are not adjusted accordingly. We
> should set them correctly in the MiniCluster. Also, we could define an
> upper limit to the number of task slots for tests.
>
> On Tue, May 31, 2016 at 10:59 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>> I think that the tests fail because of sharing the ExecutionEnvironment
>> between test cases. I’m not sure why it is a problem, but it is the only
>> difference from the other ML tests.
>>
>> I created a hotfix and pushed it to my repository. When it seems fixed [1],
>> I’ll merge the hotfix to master branch.
>>
>> [1]: https://travis-ci.org/chiwanpark/flink/builds/134104491
>>
>> Regards,
>> Chiwan Park
>>
>>> On May 31, 2016, at 5:43 PM, Chiwan Park <chiwanp...@apache.org> wrote:
>>>
>>> It may be related to the KNN test case that was merged yesterday. I’ll
>>> look into the ML test.
>>>
>>> Regards,
>>> Chiwan Park
>>>
>>>> On May 31, 2016, at 5:38 PM, Ufuk Celebi <u...@apache.org> wrote:
>>>>
>>>> Currently, an ML test is reliably failing and occasionally some HA
>>>> tests. Is someone looking into the ML test?
>>>>
>>>> For HA, I will revert a commit, which might cause the HA
>>>> instabilities. Till is working on a proper fix as far as I know.
>>>>
>>>> On Tue, May 31, 2016 at 3:50 AM, Chiwan Park <chiwanp...@apache.org> wrote:
>>>>> Thanks for the great work! :-)
>>>>>
>>>>> Regards,
>>>>> Chiwan Park
>>>>>
>>>>>> On May 31, 2016, at 7:47 AM, Flavio Pompermaier <pomperma...@okkam.it>
>>>>>> wrote:
>>>>>>
>>>>>> Awesome work guys!
>>>>>> And even more thanks for the detailed report... This troubleshooting
>>>>>> summary will undoubtedly be useful for all our Maven projects!
>>>>>>
>>>>>> Best,
>>>>>> Flavio
>>>>>> On 30 May 2016 23:47, "Ufuk Celebi" <u...@apache.org> wrote:
>>>>>>
>>>>>>> Thanks for the effort, Max and Stephan! Happy to see the green light
>>>>>>> again.
>>>>>>>
>>>>>>> On Mon, May 30, 2016 at 11:03 PM, Stephan Ewen <se...@apache.org> wrote:
>>>>>>>> Hi all!
>>>>>>>>
>>>>>>>> After a few weeks of terrible build issues, I am happy to announce that
>>>>>>>> the build works again properly, and we actually get meaningful CI
>>>>>>>> results.
>>>>>>>>
>>>>>>>> Here is a story in many acts, from builds deep red to bright green joy.
>>>>>>>> Kudos to Max, who did most of this troubleshooting. This evening, Max
>>>>>>>> and I debugged the final issue and got the build back on track.
>>>>>>>>
>>>>>>>> ------------------
>>>>>>>> The Journey
>>>>>>>> ------------------
>>>>>>>>
>>>>>>>> (1) Failsafe Plugin
>>>>>>>>
>>>>>>>> The Maven Failsafe Build Plugin had a critical bug due to which failed
>>>>>>>> tests did not result in a failed build.
>>>>>>>>
>>>>>>>> That is a pretty bad bug for a plugin whose only task is to run tests
>>>>>>>> and fail the build if a test fails.
>>>>>>>>
>>>>>>>> After we recognized that, we upgraded the Failsafe Plugin.
>>>>>>>>
>>>>>>>>
>>>>>>>> (2) Failsafe Plugin Dependency Issues
>>>>>>>>
>>>>>>>> After the upgrade, the Failsafe Plugin behaved differently and did not
>>>>>>>> interoperate with Dependency Shading any more.
>>>>>>>>
>>>>>>>> Because of that, we switched to the Surefire Plugin.
>>>>>>>>
>>>>>>>>
>>>>>>>> (3) Fixing all the issues introduced in the meantime
>>>>>>>>
>>>>>>>> Naturally, a number of test instabilities had been introduced, which
>>>>>>>> needed to be fixed.
>>>>>>>>
>>>>>>>>
>>>>>>>> (4) Yarn Tests and Test Scope Refactoring
>>>>>>>>
>>>>>>>> In the meantime, a Pull Request was merged that moved the Yarn Tests to
>>>>>>>> the test scope.
>>>>>>>> Because the configuration searched for tests in the "main" scope, no
>>>>>>>> Yarn tests were executed for a while, until the scope was fixed.
>>>>>>>>
>>>>>>>>
>>>>>>>> (5) Yarn Tests and JMX Metrics
>>>>>>>>
>>>>>>>> After the Yarn Tests were re-activated, we saw them fail due to
>>>>>>>> warnings created by the newly introduced metrics code. We could fix
>>>>>>>> that by updating the metrics code and temporarily not registering JMX
>>>>>>>> beans for all metrics.
>>>>>>>>
>>>>>>>>
>>>>>>>> (6) Yarn / Surefire Deadlock
>>>>>>>>
>>>>>>>> Finally, some Yarn tests failed reliably in Maven (though not in the
>>>>>>>> IDE). It turned out that those tests exercise a command line interface
>>>>>>>> that interacts with the standard input stream.
>>>>>>>>
>>>>>>>> The newly deployed Surefire Plugin uses standard input as well, for
>>>>>>>> communication with forked JVMs. Since Surefire internally locks the
>>>>>>>> standard input stream, the Yarn CLI cannot poll the standard input
>>>>>>>> stream without locking up and stalling the tests.
>>>>>>>>
>>>>>>>> We adjusted the tests and now the build happily builds again.
>>>>>>>>
>>>>>>>> -----------------
>>>>>>>> Conclusions:
>>>>>>>> -----------------
>>>>>>>>
>>>>>>>> - CI is terribly crucial. It took us weeks to deal with the fallout of
>>>>>>>> a period of unreliable CI.
>>>>>>>>
>>>>>>>> - Maven could do a better job. A bug as crucial as the one that started
>>>>>>>> our problem should not occur in a test plugin like Surefire. Also, the
>>>>>>>> constant change of semantics and dependency scopes is annoying. The
>>>>>>>> semantic changes are subtle, but for a build as complex as Flink, they
>>>>>>>> make a difference.
>>>>>>>>
>>>>>>>> - File-based communication is rarely a good idea. The bug in the
>>>>>>>> Failsafe plugin was caused by improper file-based communication, and
>>>>>>>> some of our discovered instabilities as well.
>>>>>>>>
>>>>>>>> Greetings,
>>>>>>>> Stephan
>>>>>>>>
>>>>>>>>
>>>>>>>> PS: Some issues and mysteries remain for us to solve: When we allow our
>>>>>>>> metrics subsystem to register JMX beans, we see some tests failing due
>>>>>>>> to spontaneous JVM process kills. Whoever has a pointer there, please
>>>>>>>> ping us!
>>>>>>>
>>>>>
>>>
>>
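
For readers following the network buffer question at the top of this thread, here is a rough sketch of the arithmetic. It assumes the rule of thumb given in the Flink configuration documentation of the time (slots per TaskManager squared, times the number of TaskManagers, times 4) and a default of 2048 for taskmanager.network.numberOfBuffers. The function names are hypothetical and not part of Flink, and the real demand depends on the jobs that actually run.

```python
# Assumption: rule of thumb from the Flink configuration docs of that era:
#   buffers ~= slots_per_taskmanager^2 * number_of_taskmanagers * 4
# Assumption: 2048 is the default for taskmanager.network.numberOfBuffers.
DEFAULT_NETWORK_BUFFERS = 2048


def recommended_network_buffers(slots_per_tm: int, num_tms: int) -> int:
    """Return the rule-of-thumb number of network buffers (hypothetical helper)."""
    return slots_per_tm ** 2 * num_tms * 4


def default_is_sufficient(slots_per_tm: int, num_tms: int,
                          configured: int = DEFAULT_NETWORK_BUFFERS) -> bool:
    """Check whether the configured buffer count covers the rule-of-thumb estimate."""
    return recommended_network_buffers(slots_per_tm, num_tms) <= configured


if __name__ == "__main__":
    # 4 slots: the ML test cluster described above.
    # 32 slots: the processor count that Travis reports.
    for slots in (4, 32):
        needed = recommended_network_buffers(slots, 1)
        ok = default_is_sufficient(slots, 1)
        print(f"{slots} slots, 1 TaskManager: ~{needed} buffers, default sufficient: {ok}")
```

Under that rule of thumb, one TaskManager with 4 slots stays far below 2048 buffers, while 32 slots (the processor count Travis reports) would exceed it. This is consistent with the suggestion to cap the number of task slots in tests or to size the buffers in the MiniCluster accordingly.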