it's not an open-file limit -- i have the jenkins workers set up with a
soft file limit of 100k and a hard limit of 200k.
On Wed, Feb 15, 2017 at 12:48 PM, Armin Braun <m...@obrown.io> wrote:

> I think one thing that is contributing to this, too, is the general
> issue of the tests taking up a lot of file descriptors (10k+ if I run
> them on a standard Debian machine).
> There are a few suites that contribute to this in particular, like
> `org.apache.spark.ExecutorAllocationManagerSuite`, which, like a few
> others, appears to consume a lot of fds.
>
> Wouldn't it make sense to open JIRAs about those and actively try to
> reduce the resource consumption of these tests?
> It seems to me these can cause a lot of unpredictable behavior (making
> the reason for flaky tests hard to identify, especially when there are
> timeouts etc. involved), and they make it prohibitively expensive for
> many to test locally, imo.
>
> On Wed, Feb 15, 2017 at 9:24 PM, Saikat Kanjilal <sxk1...@hotmail.com>
> wrote:
>
>> I was working on something to address this a while ago
>> (https://issues.apache.org/jira/browse/SPARK-9487), but the difficulty
>> of testing locally made fixing each of the unit tests a lot more
>> complicated. Should we resurface this JIRA? I would wholeheartedly
>> agree with the flakiness assessment of the unit tests.
>>
>> [SPARK-9487] Use the same num. worker threads in Scala ...
>> <https://issues.apache.org/jira/browse/SPARK-9487>
>> In Python we use `local[4]` for unit tests, while in Scala/Java we use
>> `local[2]` and `local` for some unit tests in SQL, MLlib, and other
>> components. If the ...
>>
>> ------------------------------
>> *From:* Kay Ousterhout <kayousterh...@gmail.com>
>> *Sent:* Wednesday, February 15, 2017 12:10 PM
>> *To:* dev@spark.apache.org
>> *Subject:* File JIRAs for all flaky test failures
>>
>> Hi all,
>>
>> I've noticed the Spark tests getting increasingly flaky -- it seems
>> more common than not now that the tests need to be re-run at least once
>> on PRs before they pass. This is both annoying and problematic because
>> it makes it harder to tell when a PR is introducing new flakiness.
>>
>> To try to clean this up, I'd propose filing a JIRA *every time* Jenkins
>> fails on a PR (for a reason unrelated to the PR). Just provide a quick
>> description of the failure -- e.g., "Flaky test: DagSchedulerSuite" or
>> "Tests failed because the 250m timeout expired" -- a link to the failed
>> build, and include the "Tests" component. If there's already a JIRA for
>> the issue, just comment with a link to the latest failure. I know folks
>> don't always have time to track down why a test failed, but this is at
>> least helpful to someone else who, later on, is trying to diagnose when
>> the issue started and to find the problematic code / test.
>>
>> If this seems like too much overhead, feel free to suggest alternative
>> ways to make the tests less flaky!
>>
>> -Kay