last update of the week: things are looking great... we're GCing happily and staying well within our memory limits.
i'm going to do one more restart after the two pull request builds finish to re-enable backups, and call it a weekend. :)

shane

On Fri, May 19, 2017 at 8:29 AM, shane knapp <skn...@berkeley.edu> wrote:
> this is hopefully my final email on the subject... :)
>
> things seem to have settled down after my GC tuning, and system
> load/cpu usage/memory has been nice and flat all night. i'll continue
> to keep an eye on things but it looks like we've weathered the worst
> part of the storm.
>
> On Thu, May 18, 2017 at 6:40 PM, shane knapp <skn...@berkeley.edu> wrote:
>> after needing another restart this afternoon, i did some homework and
>> aggressively twiddled some GC settings[1]. since then, things have
>> definitely smoothed out w/regards to memory and cpu usage spikes.
>>
>> i've attached a screenshot of slightly happier looking graphs.
>>
>> still keeping an eye on things, and hoping that i can go back to being
>> a lurker... ;)
>>
>> shane
>>
>> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/
>>
>> On Thu, May 18, 2017 at 11:20 AM, shane knapp <skn...@berkeley.edu> wrote:
>>> ok, more updates:
>>>
>>> 1) i audited all of the builds, and found that the spark-*-compile-*
>>> and spark-*-test-* jobs were set to the identical cron time trigger,
>>> so josh rosen and i updated them to run at H/5 (instead of */5). load
>>> balancing ftw.
>>>
>>> 2) the jenkins master is now running on java8, which has moar bettar
>>> GC management under the hood.
>>>
>>> i'll be keeping an eye on this today, and if we start seeing GC
>>> overhead failures, i'll start doing more GC performance tuning.
>>> thankfully, cloudbees has a relatively decent guide that i'll be
>>> following here: https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>
>>> shane
>>>
>>> On Thu, May 18, 2017 at 8:39 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>> yeah, i spoke too soon. jenkins is still misbehaving, but FINALLY i'm
>>>> getting some error messages in the logs... looks like jenkins is
>>>> thrashing on GC.
>>>>
>>>> now that i know what's up, i should be able to get this sorted today.
>>>>
>>>> On Thu, May 18, 2017 at 12:39 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>> I'm not sure if it's related, but I still can't get Jenkins to test PRs.
>>>>> For example, triggering it through the spark-prs.appspot.com UI gives me...
>>>>>
>>>>> https://spark-prs.appspot.com/trigger-jenkins/18012
>>>>>
>>>>> Internal Server Error
>>>>>
>>>>> That might be from the appspot app though?
>>>>>
>>>>> But posting "Jenkins test this please" on PRs doesn't seem to work, and I
>>>>> can't reach Jenkins:
>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>>>>
>>>>> On Thu, May 18, 2017 at 12:44 AM shane knapp <skn...@berkeley.edu> wrote:
>>>>>>
>>>>>> after another couple of restarts due to high load and system
>>>>>> unresponsiveness, i finally found what is the most likely culprit:
>>>>>>
>>>>>> a typo in the jenkins config where the java heap size was configured.
>>>>>> instead of -Xmx16g, we had -Dmx16G... which could easily explain the
>>>>>> random and non-deterministic system hangs we've had over the past
>>>>>> couple of years.
>>>>>>
>>>>>> anyways, it's been corrected and the master seems to be humming along,
>>>>>> for real this time, w/o issue. i'll continue to keep an eye on this
>>>>>> for the rest of the week, but things are looking MUCH better now.
>>>>>>
>>>>>> sorry again for the interruptions in service.
>>>>>>
>>>>>> shane
>>>>>>
>>>>>> On Wed, May 17, 2017 at 9:59 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>>>> > ok, we're back up, system load looks cromulent and we're happily
>>>>>> > building (again).
>>>>>> >
>>>>>> > shane
>>>>>> >
>>>>>> > On Wed, May 17, 2017 at 9:50 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>>>> >> i'm going to need to perform a quick reboot on the jenkins master. it
>>>>>> >> looks like it's hung again.
>>>>>> >>
>>>>>> >> sorry about this!
>>>>>> >>
>>>>>> >> shane
>>>>>> >>
>>>>>> >> On Tue, May 16, 2017 at 12:55 PM, shane knapp <skn...@berkeley.edu> wrote:
>>>>>> >>> ...but just now i started getting alerts on system load, which was
>>>>>> >>> rather high. i had to kick jenkins again, and will keep an eye on
>>>>>> >>> the master and possibly need to reboot.
>>>>>> >>>
>>>>>> >>> sorry about the interruption of service...
>>>>>> >>>
>>>>>> >>> shane
>>>>>> >>>
>>>>>> >>> On Tue, May 16, 2017 at 8:18 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>>>> >>>> ...so i kicked it and it's now back up and happily building.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
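
For anyone auditing their own JVM arguments for the kind of typo described in this thread (-Dmx16G where -Xmx16g was intended), here is a minimal Python sketch of a check. It is only an illustration: the function names and the sample argument string are hypothetical, and it is not the tooling that was used on the Jenkins master. The point it shows is that -Dmx16G is a valid way to define a system property named "mx16G", so the JVM accepts it silently and the max heap stays at its default.

```python
import re

# Heuristic only: a -D system property whose name looks like a heap/stack
# option (mx, ms, ss) followed by a size, e.g. -Dmx16G. The JVM accepts this
# as a system property definition and does not change the heap size.
SUSPECT_PROPERTY = re.compile(r"^-D(mx|ms|ss)(\d+[kKmMgG]?)$")


def find_mistyped_heap_flags(jvm_args):
    """Return (bad_arg, suggested_fix) pairs for args such as -Dmx16G."""
    findings = []
    for arg in jvm_args:
        match = SUSPECT_PROPERTY.match(arg)
        if match:
            findings.append((arg, "-X" + match.group(1) + match.group(2)))
    return findings


def has_max_heap(jvm_args):
    """True if an explicit -Xmx<size> option is present."""
    return any(re.match(r"^-Xmx\d+[kKmMgG]?$", arg) for arg in jvm_args)


if __name__ == "__main__":
    # Hypothetical argument string, not the real Jenkins config from this thread.
    args = "-Djava.awt.headless=true -Dmx16G".split()
    for bad, fix in find_mistyped_heap_flags(args):
        print(f"suspicious: {bad} (did you mean {fix}?)")
    if not has_max_heap(args):
        print("no -Xmx option found; the JVM will use its default max heap")
```

The pattern is deliberately narrow (only mx, ms and ss followed by a size), so ordinary properties such as -Dfoo=bar are not flagged.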