yeah. i noticed that and restarted it a few minutes ago. i'll have some time later this afternoon to take a closer look... :\
On Sun, May 21, 2017 at 9:08 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:
> It had been working well these past few days. However, it seems to be
> slowly going down again...
>
> When I tried to see a console log (e.g.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77149/consoleFull),
> the server returns "proxy error."
>
> Regards,
> Kazuaki Ishizaki
>
>
> From: shane knapp <skn...@berkeley.edu>
> To: Sean Owen <so...@cloudera.com>
> Cc: dev <dev@spark.apache.org>
> Date: 2017/05/20 09:43
> Subject: Re: [build system] jenkins got itself wedged...
> ________________________________
>
> last update of the week:
>
> things are looking great... we're GCing happily and staying well
> within our memory limits.
>
> i'm going to do one more restart after the two pull request builds
> finish to re-enable backups, and call it a weekend. :)
>
> shane
>
> On Fri, May 19, 2017 at 8:29 AM, shane knapp <skn...@berkeley.edu> wrote:
>> this is hopefully my final email on the subject... :)
>>
>> things seem to have settled down after my GC tuning, and system
>> load/cpu usage/memory has been nice and flat all night. i'll continue
>> to keep an eye on things, but it looks like we've weathered the worst
>> part of the storm.
>>
>> On Thu, May 18, 2017 at 6:40 PM, shane knapp <skn...@berkeley.edu> wrote:
>>> after needing another restart this afternoon, i did some homework and
>>> aggressively twiddled some GC settings[1]. since then, things have
>>> definitely smoothed out w/regards to memory and cpu usage spikes.
>>>
>>> i've attached a screenshot of slightly happier looking graphs.
>>>
>>> still keeping an eye on things, and hoping that i can go back to being
>>> a lurker... ;)
>>>
>>> shane
>>>
>>> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>
>>> On Thu, May 18, 2017 at 11:20 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>> ok, more updates:
>>>>
>>>> 1) i audited all of the builds, and found that the spark-*-compile-*
>>>> and spark-*-test-* jobs were set to identical cron time triggers,
>>>> so josh rosen and i updated them to run at H/5 (instead of */5). load
>>>> balancing ftw.
>>>>
>>>> 2) the jenkins master is now running on java 8, which has moar bettar
>>>> GC management under the hood.
>>>>
>>>> i'll be keeping an eye on this today, and if we start seeing GC
>>>> overhead failures, i'll start doing more GC performance tuning.
>>>> thankfully, cloudbees has a relatively decent guide that i'll be
>>>> following here: https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>>
>>>> shane
>>>>
>>>> On Thu, May 18, 2017 at 8:39 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>>> yeah, i spoke too soon. jenkins is still misbehaving, but FINALLY i'm
>>>>> getting some error messages in the logs... looks like jenkins is
>>>>> thrashing on GC.
>>>>>
>>>>> now that i know what's up, i should be able to get this sorted today.
>>>>>
>>>>> On Thu, May 18, 2017 at 12:39 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>> I'm not sure if it's related, but I still can't get Jenkins to test
>>>>>> PRs. For example, triggering it through the spark-prs.appspot.com UI
>>>>>> gives me...
>>>>>>
>>>>>> https://spark-prs.appspot.com/trigger-jenkins/18012
>>>>>>
>>>>>> Internal Server Error
>>>>>>
>>>>>> That might be from the appspot app though?
>>>>>>
>>>>>> But posting "Jenkins test this please" on PRs doesn't seem to work,
>>>>>> and I can't reach Jenkins:
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>>>>>
>>>>>> On Thu, May 18, 2017 at 12:44 AM shane knapp <skn...@berkeley.edu> wrote:
>>>>>>>
>>>>>>> after another couple of restarts due to high load and system
>>>>>>> unresponsiveness, i finally found the most likely culprit:
>>>>>>>
>>>>>>> a typo in the jenkins config where the java heap size was configured.
>>>>>>> instead of -Xmx16g, we had -Dmx16G... which could easily explain the
>>>>>>> random and non-deterministic system hangs we've had over the past
>>>>>>> couple of years.
>>>>>>>
>>>>>>> anyways, it's been corrected and the master seems to be humming along,
>>>>>>> for real this time, w/o issue. i'll continue to keep an eye on this
>>>>>>> for the rest of the week, but things are looking MUCH better now.
>>>>>>>
>>>>>>> sorry again for the interruptions in service.
>>>>>>>
>>>>>>> shane
>>>>>>>
>>>>>>> On Wed, May 17, 2017 at 9:59 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>>>>> > ok, we're back up, system load looks cromulent and we're happily
>>>>>>> > building (again).
>>>>>>> >
>>>>>>> > shane
>>>>>>> >
>>>>>>> > On Wed, May 17, 2017 at 9:50 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>>>>> >> i'm going to need to perform a quick reboot on the jenkins master.
>>>>>>> >> it looks like it's hung again.
>>>>>>> >>
>>>>>>> >> sorry about this!
>>>>>>> >>
>>>>>>> >> shane
>>>>>>> >>
>>>>>>> >> On Tue, May 16, 2017 at 12:55 PM, shane knapp <skn...@berkeley.edu> wrote:
>>>>>>> >>> ...but just now i started getting alerts on system load, which
>>>>>>> >>> was rather high. i had to kick jenkins again, and will keep an
>>>>>>> >>> eye on the master and a possible need to reboot.
>>>>>>> >>>
>>>>>>> >>> sorry about the interruption of service...
>>>>>>> >>>
>>>>>>> >>> shane
>>>>>>> >>>
>>>>>>> >>> On Tue, May 16, 2017 at 8:18 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>>>>> >>>> ...so i kicked it and it's now back up and happily building.
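
for anyone who wants to apply the same fixes described in the thread above on
their own jenkins master, a rough sketch follows. only the -Xmx16g correction,
the move to java 8, the linked GC-tuning guide, and the H/5 cron spread are
confirmed above; the file path, variable name, exact GC flags, and log location
below are assumptions, not the actual amplab configuration.

    # /etc/sysconfig/jenkins -- path and variable name are assumptions; they
    # differ between installs (e.g. /etc/default/jenkins uses JAVA_ARGS).

    # before the fix: -Dmx16G only defined a system property named "mx16G",
    # so the JVM silently ran with its small default max heap.
    #JENKINS_JAVA_OPTIONS="-Dmx16G"

    # after the fix: a real 16 GB heap, plus G1 options along the lines of
    # the jenkins.io GC-tuning guide linked in the thread (illustrative
    # values, not the flags actually deployed on the amplab master):
    JENKINS_JAVA_OPTIONS="-Xmx16g -XX:+UseG1GC \
      -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled \
      -XX:+UseStringDeduplication \
      -Xloggc:/var/log/jenkins/gc.log -XX:+PrintGCDetails"

    # job trigger change ("Build periodically" cron spec in each job):
    #   old: */5 * * * *   -> every job fires at minutes 0, 5, 10, ...
    #   new: H/5 * * * *   -> jenkins hashes each job name into its own
    #                         5-minute slot, spreading the load

the H/5 change matters because jenkins evaluates */5 identically for every job,
so a fleet of jobs with the same spec all queue at once; H/5 keeps the same
frequency per job but staggers the start minutes.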