yeah. i noticed that and restarted it a few minutes ago. i'll have some time later this afternoon to take a closer look... :\
On Sun, May 21, 2017 at 9:08 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:
> It had been working well these past few days. However, it seems to be
> slowly going down again...
>
> When I tried to see a console log (e.g.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77149/consoleFull),
> the server returns "proxy error."
>
> Regards,
> Kazuaki Ishizaki
>
>
> From: shane knapp <skn...@berkeley.edu>
> To: Sean Owen <so...@cloudera.com>
> Cc: dev <dev@spark.apache.org>
> Date: 2017/05/20 09:43
> Subject: Re: [build system] jenkins got itself wedged...
> ________________________________
>
> last update of the week:
>
> things are looking great... we're GCing happily and staying well
> within our memory limits.
>
> i'm going to do one more restart after the two pull request builds
> finish to re-enable backups, and call it a weekend. :)
>
> shane
>
> On Fri, May 19, 2017 at 8:29 AM, shane knapp <skn...@berkeley.edu> wrote:
>> this is hopefully my final email on the subject... :)
>>
>> things seem to have settled down after my GC tuning, and system
>> load/cpu usage/memory has been nice and flat all night. i'll continue
>> to keep an eye on things, but it looks like we've weathered the worst
>> part of the storm.
>>
>> On Thu, May 18, 2017 at 6:40 PM, shane knapp <skn...@berkeley.edu> wrote:
>>> after needing another restart this afternoon, i did some homework and
>>> aggressively twiddled some GC settings[1]. since then, things have
>>> definitely smoothed out w/regards to memory and cpu usage spikes.
>>>
>>> i've attached a screenshot of slightly happier looking graphs.
>>>
>>> still keeping an eye on things, and hoping that i can go back to being
>>> a lurker... ;)
>>>
>>> shane
>>>
>>> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>
>>> On Thu, May 18, 2017 at 11:20 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>> ok, more updates:
>>>>
>>>> 1) i audited all of the builds, and found that the spark-*-compile-*
>>>> and spark-*-test-* jobs were set to identical cron time triggers,
>>>> so josh rosen and i updated them to run at H/5 (instead of */5). load
>>>> balancing ftw.
>>>>
>>>> 2) the jenkins master is now running on java 8, which has moar bettar
>>>> GC management under the hood.
>>>>
>>>> i'll be keeping an eye on this today, and if we start seeing GC
>>>> overhead failures, i'll start doing more GC performance tuning.
>>>> thankfully, cloudbees has a relatively decent guide that i'll be
>>>> following here: https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>>
>>>> shane
>>>>
>>>> On Thu, May 18, 2017 at 8:39 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>>> yeah, i spoke too soon. jenkins is still misbehaving, but FINALLY i'm
>>>>> getting some error messages in the logs... looks like jenkins is
>>>>> thrashing on GC.
>>>>>
>>>>> now that i know what's up, i should be able to get this sorted today.
>>>>>
>>>>> On Thu, May 18, 2017 at 12:39 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>> I'm not sure if it's related, but I still can't get Jenkins to test
>>>>>> PRs. For example, triggering it through the spark-prs.appspot.com UI
>>>>>> gives me...
>>>>>>
>>>>>> https://spark-prs.appspot.com/trigger-jenkins/18012
>>>>>>
>>>>>> Internal Server Error
>>>>>>
>>>>>> That might be from the appspot app though?
>>>>>>
>>>>>> But posting "Jenkins test this please" on PRs doesn't seem to work,
>>>>>> and I can't reach Jenkins:
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>>>>>
>>>>>> On Thu, May 18, 2017 at 12:44 AM shane knapp <skn...@berkeley.edu> wrote:
>>>>>>>
>>>>>>> after another couple of restarts due to high load and system
>>>>>>> unresponsiveness, i finally found the most likely culprit:
>>>>>>>
>>>>>>> a typo in the jenkins config where the java heap size was configured.
>>>>>>> instead of -Xmx16g, we had -Dmx16G... which could easily explain the
>>>>>>> random and non-deterministic system hangs we've had over the past
>>>>>>> couple of years.
>>>>>>>
>>>>>>> anyways, it's been corrected and the master seems to be humming along,
>>>>>>> for real this time, w/o issue. i'll continue to keep an eye on this
>>>>>>> for the rest of the week, but things are looking MUCH better now.
>>>>>>>
>>>>>>> sorry again for the interruptions in service.
>>>>>>>
>>>>>>> shane
>>>>>>>
>>>>>>> On Wed, May 17, 2017 at 9:59 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>>>>> > ok, we're back up, system load looks cromulent and we're happily
>>>>>>> > building (again).
>>>>>>> >
>>>>>>> > shane
>>>>>>> >
>>>>>>> > On Wed, May 17, 2017 at 9:50 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>>>>> >> i'm going to need to perform a quick reboot on the jenkins master.
>>>>>>> >> it looks like it's hung again.
>>>>>>> >>
>>>>>>> >> sorry about this!
>>>>>>> >>
>>>>>>> >> shane
>>>>>>> >>
>>>>>>> >> On Tue, May 16, 2017 at 12:55 PM, shane knapp <skn...@berkeley.edu> wrote:
>>>>>>> >>> ...but just now i started getting alerts on system load, which
>>>>>>> >>> was rather high. i had to kick jenkins again, and will keep an
>>>>>>> >>> eye on the master and a possible need to reboot.
>>>>>>> >>>
>>>>>>> >>> sorry about the interruption of service...
>>>>>>> >>>
>>>>>>> >>> shane
>>>>>>> >>>
>>>>>>> >>> On Tue, May 16, 2017 at 8:18 AM, shane knapp <skn...@berkeley.edu> wrote:
>>>>>>> >>>> ...so i kicked it and it's now back up and happily building.
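
for anyone who wants to apply the same fixes described in the thread above on
their own jenkins master, a rough sketch follows. only the -Xmx16g correction,
the move to java 8, the linked GC-tuning guide, and the H/5 cron spread are
confirmed above; the file path, variable name, exact GC flags, and log location
below are assumptions, not the actual amplab configuration.

    # /etc/sysconfig/jenkins -- path and variable name are assumptions; they
    # differ between installs (e.g. /etc/default/jenkins uses JAVA_ARGS).

    # before the fix: -Dmx16G only defined a system property named "mx16G",
    # so the JVM silently ran with its small default max heap.
    #JENKINS_JAVA_OPTIONS="-Dmx16G"

    # after the fix: a real 16 GB heap, plus G1 options along the lines of
    # the jenkins.io GC-tuning guide linked in the thread (illustrative
    # values, not the flags actually deployed on the amplab master):
    JENKINS_JAVA_OPTIONS="-Xmx16g -XX:+UseG1GC \
      -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled \
      -XX:+UseStringDeduplication \
      -Xloggc:/var/log/jenkins/gc.log -XX:+PrintGCDetails"

    # job trigger change ("Build periodically" cron spec in each job):
    #   old: */5 * * * *   -> every job fires at minutes 0, 5, 10, ...
    #   new: H/5 * * * *   -> jenkins hashes each job name into its own
    #                         5-minute slot, spreading the load

the H/5 change matters because jenkins evaluates */5 identically for every job,
so a fleet of jobs with the same spec all queue at once; H/5 keeps the same
frequency per job but staggers the start minutes.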