working on it.  we'll have intermittent downtime for the next ~30 mins.

On Sun, May 21, 2017 at 12:01 PM, shane knapp <skn...@berkeley.edu> wrote:
> yeah.  i noticed that and restarted it a few minutes ago.  i'll have
> some time later this afternoon to take a closer look...   :\
>
> On Sun, May 21, 2017 at 9:08 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:
>> It had been looking good for a while. However, it seems to be going down slowly again...
>>
>> When I tried to view a console log (e.g.
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77149/consoleFull),
>> the server returned a "proxy error."
>>
>> Regards,
>> Kazuaki Ishizaki
>>
>>
>>
>> From:        shane knapp <skn...@berkeley.edu>
>> To:        Sean Owen <so...@cloudera.com>
>> Cc:        dev <dev@spark.apache.org>
>> Date:        2017/05/20 09:43
>> Subject:        Re: [build system] jenkins got itself wedged...
>> ________________________________
>>
>>
>>
>> last update of the week:
>>
>> things are looking great...  we're GCing happily and staying well
>> within our memory limits.
>>
>> i'm going to do one more restart after the two pull request builds
>> finish to re-enable backups, and call it a weekend.  :)
>>
>> shane
>>
>> On Fri, May 19, 2017 at 8:29 AM, shane knapp <skn...@berkeley.edu> wrote:
>>> this is hopefully my final email on the subject...   :)
>>>
>>> things seem to have settled down after my GC tuning, and system
>>> load/cpu usage/memory have been nice and flat all night.  i'll continue
>>> to keep an eye on things but it looks like we've weathered the worst
>>> part of the storm.
>>>
>>> On Thu, May 18, 2017 at 6:40 PM, shane knapp <skn...@berkeley.edu> wrote:
>>>> after needing another restart this afternoon, i did some homework and
>>>> aggressively twiddled some GC settings[1].  since then, things have
>>>> definitely smoothed out w/regards to memory and cpu usage spikes.
>>>>
>>>> i've attached a screenshot of slightly happier looking graphs.
>>>>
>>>> still keeping an eye on things, and hoping that i can go back to being
>>>> a lurker...  ;)
>>>>
>>>> shane
>>>>
>>>> 1 - https://jenkins.io/blog/2016/11/21/gc-tuning/
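>>>>
>>>> (for the curious: the guide's core advice is moving the master to the
>>>> G1 collector and taming explicit GCs.  roughly the flags below -- a
>>>> sketch of the guide's general recommendations, not necessarily the
>>>> exact settings used here, and JAVA_ARGS and the 16g heap are
>>>> illustrative assumptions:)

```shell
# hypothetical JVM options in the spirit of the cloudbees GC-tuning guide:
# G1 collector, concurrent handling of explicit System.gc() calls, and
# parallel reference processing.  the -XX flags are real hotspot (java 8)
# options; the variable name and heap size are illustrative, not the
# actual AMPLab jenkins config.
JAVA_ARGS="-Xmx16g -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled"
echo "$JAVA_ARGS"
```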
>>>>
>>>> On Thu, May 18, 2017 at 11:20 AM, shane knapp <skn...@berkeley.edu>
>>>> wrote:
>>>>> ok, more updates:
>>>>>
>>>>> 1) i audited all of the builds, and found that the spark-*-compile-*
>>>>> and spark-*-test-* jobs were set to the identical cron time trigger,
>>>>> so josh rosen and i updated them to run at H/5 (instead of */5).  load
>>>>> balancing ftw.
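>>>>>
>>>>> (a quick sketch of why H/5 helps: */5 fires every job at minutes
>>>>> 0, 5, 10, ..., so identically-scheduled jobs all start at once, while
>>>>> H/5 hashes the job name into a stable per-job offset.  the python
>>>>> below mimics that behavior; crc32 and the job names are illustrative,
>>>>> since jenkins uses its own hash internally:)

```python
import zlib

def trigger_minutes(job_name, period=5):
    """Sketch of jenkins' H/5 cron behavior: a stable hash of the job
    name picks an offset in [0, period), so each job still runs every
    `period` minutes, but different jobs start at different minutes.
    (crc32 is just for illustration; jenkins uses its own hash.)"""
    offset = zlib.crc32(job_name.encode()) % period
    return list(range(offset, 60, period))

# with */5 every job fires at minutes 0, 5, 10, ... simultaneously;
# with H/5 the per-job offsets differ, spreading the load:
for job in ("spark-master-compile-maven", "spark-master-test-sbt"):
    print(job, trigger_minutes(job)[:3])
```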
>>>>>
>>>>> 2) the jenkins master is now running on java8, which has moar bettar
>>>>> GC management under the hood.
>>>>>
>>>>> i'll be keeping an eye on this today, and if we start seeing GC
>>>>> overhead failures, i'll start doing more GC performance tuning.
>>>>> thankfully, cloudbees has a relatively decent guide that i'll be
>>>>> following here:  https://jenkins.io/blog/2016/11/21/gc-tuning/
>>>>>
>>>>> shane
>>>>>
>>>>> On Thu, May 18, 2017 at 8:39 AM, shane knapp <skn...@berkeley.edu>
>>>>> wrote:
>>>>>> yeah, i spoke too soon.  jenkins is still misbehaving, but FINALLY i'm
>>>>>> getting some error messages in the logs...   looks like jenkins is
>>>>>> thrashing on GC.
>>>>>>
>>>>>> now that i know what's up, i should be able to get this sorted today.
>>>>>>
>>>>>> On Thu, May 18, 2017 at 12:39 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>> I'm not sure if it's related, but I still can't get Jenkins to test
>>>>>>> PRs. For
>>>>>>> example, triggering it through the spark-prs.appspot.com UI gives
>>>>>>> me...
>>>>>>>
>>>>>>> https://spark-prs.appspot.com/trigger-jenkins/18012
>>>>>>>
>>>>>>> Internal Server Error
>>>>>>>
>>>>>>> That might be from the appspot app though?
>>>>>>>
>>>>>>> But posting "Jenkins test this please" on PRs doesn't seem to work,
>>>>>>> and I
>>>>>>> can't reach Jenkins:
>>>>>>>
>>>>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>>>>>>
>>>>>>> On Thu, May 18, 2017 at 12:44 AM shane knapp <skn...@berkeley.edu>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> after another couple of restarts due to high load and system
>>>>>>>> unresponsiveness, i finally found what is the most likely culprit:
>>>>>>>>
>>>>>>>> a typo in the jenkins config where the java heap size was configured.
>>>>>>>> instead of -Xmx16g, we had -Dmx16G...  which could easily explain the
>>>>>>>> random and non-deterministic system hangs we've had over the past
>>>>>>>> couple of years.
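>>>>>>>>
>>>>>>>> (why the typo bit so hard: -D just defines a java system property,
>>>>>>>> while -X flags tune the JVM itself.  so -Dmx16G merely created a
>>>>>>>> useless property named "mx16G" and the heap stayed at the JVM's
>>>>>>>> default.  a minimal sketch -- the variable name below is
>>>>>>>> illustrative, not the actual jenkins config entry:)

```shell
# -Xmx<size> caps the JVM heap; -D<name>[=<value>] only defines a system
# property visible to the application.  the variable name is an
# illustrative assumption, not the real AMPLab jenkins config.
JENKINS_JAVA_OPTIONS="-Xmx16g"    # correct: max heap capped at 16 GiB
# JENKINS_JAVA_OPTIONS="-Dmx16G"  # the typo: defines property "mx16G",
#                                 # leaving the heap at the JVM default
echo "$JENKINS_JAVA_OPTIONS"
```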
>>>>>>>>
>>>>>>>> anyways, it's been corrected and the master seems to be humming
>>>>>>>> along,
>>>>>>>> for real this time, w/o issue.  i'll continue to keep an eye on this
>>>>>>>> for the rest of the week, but things are looking MUCH better now.
>>>>>>>>
>>>>>>>> sorry again for the interruptions in service.
>>>>>>>>
>>>>>>>> shane
>>>>>>>>
>>>>>>>> On Wed, May 17, 2017 at 9:59 AM, shane knapp <skn...@berkeley.edu>
>>>>>>>> wrote:
>>>>>>>> > ok, we're back up, system load looks cromulent and we're happily
>>>>>>>> > building (again).
>>>>>>>> >
>>>>>>>> > shane
>>>>>>>> >
>>>>>>>> > On Wed, May 17, 2017 at 9:50 AM, shane knapp <skn...@berkeley.edu>
>>>>>>>> > wrote:
>>>>>>>> >> i'm going to need to perform a quick reboot on the jenkins master.
>>>>>>>> >> it
>>>>>>>> >> looks like it's hung again.
>>>>>>>> >>
>>>>>>>> >> sorry about this!
>>>>>>>> >>
>>>>>>>> >> shane
>>>>>>>> >>
>>>>>>>> >> On Tue, May 16, 2017 at 12:55 PM, shane knapp
>>>>>>>> >> <skn...@berkeley.edu>
>>>>>>>> >> wrote:
>>>>>>>> >>> ...but just now i started getting alerts on system load, which
>>>>>>>> >>> was
>>>>>>>> >>> rather high.  i had to kick jenkins again, and will keep an eye
>>>>>>>> >>> on the
>>>>>>>> >>> master, and may need to reboot it.
>>>>>>>> >>>
>>>>>>>> >>> sorry about the interruption of service...
>>>>>>>> >>>
>>>>>>>> >>> shane
>>>>>>>> >>>
>>>>>>>> >>> On Tue, May 16, 2017 at 8:18 AM, shane knapp
>>>>>>>> >>> <skn...@berkeley.edu>
>>>>>>>> >>> wrote:
>>>>>>>> >>>> ...so i kicked it and it's now back up and happily building.
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>>
>>>>>>>
>>
