Yeah, that's what I figured -- those workers are under load. Thanks.

On Fri, Jul 10, 2020 at 12:43 PM shane knapp ☠ <skn...@berkeley.edu> wrote:
> only 125561, 125562 and 125564 were impacted by -9.
>
> 125565 exited w/a code of 143, i.e. signal 15 (143 - 128, SIGTERM), which
> means the process was terminated for unknown reasons.
>
> 125563 looks like mima failed due to a bunch of errors.
>
> i just spot checked a bunch of recent failed PRB builds from today and
> they all seemed to be legit.
>
> another thing that might be happening is an overload of PRB builds on the
> workers due to the backlog... the workers are under a LOT of load right
> now, and i can put some rate limiting in to see if that helps out.
>
> shane
>
> On Fri, Jul 10, 2020 at 11:31 AM Frank Yin <ukby.1...@gmail.com> wrote:
>
>> Like from build number 125565 to 125561, all impacted by kill -9.
>>
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125564/console
>>
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/console
>>
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125562/console
>>
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125561/console
>>
>> On Fri, Jul 10, 2020 at 9:35 AM shane knapp ☠ <skn...@berkeley.edu>
>> wrote:
>>
>>> define "a lot" and provide some links to those builds, please. there
>>> are roughly 2000 builds per day, and i can't do more than keep a cursory
>>> eye on things.
>>>
>>> the infrastructure that the tests run on hasn't changed one bit on any
>>> of the workers, and 'kill -9' could be a timeout, flakiness caused by old
>>> build processes remaining on the workers after the master went down, or me
>>> trying to clean things up w/o a reboot. or, perhaps, something wrong w/the
>>> infra. :)
>>>
>>> On Fri, Jul 10, 2020 at 9:28 AM Frank Yin <ukby.1...@gmail.com> wrote:
>>>
>>>> Agreed, but I've seen a lot of kills by signal 9. I'm assuming that's
>>>> infrastructure?
>>>>
>>>> On Fri, Jul 10, 2020 at 8:19 AM shane knapp ☠ <skn...@berkeley.edu>
>>>> wrote:
>>>>
>>>>> yeah, i can't do much for flaky tests... just flaky infrastructure.
>>>>>
>>>>>
>>>>> On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon <gurwls...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> A couple of flaky tests can happen. That's normal. It seems to have
>>>>>> gotten better now, at least. I will keep monitoring the builds.
>>>>>>
>>>>>> On Fri, Jul 10, 2020 at 4:33 PM, ukby1234 <ukby.1...@gmail.com> wrote:
>>>>>>
>>>>>>> Looks like Jenkins still isn't stable. My PR failed two times in a
>>>>>>> row:
>>>>>>>
>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>>>>>>
>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Sent from:
>>>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Shane Knapp
>>>>> Computer Guy / Voice of Reason
>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>> https://rise.cs.berkeley.edu
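
For anyone reading the thread later: the "143 - 128" arithmetic above follows the common shell convention that a process killed by signal N is reported with exit status 128 + N. A minimal sketch of that decoding, assuming a POSIX system and assuming the status really came from a signal (a program can also exit with such a code on its own). The function name `describe_exit_status` is made up for illustration and is not part of Jenkins or the Spark build scripts.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status using the 128 + N convention.

    Hypothetical helper for illustration only. A status above 128 is
    treated as "killed by signal (status - 128)".
    """
    if status > 128:
        signum = status - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with code {status}"


if __name__ == "__main__":
    # 143 is the status reported for build 125565; 137 corresponds to 'kill -9'.
    for status in (143, 137, 1):
        print(status, "->", describe_exit_status(status))
```

So a build that ends with 137 was hit by SIGKILL (the "-9" case), while 143 means SIGTERM, which is what a normal `kill` or a job abort typically sends.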