well, it was -08, and ssh stopped working (according to the alerts) just as i was logging in to kill off any errant processes. i've taken that worker offline in jenkins and will be rebooting it asap.
on a positive note, i was able to clear out -07 before anything horrible happened to that one.

On Tue, Oct 20, 2015 at 3:46 PM, shane knapp <skn...@berkeley.edu> wrote:
> amp-jenkins-worker-06 is back up.
>
> my next bets are on -07 and -08... :\
>
> https://amplab.cs.berkeley.edu/jenkins/computer/
>
> On Tue, Oct 20, 2015 at 3:39 PM, shane knapp <skn...@berkeley.edu> wrote:
>> here's the related stack trace from dmesg... UID 500 is jenkins.
>>
>> Out of memory: Kill process 142764 (java) score 40 or sacrifice child
>> Killed process 142764, UID 500, (java) total-vm:24685036kB,
>> anon-rss:5730824kB, file-rss:64kB
>> Uhhuh. NMI received for unknown reason 21 on CPU 0.
>> Do you have a strange power saving mode enabled?
>> Dazed and confused, but trying to continue
>> java: page allocation failure. order:2, mode:0xd0
>> Pid: 142764, comm: java Not tainted 2.6.32-573.3.1.el6.x86_64 #1
>> Call Trace:
>> [<ffffffff8113770c>] ? __alloc_pages_nodemask+0x7dc/0x950
>> [<ffffffff81074fa8>] ? copy_process+0x168/0x1530
>> [<ffffffff810764c6>] ? do_fork+0x96/0x4c0
>> [<ffffffff810b828b>] ? sys_futex+0x7b/0x170
>> [<ffffffff81009598>] ? sys_clone+0x28/0x30
>> [<ffffffff8100b3f3>] ? stub_clone+0x13/0x20
>> [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
>>
>> On Tue, Oct 20, 2015 at 3:35 PM, shane knapp <skn...@berkeley.edu> wrote:
>>> -06 just kinda came back...
>>>
>>> [root@amp-jenkins-worker-06 ~]# uptime
>>> 15:29:07 up 26 days, 7:34, 2 users, load average: 1137.91, 1485.69,
>>> 1635.89
>>>
>>> the builds that, from looking at the process table, seem to be at
>>> fault are the Spark-Master-Maven-pre-yarn matrix builds, and possibly
>>> a Spark-Master-SBT matrix build. look at the build history here:
>>> https://amplab.cs.berkeley.edu/jenkins/computer/amp-jenkins-worker-06/builds
>>>
>>> the load is dropping significantly and quickly, but swap is borked and
>>> i'm still going to reboot.
>>>
>>> On Tue, Oct 20, 2015 at 3:24 PM, shane knapp <skn...@berkeley.edu> wrote:
>>>> starting this saturday (oct 17) we started getting alerts on the
>>>> jenkins workers that various processes were dying (specifically ssh).
>>>>
>>>> since then, we've had half of our workers OOM due to java processes
>>>> and have had now to reboot two of them (-05 and -06).
>>>>
>>>> if we look at the current machine that's wedged (amp-jenkins-worker-06),
>>>> we see:
>>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/3814/
>>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=2.0.0-mr1-cdh4.1.2,label=spark-test/4508/
>>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=1.2.1,label=spark-test/4508/
>>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3868/
>>>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Compile-Master-Maven-with-YARN/4510/
>>>>
>>>> have there been any changes to any of these builds that might have
>>>> caused this? anyone have any ideas?
>>>>
>>>> sadly, even though i saw that -06 was about to OOM and got a shell
>>>> opened before SSH died, my command prompt is completely unresponsive.
>>>> :(
>>>>
>>>> shane