Re: H9 build slave is bad

2017-09-29 Thread Xiao Chen
Running into this again; filed https://issues.apache.org/jira/browse/INFRA-15194. Noticed some of the HDFS jobs succeeding even though the job itself failed and says 'see HADOOP-13591' at the end. My run wasn't that lucky, though. :( -Xiao On Fri, Mar 10, 2017 at 10:19 AM, Sean Busbey wrote: > All t

Re: H9 build slave is bad

2017-03-10 Thread Sean Busbey
All the precommit builds should be doing the correct thing now for making sure we don't render nodes useless. They don't flag the problem yet, and someone will still need to run the "cleanup" job on nodes broken before Jenkins runs pick up the new configuration changes. Probably best if we move to

Re: H9 build slave is bad

2017-03-09 Thread Allen Wittenauer
> On Mar 9, 2017, at 2:15 PM, Andrew Wang wrote: > > H9 is again eating our builds. > H0: https://builds.apache.org/job/PreCommit-HDFS-Build/18652/console H6: https://builds.apache.org/job/PreCommit-HDFS-Build/18646/console

Re: H9 build slave is bad

2017-03-09 Thread Andrew Wang
H9 is again eating our builds. I'm going to do the easy hack of removing it from HDFS precommit for now, pending HADOOP-13951 being resolved. On Thu, Mar 9, 2017 at 6:21 AM, Sean Busbey wrote: > On Wed, Mar 8, 2017 at 2:04 PM, Allen Wittenauer > wrote: > > > >> On Mar 8, 2017, at 9:34 AM, Sean

Re: H9 build slave is bad

2017-03-09 Thread Sean Busbey
On Wed, Mar 8, 2017 at 2:04 PM, Allen Wittenauer wrote: > >> On Mar 8, 2017, at 9:34 AM, Sean Busbey wrote: >> >> Is this HADOOP-13951? > > Almost certainly. Here's the run that broke it again: > > https://builds.apache.org/job/PreCommit-HDFS-Build/18591 > > Likely something in t

Re: H9 build slave is bad

2017-03-08 Thread Allen Wittenauer
> On Mar 8, 2017, at 2:53 PM, Anu Engineer wrote: > > Agreed, but I was under the impression that we would kill the container under > OOM conditions and not the whole base machine. We do not run our docker containers under a cgroup. --
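
For illustration only: a minimal sketch of how a container could be started with a memory cap, so that an out-of-memory condition is confined to the container rather than the host. It only builds the argument list; `--memory` and `--memory-swap` are standard `docker run` options, but the helper name, the image name, and the 4g limit are hypothetical and are not what the precommit jobs actually use.

```python
def docker_run_command(image, command, memory_limit="4g"):
    """Build a `docker run` argument list with a hard memory cap.

    Setting --memory-swap to the same value as --memory keeps the
    container from using swap beyond the limit.
    """
    return [
        "docker", "run", "--rm",
        "--memory", memory_limit,
        "--memory-swap", memory_limit,
        image,
    ] + list(command)


# Hypothetical image name and build command, for illustration only.
print(docker_run_command("hadoop-precommit:latest", ["mvn", "test"]))
```

The list can be handed to `subprocess.run` unchanged; whether a cap like this fits the build depends on how much memory the tests legitimately need.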

Re: H9 build slave is bad

2017-03-08 Thread Anu Engineer
Agreed, but I was under the impression that we would kill the container under OOM conditions and not the whole base machine. Thanks Anu On 3/8/17, 2:41 PM, "Allen Wittenauer" wrote: > >> On Mar 8, 2017, at 2:21 PM, Anu Engineer wrote: >> >> Hi Allen, >>> Likely something in the HDFS-724

Re: H9 build slave is bad

2017-03-08 Thread Allen Wittenauer
> On Mar 8, 2017, at 2:21 PM, Anu Engineer wrote: > > Hi Allen, >> Likely something in the HDFS-7240 branch or with this patch that's >> doing Bad Things (tm). > > Thanks for bringing this to my attention, but I am surprised that a mvn > command is able to kill a test machine. F

Re: H9 build slave is bad

2017-03-08 Thread Anu Engineer
Hi Allen, > Likely something in the HDFS-7240 branch or with this patch that's > doing Bad Things (tm). Thanks for bringing this to my attention, but I am surprised that a mvn command is able to kill a test machine. I have pasted the call stack from the issue that you pointed out to be th

Re: H9 build slave is bad

2017-03-08 Thread Allen Wittenauer
> On Mar 8, 2017, at 12:04 PM, Allen Wittenauer > wrote: > > >> On Mar 8, 2017, at 9:34 AM, Sean Busbey wrote: >> >> Is this HADOOP-13951? > > Almost certainly. Here's the run that broke it again: > > https://builds.apache.org/job/PreCommit-HDFS-Build/18591 > > Likely somethi

Re: H9 build slave is bad

2017-03-08 Thread Allen Wittenauer
> On Mar 8, 2017, at 9:34 AM, Sean Busbey wrote: > > Is this HADOOP-13951? Almost certainly. Here's the run that broke it again: https://builds.apache.org/job/PreCommit-HDFS-Build/18591 Likely something in the HDFS-7240 branch or with this patch that's doing Bad Things (tm).

Re: H9 build slave is bad

2017-03-08 Thread Sean Busbey
Is this HADOOP-13951? On Tue, Mar 7, 2017 at 8:32 PM, Andrew Wang wrote: > A little ping that H9 hit the same error again, and I'm again going to > clean it out. One more time and I'll ask infra about either removing or > reimaging this node. > > On Mon, Mar 6, 2017 at 2:12 PM, Allen Wittenauer

Re: H9 build slave is bad

2017-03-07 Thread Andrew Wang
A little ping that H9 hit the same error again, and I'm again going to clean it out. One more time and I'll ask infra about either removing or reimaging this node. On Mon, Mar 6, 2017 at 2:12 PM, Allen Wittenauer wrote: > > > On Mar 6, 2017, at 1:57 PM, Andrew Wang > wrote: > > > > I'll leave i

Re: H9 build slave is bad

2017-03-06 Thread Allen Wittenauer
> On Mar 6, 2017, at 1:57 PM, Andrew Wang wrote: > > I'll leave it there so it's ready for next time. If this keeps happening on > H9, then I'm going to ask infra to reimage it. FWIW I haven't seen this on > our internal unit test runs, so it points to an H9-specific issue. I’ve seen

Re: H9 build slave is bad

2017-03-06 Thread Andrew Wang
Thanks Allen. I wrote this little job that does what we want: https://builds.apache.org/view/H-L/view/Hadoop/job/hadoop-clean-h9/ The bad directory is some NN metadata dir, which could come from basically any minicluster test. I'll leave it there so it's ready for next time. If this keeps happen

Re: H9 build slave is bad

2017-03-06 Thread Andrew Wang
I also found this older JIRA I filed for H9. Either this box is suspect, or we have a disproportionate number of our builds running on it. https://issues.apache.org/jira/browse/INFRA-13234 On Mon, Mar 6, 2017 at 1:17 PM, Andrew Wang wrote: > Do you have a link to your old job somewhere? > > I'm

Re: H9 build slave is bad

2017-03-06 Thread Allen Wittenauer
> On Mar 6, 2017, at 1:17 PM, Andrew Wang wrote: > > Do you have a link to your old job somewhere? Nope, but it’s trivial to write: a single job that only runs on H9 and removes that other job’s workspace dir. You can also try using the “Wipe out current workspace” button. > I'm als
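
For illustration only: a minimal sketch of what such a cleanup step might do. It assumes the workspace cannot be removed because a test left directories behind with restrictive permissions; the thread does not confirm that cause. The function name `wipe_workspace` is hypothetical, and the real job may simply be a shell step.

```python
import os
import shutil
import stat


def wipe_workspace(path):
    """Remove a workspace tree, restoring owner permissions first.

    Returns True if the directory existed and was removed,
    False if there was nothing to remove.
    """
    if not os.path.isdir(path):
        return False
    # Give the owner rwx on the root so it can be listed and emptied.
    os.chmod(path, stat.S_IRWXU)
    # os.walk is top-down by default, so each subdirectory is made
    # accessible before the walk descends into it.
    for root, dirs, _files in os.walk(path):
        for name in dirs:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                os.chmod(full, stat.S_IRWXU)
    shutil.rmtree(path)
    return True
```

A job pinned to the affected node could call this on the other job's workspace directory; the path to pass in depends on the Jenkins configuration and is not given in the thread.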

Re: H9 build slave is bad

2017-03-06 Thread Andrew Wang
Do you have a link to your old job somewhere? I'm also wondering what causes this; does this issue surface in the same way each time? Also wondering, should we nuke the workspace before every run, for improved reliability? On Mon, Mar 6, 2017 at 1:08 PM, Allen Wittenauer wrote: > > > On Mar 6,

Re: H9 build slave is bad

2017-03-06 Thread Allen Wittenauer
> On Mar 6, 2017, at 11:27 AM, Andrew Wang wrote: > Looks like H9 is having problems cleaning the workspace, leading to a lot > of silent precommit failures. I filed this INFRA JIRA: > https://issues.apache.org/jira/browse/INFRA-13618 Have we tried writing a job that nukes the workspace on that

H9 build slave is bad

2017-03-06 Thread Andrew Wang
Hi folks, Looks like H9 is having problems cleaning the workspace, leading to a lot of silent precommit failures. I filed this INFRA JIRA: https://issues.apache.org/jira/browse/INFRA-13618 It's quite possible you'll have to retrigger pending precommit runs; the HDFS runs are pretty red. Best, An