Can someone point me to an example build that is broken?

On Mon, Mar 16, 2015 at 3:52 PM, Sean Busbey <bus...@cloudera.com> wrote:
> I'm on it. HADOOP-11721
>
> On Mon, Mar 16, 2015 at 3:44 PM, Haohui Mai <whe...@apache.org> wrote:
>
>> +1 for git clean.
>>
>> Colin, can you please get it in ASAP? Currently, due to the jenkins
>> issues, we cannot close the 2.7 blockers.
>>
>> Thanks,
>> Haohui
>>
>> On Mon, Mar 16, 2015 at 11:54 AM, Colin P. McCabe <cmcc...@apache.org> wrote:
>> > If all it takes is someone creating a test that makes a directory
>> > without -x, this is going to happen over and over.
>> >
>> > Let's just fix the problem at the root by running "git clean -fqdx" in
>> > our jenkins scripts. If there are no objections, I will add this in and
>> > un-break the builds.
>> >
>> > best,
>> > Colin
>> >
>> > On Fri, Mar 13, 2015 at 1:48 PM, Lei Xu <l...@cloudera.com> wrote:
>> >> I filed HDFS-7917 to change the way we simulate disk failures.
>> >>
>> >> But I think we still need the infrastructure folks to help with the
>> >> jenkins scripts to clean up the dirs left behind today.
>> >>
>> >> On Fri, Mar 13, 2015 at 1:38 PM, Mai Haohui <ricet...@gmail.com> wrote:
>> >>> Any updates on this issue? It seems that all HDFS jenkins builds are
>> >>> still failing.
>> >>>
>> >>> Regards,
>> >>> Haohui
>> >>>
>> >>> On Thu, Mar 12, 2015 at 12:53 AM, Vinayakumar B <vinayakum...@apache.org> wrote:
>> >>>> I think the problem started from here:
>> >>>>
>> >>>> https://builds.apache.org/job/PreCommit-HDFS-Build/9828/testReport/junit/org.apache.hadoop.hdfs.server.datanode/TestDataNodeVolumeFailure/testUnderReplicationAfterVolFailure/
>> >>>>
>> >>>> As Chris mentioned, TestDataNodeVolumeFailure changes the permissions.
>> >>>> In this run, ReplicationMonitor hit an NPE and received a terminate
>> >>>> signal, due to which MiniDFSCluster.shutdown() threw an exception.
>> >>>>
>> >>>> But TestDataNodeVolumeFailure#tearDown() only restores those
>> >>>> permissions after shutting down the cluster. So in this case, IMO,
>> >>>> the permissions were never restored:
>> >>>>
>> >>>>   @After
>> >>>>   public void tearDown() throws Exception {
>> >>>>     if(data_fail != null) {
>> >>>>       FileUtil.setWritable(data_fail, true);
>> >>>>     }
>> >>>>     if(failedDir != null) {
>> >>>>       FileUtil.setWritable(failedDir, true);
>> >>>>     }
>> >>>>     if(cluster != null) {
>> >>>>       cluster.shutdown();
>> >>>>     }
>> >>>>     for (int i = 0; i < 3; i++) {
>> >>>>       FileUtil.setExecutable(new File(dataDir, "data"+(2*i+1)), true);
>> >>>>       FileUtil.setExecutable(new File(dataDir, "data"+(2*i+2)), true);
>> >>>>     }
>> >>>>   }
>> >>>>
>> >>>> Regards,
>> >>>> Vinay
>> >>>>
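A minimal sketch of the reordering Vinay is describing, for illustration only
(not a committed patch): move the permission restoration into a finally block
so that it runs even when cluster.shutdown() throws. The field names
(data_fail, failedDir, cluster, dataDir) are taken from the quoted test.

  @After
  public void tearDown() throws Exception {
    try {
      if (data_fail != null) {
        FileUtil.setWritable(data_fail, true);
      }
      if (failedDir != null) {
        FileUtil.setWritable(failedDir, true);
      }
      if (cluster != null) {
        cluster.shutdown();
      }
    } finally {
      // Runs even if shutdown() throws (e.g. after the ReplicationMonitor
      // NPE above), so the data dirs never keep broken permissions across
      // builds.
      for (int i = 0; i < 3; i++) {
        FileUtil.setExecutable(new File(dataDir, "data" + (2 * i + 1)), true);
        FileUtil.setExecutable(new File(dataDir, "data" + (2 * i + 2)), true);
      }
    }
  }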
>> >>>> On Thu, Mar 12, 2015 at 12:35 PM, Vinayakumar B <vinayakum...@apache.org> wrote:
>> >>>>
>> >>>>> When I look at the history of these kinds of builds, all of them
>> >>>>> failed on node H9.
>> >>>>>
>> >>>>> I think some uncommitted patch created the problem and left it
>> >>>>> there.
>> >>>>>
>> >>>>> Regards,
>> >>>>> Vinay
>> >>>>>
>> >>>>> On Thu, Mar 12, 2015 at 6:16 AM, Sean Busbey <bus...@cloudera.com> wrote:
>> >>>>>
>> >>>>>> You could rely on a destructive git clean call instead of maven to
>> >>>>>> do the directory removal.
>> >>>>>>
>> >>>>>> --
>> >>>>>> Sean
>> >>>>>>
>> >>>>>> On Mar 11, 2015 4:11 PM, "Colin McCabe" <cmcc...@alumni.cmu.edu> wrote:
>> >>>>>>
>> >>>>>> > Is there a maven plugin or setting we can use to simply remove
>> >>>>>> > directories that have no executable permissions on them? Clearly
>> >>>>>> > we have the permission to do this from a technical point of view
>> >>>>>> > (since we created the directories as the jenkins user); it's
>> >>>>>> > simply that the code refuses to do it.
>> >>>>>> >
>> >>>>>> > Otherwise I guess we can just fix those tests...
>> >>>>>> >
>> >>>>>> > Colin
>> >>>>>> >
>> >>>>>> > On Tue, Mar 10, 2015 at 2:43 PM, Lei Xu <l...@cloudera.com> wrote:
>> >>>>>> > > Thanks a lot for looking into HDFS-7722, Chris.
>> >>>>>> > >
>> >>>>>> > > In HDFS-7722:
>> >>>>>> > > The TestDataNodeVolumeFailureXXX tests reset the data dir
>> >>>>>> > > permissions in tearDown().
>> >>>>>> > > TestDataNodeHotSwapVolumes resets permissions in a finally clause.
>> >>>>>> > >
>> >>>>>> > > Also, I ran mvn test several times on my machine and all tests
>> >>>>>> > > passed.
>> >>>>>> > >
>> >>>>>> > > However, since DiskChecker#checkDirAccess() treats anything that
>> >>>>>> > > is not a directory as a disk error:
>> >>>>>> > >
>> >>>>>> > >   private static void checkDirAccess(File dir) throws DiskErrorException {
>> >>>>>> > >     if (!dir.isDirectory()) {
>> >>>>>> > >       throw new DiskErrorException("Not a directory: "
>> >>>>>> > >           + dir.toString());
>> >>>>>> > >     }
>> >>>>>> > >
>> >>>>>> > >     checkAccessByFileMethods(dir);
>> >>>>>> > >   }
>> >>>>>> > >
>> >>>>>> > > one potentially safer alternative is replacing the data dir with
>> >>>>>> > > a regular file to simulate disk failures.
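A rough sketch of that alternative, purely for illustration (the helper names
are hypothetical; HDFS-7917 tracks the real change): swapping the directory
for a plain file makes checkDirAccess() fail its isDirectory() check, and no
odd permission bits are left behind for the next build to trip over.

  // Simulate a disk failure without touching permission bits: replace the
  // data directory with a regular file, which DiskChecker rejects.
  static void simulateDiskFailure(File dataDir) throws IOException {
    if (!FileUtil.fullyDelete(dataDir)) {
      throw new IOException("Could not delete " + dataDir);
    }
    if (!dataDir.createNewFile()) {
      throw new IOException("Could not create placeholder " + dataDir);
    }
  }

  // Undo the simulated failure in tearDown(): remove the placeholder file
  // and recreate the empty data directory.
  static void restoreFailedDisk(File dataDir) throws IOException {
    if (!dataDir.delete() || !dataDir.mkdirs()) {
      throw new IOException("Could not restore " + dataDir);
    }
  }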
>> >>>>>> > > On Tue, Mar 10, 2015 at 2:19 PM, Chris Nauroth <cnaur...@hortonworks.com> wrote:
>> >>>>>> > >> TestDataNodeHotSwapVolumes, TestDataNodeVolumeFailure,
>> >>>>>> > >> TestDataNodeVolumeFailureReporting, and
>> >>>>>> > >> TestDataNodeVolumeFailureToleration all remove executable
>> >>>>>> > >> permissions from directories like the one Colin mentioned to
>> >>>>>> > >> simulate disk failures at data nodes. I reviewed the code for
>> >>>>>> > >> all of those, and they all appear to be doing the necessary work
>> >>>>>> > >> to restore executable permissions at the end of the test. The
>> >>>>>> > >> only recent uncommitted patch I've seen that makes changes in
>> >>>>>> > >> these test suites is HDFS-7722, and that patch still looks fine.
>> >>>>>> > >> I don't know if there are other uncommitted patches that changed
>> >>>>>> > >> these test suites.
>> >>>>>> > >>
>> >>>>>> > >> I suppose it's also possible that the JUnit process unexpectedly
>> >>>>>> > >> died after removing executable permissions but before restoring
>> >>>>>> > >> them. That would always have been a weakness of these test
>> >>>>>> > >> suites, regardless of any recent changes.
>> >>>>>> > >>
>> >>>>>> > >> Chris Nauroth
>> >>>>>> > >> Hortonworks
>> >>>>>> > >> http://hortonworks.com/
>> >>>>>> > >>
>> >>>>>> > >> On 3/10/15, 1:47 PM, "Aaron T. Myers" <a...@cloudera.com> wrote:
>> >>>>>> > >>
>> >>>>>> > >>> Hey Colin,
>> >>>>>> > >>>
>> >>>>>> > >>> I asked Andrew Bayer, who works with Apache Infra, what's going
>> >>>>>> > >>> on with these boxes. He took a look and concluded that some
>> >>>>>> > >>> perms are being set in those directories by our unit tests
>> >>>>>> > >>> which are precluding those files from getting deleted. He's
>> >>>>>> > >>> going to clean up the boxes for us, but we should expect this
>> >>>>>> > >>> to keep happening until we can fix the test in question to
>> >>>>>> > >>> properly clean up after itself.
>> >>>>>> > >>>
>> >>>>>> > >>> To help narrow down which commit it was that started this,
>> >>>>>> > >>> Andrew sent me this info:
>> >>>>>> > >>>
>> >>>>>> > >>> "/home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3/
>> >>>>>> > >>> has 500 perms, so I'm guessing that's the problem. Been that
>> >>>>>> > >>> way since 9:32 UTC on March 5th."
>> >>>>>> > >>>
>> >>>>>> > >>> --
>> >>>>>> > >>> Aaron T. Myers
>> >>>>>> > >>> Software Engineer, Cloudera
>> >>>>>> > >>>
>> >>>>>> > >>> On Tue, Mar 10, 2015 at 1:24 PM, Colin P. McCabe <cmcc...@apache.org> wrote:
>> >>>>>> > >>>
>> >>>>>> > >>>> Hi all,
>> >>>>>> > >>>>
>> >>>>>> > >>>> A very quick (and not thorough) survey shows that I can't find
>> >>>>>> > >>>> any jenkins jobs that succeeded in the last 24 hours. Most of
>> >>>>>> > >>>> them seem to be failing with some variant of this message:
>> >>>>>> > >>>>
>> >>>>>> > >>>> [ERROR] Failed to execute goal
>> >>>>>> > >>>> org.apache.maven.plugins:maven-clean-plugin:2.5:clean (default-clean)
>> >>>>>> > >>>> on project hadoop-hdfs: Failed to clean project: Failed to delete
>> >>>>>> > >>>> /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3
>> >>>>>> > >>>> -> [Help 1]
>> >>>>>> > >>>>
>> >>>>>> > >>>> Any ideas how this happened? Bad disk, unit test setting wrong
>> >>>>>> > >>>> permissions?
>> >>>>>> > >>>>
>> >>>>>> > >>>> Colin
>> >>>>>> > >
>> >>>>>> > > --
>> >>>>>> > > Lei (Eddy) Xu
>> >>>>>> > > Software Engineer, Cloudera
>> >>
>> >> --
>> >> Lei (Eddy) Xu
>> >> Software Engineer, Cloudera
>
> --
> Sean

--
Sean
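Going back to Colin's question above: since the jenkins user owns these
directories, a cleanup step can always restore the missing permission bits
before deleting. A hypothetical plain-Java sketch, for illustration only (not
an existing maven plugin or Hadoop helper):

  // Recursively delete a tree, first restoring the owner's permission bits
  // on each directory so that traversal and deletion cannot be blocked by a
  // test that left a directory at mode 500.
  static void forceDelete(File f) {
    if (f.isDirectory()) {
      f.setExecutable(true);  // re-allow traversal into the directory
      f.setReadable(true);    // re-allow listing its children
      f.setWritable(true);    // re-allow deleting entries inside it
      File[] children = f.listFiles();
      if (children != null) {
        for (File child : children) {
          forceDelete(child);
        }
      }
    }
    f.delete();
  }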