If all it takes is someone creating a test that makes a directory without -x, this is going to happen over and over.
Let's just fix the problem at the root by running "git clean -fqdx" in
our jenkins scripts. If there are no objections I will add this in and
un-break the builds.

best,
Colin

On Fri, Mar 13, 2015 at 1:48 PM, Lei Xu <l...@cloudera.com> wrote:
> I filed HDFS-7917 to change the way to simulate disk failures.
>
> But I think we still need infrastructure folks to help with jenkins
> scripts to clean the dirs left today.
>
> On Fri, Mar 13, 2015 at 1:38 PM, Mai Haohui <ricet...@gmail.com> wrote:
>> Any updates on this issue? It seems that all HDFS jenkins builds are
>> still failing.
>>
>> Regards,
>> Haohui
>>
>> On Thu, Mar 12, 2015 at 12:53 AM, Vinayakumar B <vinayakum...@apache.org>
>> wrote:
>>> I think the problem started from here.
>>>
>>> https://builds.apache.org/job/PreCommit-HDFS-Build/9828/testReport/junit/org.apache.hadoop.hdfs.server.datanode/TestDataNodeVolumeFailure/testUnderReplicationAfterVolFailure/
>>>
>>> As Chris mentioned, TestDataNodeVolumeFailure is changing the permissions.
>>> But in this patch, ReplicationMonitor got an NPE and a terminate signal,
>>> due to which MiniDFSCluster.shutdown() threw an exception.
>>>
>>> But TestDataNodeVolumeFailure#tearDown() restores those permissions
>>> only after shutting down the cluster, so in this case, IMO, the
>>> permissions were never restored.
>>>
>>>   @After
>>>   public void tearDown() throws Exception {
>>>     if (data_fail != null) {
>>>       FileUtil.setWritable(data_fail, true);
>>>     }
>>>     if (failedDir != null) {
>>>       FileUtil.setWritable(failedDir, true);
>>>     }
>>>     if (cluster != null) {
>>>       cluster.shutdown();
>>>     }
>>>     for (int i = 0; i < 3; i++) {
>>>       FileUtil.setExecutable(new File(dataDir, "data"+(2*i+1)), true);
>>>       FileUtil.setExecutable(new File(dataDir, "data"+(2*i+2)), true);
>>>     }
>>>   }
>>>
>>> Regards,
>>> Vinay
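For reference, here is a minimal sketch of the reordering Vinay is
describing: restore the permissions in a finally block so that a failed
MiniDFSCluster.shutdown() cannot skip the restore. The field names are
taken from the quoted test and it assumes that test's existing imports
(org.junit.After, java.io.File, org.apache.hadoop.fs.FileUtil); the
actual fix may look different.

  @After
  public void tearDown() throws Exception {
    try {
      if (cluster != null) {
        cluster.shutdown();
      }
    } finally {
      // Runs even if shutdown() throws, so the next build can still
      // delete target/test/data/dfs/data/data*.
      if (data_fail != null) {
        FileUtil.setWritable(data_fail, true);
      }
      if (failedDir != null) {
        FileUtil.setWritable(failedDir, true);
      }
      for (int i = 0; i < 3; i++) {
        FileUtil.setExecutable(new File(dataDir, "data" + (2 * i + 1)), true);
        FileUtil.setExecutable(new File(dataDir, "data" + (2 * i + 2)), true);
      }
    }
  }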
>>> On Thu, Mar 12, 2015 at 12:35 PM, Vinayakumar B <vinayakum...@apache.org>
>>> wrote:
>>>
>>>> When I look at the history of these kinds of builds, all of them
>>>> failed on node H9.
>>>>
>>>> I think some uncommitted patch or other would have created the
>>>> problem and left it there.
>>>>
>>>> Regards,
>>>> Vinay
>>>>
>>>> On Thu, Mar 12, 2015 at 6:16 AM, Sean Busbey <bus...@cloudera.com> wrote:
>>>>
>>>>> You could rely on a destructive git clean call instead of maven to
>>>>> do the directory removal.
>>>>>
>>>>> --
>>>>> Sean
>>>>>
>>>>> On Mar 11, 2015 4:11 PM, "Colin McCabe" <cmcc...@alumni.cmu.edu> wrote:
>>>>>
>>>>> > Is there a maven plugin or setting we can use to simply remove
>>>>> > directories that have no executable permissions on them? Clearly we
>>>>> > have the permission to do this from a technical point of view (since
>>>>> > we created the directories as the jenkins user), it's simply that the
>>>>> > code refuses to do it.
>>>>> >
>>>>> > Otherwise I guess we can just fix those tests...
>>>>> >
>>>>> > Colin
>>>>> >
>>>>> > On Tue, Mar 10, 2015 at 2:43 PM, Lei Xu <l...@cloudera.com> wrote:
>>>>> > > Thanks a lot for looking into HDFS-7722, Chris.
>>>>> > >
>>>>> > > In HDFS-7722:
>>>>> > > The TestDataNodeVolumeFailureXXX tests reset data dir permissions
>>>>> > > in tearDown().
>>>>> > > TestDataNodeHotSwapVolumes resets permissions in a finally clause.
>>>>> > >
>>>>> > > Also, I ran mvn test several times on my machine and all tests
>>>>> > > passed.
>>>>> > >
>>>>> > > However, since DiskChecker#checkDirAccess() is implemented like this:
>>>>> > >
>>>>> > >   private static void checkDirAccess(File dir) throws DiskErrorException {
>>>>> > >     if (!dir.isDirectory()) {
>>>>> > >       throw new DiskErrorException("Not a directory: " + dir.toString());
>>>>> > >     }
>>>>> > >
>>>>> > >     checkAccessByFileMethods(dir);
>>>>> > >   }
>>>>> > >
>>>>> > > one potentially safer alternative is replacing the data dir with a
>>>>> > > regular file to simulate disk failures.
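A rough illustration of the alternative Eddy is suggesting: swap the
volume directory for a regular file so that DiskChecker#checkDirAccess()
fails with "Not a directory", leaving no permission bits for a crashed
test to forget to restore. This is not the HDFS-7917 patch; the helper
names are invented for illustration and the usual java.io.File,
java.io.IOException and org.junit.Assert imports are assumed.

  private File simulateVolumeFailure(File volumeDir) throws IOException {
    // Move the real volume directory aside and put a plain file in its
    // place; DiskChecker will now report the volume as failed.
    File saved = new File(volumeDir.getParent(), volumeDir.getName() + ".orig");
    Assert.assertTrue(volumeDir.renameTo(saved));
    Assert.assertTrue(volumeDir.createNewFile());
    return saved;
  }

  private void restoreVolume(File volumeDir, File saved) {
    // Delete the stand-in file and move the real directory back. Even if
    // this never runs, "mvn clean" can still delete a plain file.
    Assert.assertTrue(volumeDir.delete());
    Assert.assertTrue(saved.renameTo(volumeDir));
  }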
>>>>> > > On Tue, Mar 10, 2015 at 2:19 PM, Chris Nauroth <cnaur...@hortonworks.com>
>>>>> > > wrote:
>>>>> > >> TestDataNodeHotSwapVolumes, TestDataNodeVolumeFailure,
>>>>> > >> TestDataNodeVolumeFailureReporting, and
>>>>> > >> TestDataNodeVolumeFailureToleration all remove executable
>>>>> > >> permissions from directories like the one Colin mentioned to
>>>>> > >> simulate disk failures at data nodes. I reviewed the code for all
>>>>> > >> of those, and they all appear to be doing the necessary work to
>>>>> > >> restore executable permissions at the end of the test. The only
>>>>> > >> recent uncommitted patch I've seen that makes changes in these
>>>>> > >> test suites is HDFS-7722. That patch still looks fine though. I
>>>>> > >> don't know if there are other uncommitted patches that changed
>>>>> > >> these test suites.
>>>>> > >>
>>>>> > >> I suppose it's also possible that the JUnit process unexpectedly
>>>>> > >> died after removing executable permissions but before restoring
>>>>> > >> them. That always would have been a weakness of these test
>>>>> > >> suites, regardless of any recent changes.
>>>>> > >>
>>>>> > >> Chris Nauroth
>>>>> > >> Hortonworks
>>>>> > >> http://hortonworks.com/
>>>>> > >>
>>>>> > >> On 3/10/15, 1:47 PM, "Aaron T. Myers" <a...@cloudera.com> wrote:
>>>>> > >>
>>>>> > >>> Hey Colin,
>>>>> > >>>
>>>>> > >>> I asked Andrew Bayer, who works with Apache Infra, what's going
>>>>> > >>> on with these boxes. He took a look and concluded that some perms
>>>>> > >>> are being set in those directories by our unit tests which are
>>>>> > >>> precluding those files from getting deleted. He's going to clean
>>>>> > >>> up the boxes for us, but we should expect this to keep happening
>>>>> > >>> until we can fix the test in question to properly clean up after
>>>>> > >>> itself.
>>>>> > >>>
>>>>> > >>> To help narrow down which commit it was that started this, Andrew
>>>>> > >>> sent me this info:
>>>>> > >>>
>>>>> > >>> "/home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3/
>>>>> > >>> has 500 perms, so I'm guessing that's the problem. Been that way
>>>>> > >>> since 9:32 UTC on March 5th."
>>>>> > >>>
>>>>> > >>> --
>>>>> > >>> Aaron T. Myers
>>>>> > >>> Software Engineer, Cloudera
>>>>> > >>>
>>>>> > >>> On Tue, Mar 10, 2015 at 1:24 PM, Colin P. McCabe <cmcc...@apache.org>
>>>>> > >>> wrote:
>>>>> > >>>
>>>>> > >>>> Hi all,
>>>>> > >>>>
>>>>> > >>>> A very quick (and not thorough) survey shows that I can't find
>>>>> > >>>> any jenkins jobs that succeeded in the last 24 hours. Most of
>>>>> > >>>> them seem to be failing with some variant of this message:
>>>>> > >>>>
>>>>> > >>>> [ERROR] Failed to execute goal
>>>>> > >>>> org.apache.maven.plugins:maven-clean-plugin:2.5:clean (default-clean)
>>>>> > >>>> on project hadoop-hdfs: Failed to clean project: Failed to delete
>>>>> > >>>> /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3
>>>>> > >>>> -> [Help 1]
>>>>> > >>>>
>>>>> > >>>> Any ideas how this happened? Bad disk, unit test setting wrong
>>>>> > >>>> permissions?
>>>>> > >>>>
>>>>> > >>>> Colin
>>>>> > >>
>>>>> > >
>>>>> > > --
>>>>> > > Lei (Eddy) Xu
>>>>> > > Software Engineer, Cloudera
>
> --
> Lei (Eddy) Xu
> Software Engineer, Cloudera