If all it takes is someone writing a test that creates a directory without the
executable bit (-x), this is going to happen over and over.

Let's just fix the problem at the root by running "git clean -fqdx" in
our jenkins scripts.  If there are no objections, I will add this in and
un-break the builds.

best,
Colin

On Fri, Mar 13, 2015 at 1:48 PM, Lei Xu <l...@cloudera.com> wrote:
> I filed HDFS-7917 to change the way we simulate disk failures.
>
> But I think we still need the infrastructure folks to help with the jenkins
> scripts to clean up the dirs left behind today.
>
> On Fri, Mar 13, 2015 at 1:38 PM, Mai Haohui <ricet...@gmail.com> wrote:
>> Any updates on this issue? It seems that all HDFS jenkins builds are
>> still failing.
>>
>> Regards,
>> Haohui
>>
>> On Thu, Mar 12, 2015 at 12:53 AM, Vinayakumar B <vinayakum...@apache.org> 
>> wrote:
>>> I think the problem started from here.
>>>
>>> https://builds.apache.org/job/PreCommit-HDFS-Build/9828/testReport/junit/org.apache.hadoop.hdfs.server.datanode/TestDataNodeVolumeFailure/testUnderReplicationAfterVolFailure/
>>>
>>> As Chris mentioned, TestDataNodeVolumeFailure changes the permissions.
>>> But with this patch, ReplicationMonitor got an NPE and a terminate signal,
>>> due to which MiniDFSCluster.shutdown() threw an exception.
>>>
>>> But TestDataNodeVolumeFailure#tearDown() restores those permissions after
>>> shutting down the cluster. So in this case, IMO, the permissions were never
>>> restored.
>>>
>>>
>>>   @After
>>>   public void tearDown() throws Exception {
>>>     if(data_fail != null) {
>>>       FileUtil.setWritable(data_fail, true);
>>>     }
>>>     if(failedDir != null) {
>>>       FileUtil.setWritable(failedDir, true);
>>>     }
>>>     if(cluster != null) {
>>>       cluster.shutdown();
>>>     }
>>>     for (int i = 0; i < 3; i++) {
>>>       FileUtil.setExecutable(new File(dataDir, "data"+(2*i+1)), true);
>>>       FileUtil.setExecutable(new File(dataDir, "data"+(2*i+2)), true);
>>>     }
>>>   }
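>>>
>>> Just as an illustration (a rough sketch using the fields above, not a
>>> tested patch): restoring the permissions first and keeping the shutdown in
>>> a finally block would mean an exception from either step can no longer
>>> leave the data dirs without the executable bit:
>>>
>>>   @After
>>>   public void tearDown() throws Exception {
>>>     try {
>>>       // Restore permissions first, so a failing shutdown cannot skip this.
>>>       if (data_fail != null) {
>>>         FileUtil.setWritable(data_fail, true);
>>>       }
>>>       if (failedDir != null) {
>>>         FileUtil.setWritable(failedDir, true);
>>>       }
>>>       for (int i = 0; i < 3; i++) {
>>>         FileUtil.setExecutable(new File(dataDir, "data" + (2 * i + 1)), true);
>>>         FileUtil.setExecutable(new File(dataDir, "data" + (2 * i + 2)), true);
>>>       }
>>>     } finally {
>>>       // Shut down the cluster even if restoring a permission above failed.
>>>       if (cluster != null) {
>>>         cluster.shutdown();
>>>       }
>>>     }
>>>   }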
>>>
>>>
>>> Regards,
>>> Vinay
>>>
>>> On Thu, Mar 12, 2015 at 12:35 PM, Vinayakumar B <vinayakum...@apache.org>
>>> wrote:
>>>
>>>> Looking at the history of these kinds of builds, all of them failed on
>>>> node H9.
>>>>
>>>> I think some uncommitted patch or other created the problem and left it
>>>> there.
>>>>
>>>>
>>>> Regards,
>>>> Vinay
>>>>
>>>> On Thu, Mar 12, 2015 at 6:16 AM, Sean Busbey <bus...@cloudera.com> wrote:
>>>>
>>>>> You could rely on a destructive git clean call instead of maven to do the
>>>>> directory removal.
>>>>>
>>>>> --
>>>>> Sean
>>>>> On Mar 11, 2015 4:11 PM, "Colin McCabe" <cmcc...@alumni.cmu.edu> wrote:
>>>>>
>>>>> > Is there a maven plugin or setting we can use to simply remove
>>>>> > directories that have no executable permissions on them?  Clearly we
>>>>> > have the permission to do this from a technical point of view (since
>>>>> > we created the directories as the jenkins user); it's simply that the
>>>>> > code refuses to do it.
>>>>> >
>>>>> > Otherwise I guess we can just fix those tests...
>>>>> >
>>>>> > Colin
>>>>> >
>>>>> > On Tue, Mar 10, 2015 at 2:43 PM, Lei Xu <l...@cloudera.com> wrote:
>>>>> > > Thanks a lot for looking into HDFS-7722, Chris.
>>>>> > >
>>>>> > > In HDFS-7722:
>>>>> > > TestDataNodeVolumeFailureXXX tests reset data dir permissions in tearDown().
>>>>> > > TestDataNodeHotSwapVolumes resets permissions in a finally clause.
>>>>> > >
>>>>> > > Also I ran mvn test several times on my machine and all tests passed.
>>>>> > >
>>>>> > > However, since DiskChecker#checkDirAccess() does the following:
>>>>> > >
>>>>> > > private static void checkDirAccess(File dir) throws DiskErrorException {
>>>>> > >   if (!dir.isDirectory()) {
>>>>> > >     throw new DiskErrorException("Not a directory: "
>>>>> > >                                  + dir.toString());
>>>>> > >   }
>>>>> > >
>>>>> > >   checkAccessByFileMethods(dir);
>>>>> > > }
>>>>> > >
>>>>> > > one potentially safer alternative is replacing the data dir with a
>>>>> > > regular file to simulate disk failures.
>>>>> > >
>>>>> > > On Tue, Mar 10, 2015 at 2:19 PM, Chris Nauroth <cnaur...@hortonworks.com> wrote:
>>>>> > >> TestDataNodeHotSwapVolumes, TestDataNodeVolumeFailure,
>>>>> > >> TestDataNodeVolumeFailureReporting, and
>>>>> > >> TestDataNodeVolumeFailureToleration all remove executable permissions
>>>>> > >> from directories like the one Colin mentioned to simulate disk
>>>>> > >> failures at data nodes.  I reviewed the code for all of those, and
>>>>> > >> they all appear to be doing the necessary work to restore executable
>>>>> > >> permissions at the end of the test.  The only recent uncommitted
>>>>> > >> patch I've seen that makes changes in these test suites is HDFS-7722.
>>>>> > >> That patch still looks fine though.  I don't know if there are other
>>>>> > >> uncommitted patches that changed these test suites.
>>>>> > >>
>>>>> > >> I suppose it's also possible that the JUnit process unexpectedly died
>>>>> > >> after removing executable permissions but before restoring them.  That
>>>>> > >> always would have been a weakness of these test suites, regardless of
>>>>> > >> any recent changes.
>>>>> > >>
>>>>> > >> Chris Nauroth
>>>>> > >> Hortonworks
>>>>> > >> http://hortonworks.com/
>>>>> > >>
>>>>> > >>
>>>>> > >>
>>>>> > >>
>>>>> > >>
>>>>> > >>
>>>>> > >> On 3/10/15, 1:47 PM, "Aaron T. Myers" <a...@cloudera.com> wrote:
>>>>> > >>
>>>>> > >>>Hey Colin,
>>>>> > >>>
>>>>> > >>>I asked Andrew Bayer, who works with Apache Infra, what's going on
>>>>> > >>>with these boxes. He took a look and concluded that some perms are
>>>>> > >>>being set in those directories by our unit tests which are precluding
>>>>> > >>>those files from getting deleted. He's going to clean up the boxes
>>>>> > >>>for us, but we should expect this to keep happening until we can fix
>>>>> > >>>the test in question to properly clean up after itself.
>>>>> > >>>
>>>>> > >>>To help narrow down which commit it was that started this, Andrew
>>>>> > >>>sent me this info:
>>>>> > >>>
>>>>> > >>>"/home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3/
>>>>> > >>>has 500 perms, so I'm guessing that's the problem. Been that way
>>>>> > >>>since 9:32 UTC on March 5th."
>>>>> > >>>
>>>>> > >>>--
>>>>> > >>>Aaron T. Myers
>>>>> > >>>Software Engineer, Cloudera
>>>>> > >>>
>>>>> > >>>On Tue, Mar 10, 2015 at 1:24 PM, Colin P. McCabe <cmcc...@apache.org> wrote:
>>>>> > >>>
>>>>> > >>>> Hi all,
>>>>> > >>>>
>>>>> > >>>> A very quick (and not thorough) survey shows that I can't find any
>>>>> > >>>> jenkins jobs that succeeded from the last 24 hours.  Most of them
>>>>> > >>>> seem to be failing with some variant of this message:
>>>>> > >>>>
>>>>> > >>>> [ERROR] Failed to execute goal
>>>>> > >>>> org.apache.maven.plugins:maven-clean-plugin:2.5:clean (default-clean)
>>>>> > >>>> on project hadoop-hdfs: Failed to clean project: Failed to delete
>>>>> > >>>> /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data3
>>>>> > >>>> -> [Help 1]
>>>>> > >>>>
>>>>> > >>>> Any ideas how this happened?  Bad disk, unit test setting wrong
>>>>> > >>>> permissions?
>>>>> > >>>>
>>>>> > >>>> Colin
>>>>> > >>>>
>>>>> > >>
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > > --
>>>>> > > Lei (Eddy) Xu
>>>>> > > Software Engineer, Cloudera
>>>>> >
>>>>>
>>>>
>>>>
>
>
>
> --
> Lei (Eddy) Xu
> Software Engineer, Cloudera
