[ https://issues.apache.org/jira/browse/HBASE-17922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083491#comment-16083491 ]
Mike Drob commented on HBASE-17922:
-----------------------------------
Chatted with [~appy] about this offline a bit...
It looks like the problem here is that when TestUtil fails to start a region
server, something in the JVM breaks. His concern was that even if it's a bug
with TestUtil, we might still be uncovering a real issue with Hadoop 3
integration, and maybe changing the test will go back to masking the problem.
This took me way too long to figure out because I had to wire up a bunch of
reflection to start examining HDFS internals, but I think I finally caught the
root cause here.
Here is the minimal test case that fails with the same error as we're seeing
here:
{noformat}
@Test (timeout=15000)
public void testStartStopStart() throws Exception {
  TEST_UTIL.startMiniDFSCluster(1);
  TEST_UTIL.shutdownMiniDFSCluster();
  TEST_UTIL.startMiniCluster(1, 1);
}
{noformat}
What happens is that the first time we start up a DFS cluster, the file system
caches get populated here (line numbers likely off because of the previously
mentioned reflection hacks):
{noformat}
at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:210)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3318)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3275)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:476)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:225)
at org.apache.hadoop.hbase.fs.HFileSystem.<init>(HFileSystem.java:88)
at org.apache.hadoop.hbase.fs.HFileSystem.get(HFileSystem.java:472)
at org.apache.hadoop.hbase.HBaseTestingUtility.getTestFileSystem(HBaseTestingUtility.java:3072)
at org.apache.hadoop.hbase.HBaseTestingUtility.getNewDataTestDirOnTestFS(HBaseTestingUtility.java:576)
at org.apache.hadoop.hbase.HBaseTestingUtility.setupDataTestDirOnTestFS(HBaseTestingUtility.java:565)
at org.apache.hadoop.hbase.HBaseTestingUtility.getDataTestDirOnTestFS(HBaseTestingUtility.java:538)
at org.apache.hadoop.hbase.HBaseTestingUtility.getDataTestDirOnTestFS(HBaseTestingUtility.java:552)
at org.apache.hadoop.hbase.HBaseTestingUtility.createDirsAndSetProperties(HBaseTestingUtility.java:786)
at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniDFSCluster(HBaseTestingUtility.java:655)
{noformat}
That is also where the client finalizer shutdown hook is added, which region
servers attempt to suppress.
In normal operation, only a single region server starts per JVM so we can
suppress that hook and everything is good. In our tests, we can start and stop
multiple mini clusters, and we fix the suppression by checking to see if we
have already suppressed it. If we have then it's still registered in our own
ShutdownHookManager and we don't need to suppress it again, but we can
increment a refcount.
However, if we start and stop a DFS cluster, then that hook gets cleared on DFS
cluster shutdown.
{noformat}
at org.apache.hadoop.util.ShutdownHookManager.clearShutdownHooks(ShutdownHookManager.java:275)
at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1975)
at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1944)
at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1937)
at org.apache.hadoop.hbase.HBaseTestingUtility.shutdownMiniDFSCluster(HBaseTestingUtility.java:849)
{noformat}
The second time we start DFS, this hook doesn't get added. I haven't been able
to figure out what exactly gets reused, but the effect is that the hook isn't
there, and we don't have a copy of it that we've saved off, so the whole thing
goes boom.
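To make the sequence above concrete, here is a rough Java sketch of ref-counted suppression and how it breaks once the hook has been cleared out from under us. Everything in it (SuppressionSketch, HookRegistry, HookSuppressor, the hook name "fs-client-finalizer") is hypothetical and only models the behaviour described in this comment; it is not the actual HBase or Hadoop code.

```java
import java.util.HashMap;
import java.util.Map;

final class SuppressionSketch {

    // Hypothetical stand-in for a JVM-wide hook registry.
    // This is NOT the API of Hadoop's ShutdownHookManager.
    static final class HookRegistry {
        private final Map<String, Runnable> hooks = new HashMap<>();

        void add(String name, Runnable hook) { hooks.put(name, hook); }

        // Returns the removed hook, or null if nothing is registered under that name.
        Runnable remove(String name) { return hooks.remove(name); }

        // Models the mini DFS cluster shutdown clearing every registered hook.
        void clear() { hooks.clear(); }
    }

    // Hypothetical suppressor: the first caller removes the hook from the registry
    // and saves a copy; later callers only bump a refcount.
    static final class HookSuppressor {
        private final HookRegistry registry;
        private final Map<String, Runnable> saved = new HashMap<>();
        private final Map<String, Integer> refCounts = new HashMap<>();

        HookSuppressor(HookRegistry registry) { this.registry = registry; }

        void suppress(String name) {
            if (saved.containsKey(name)) {
                // Already suppressed earlier; just count the additional user.
                refCounts.merge(name, 1, Integer::sum);
                return;
            }
            Runnable hook = registry.remove(name);
            if (hook == null) {
                // Neither registered nor saved: the failure case described above.
                throw new IllegalStateException("hook not registered and no saved copy: " + name);
            }
            saved.put(name, hook);
            refCounts.put(name, 1);
        }

        int refCount(String name) { return refCounts.getOrDefault(name, 0); }
    }

    // Replays: start DFS, shut down DFS, start again, then try to suppress.
    // Returns true if the suppression attempt fails.
    static boolean startStopStartFails() {
        HookRegistry registry = new HookRegistry();
        HookSuppressor suppressor = new HookSuppressor(registry);
        registry.add("fs-client-finalizer", () -> {}); // first DFS start registers the hook
        registry.clear();                              // DFS shutdown clears it
        // Second DFS start: modelled as a no-op because the hook is not re-added.
        try {
            suppressor.suppress("fs-client-finalizer");
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("suppression fails after start/stop/start: " + startStopStartFails());
    }
}
```

The point of the sketch is that the refcount only helps when the suppressor got to the hook first. If something else clears the registry before any suppress call, there is no saved copy to fall back on.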
This particular test was triggering the failure because the aborting
RegionServer would fail before the suppression could happen. The hook would get
cleaned up by DFS instead of by us, and later attempts to start the mini
cluster wouldn't have the hook available and their RegionServers would also
fail.
I assume that HDFS changed in version 3 to do shutdown hook cleanup in the
mini cluster, and wasn't doing this before, but I haven't verified that.
> TestRegionServerHostname always fails against hadoop 3.0.0-alpha2
> -----------------------------------------------------------------
>
> Key: HBASE-17922
> URL: https://issues.apache.org/jira/browse/HBASE-17922
> Project: HBase
> Issue Type: Sub-task
> Components: hadoop3
> Affects Versions: 2.0.0
> Reporter: Jonathan Hsieh
> Assignee: Mike Drob
> Fix For: 2.0.0-alpha-2
>
> Attachments: HBASE-17922.patch
>
>
> {code}
> Running org.apache.hadoop.hbase.regionserver.TestRegionServerHostname
> Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 126.363 sec <<< FAILURE! - in org.apache.hadoop.hbase.regionserver.TestRegionServerHostname
> testRegionServerHostname(org.apache.hadoop.hbase.regionserver.TestRegionServerHostname) Time elapsed: 120.029 sec <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 milliseconds
> at java.lang.Thread.sleep(Native Method)
> at org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:221)
> at org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:405)
> at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:225)
> at org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:94)
> at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:1123)
> at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:1077)
> at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:948)
> at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:942)
> at org.apache.hadoop.hbase.regionserver.TestRegionServerHostname.testRegionServerHostname(TestRegionServerHostname.java:88)
> Results :
> Tests in error:
> TestRegionServerHostname.testRegionServerHostname:88 » TestTimedOut test timed...
> Tests run: 2, Failures: 0, Errors: 1, Skipped: 0
> {code}