[
https://issues.apache.org/jira/browse/HBASE-16488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997200#comment-15997200
]
Josh Elser commented on HBASE-16488:
------------------------------------
{noformat}
@@ -2599,11 +2625,26 @@ public class HMaster extends HRegionServer implements MasterServices, Server {
   void checkNamespaceManagerReady() throws IOException {
     checkInitialized();
-    if (tableNamespaceManager == null ||
-        !tableNamespaceManager.isTableAvailableAndInitialized(true)) {
+    if (tableNamespaceManager == null) {
       throw new IOException("Table Namespace Manager not ready yet, try again later");
+    } else if (!tableNamespaceManager.isTableAvailableAndInitialized(true)) {
+      try {
+        // Wait some time.
+        long startTime = EnvironmentEdgeManager.currentTime();
+        int timeout = conf.getInt("hbase.master.namespace.waitforready", 30000);
+        while (!tableNamespaceManager.isTableNamespaceManagerStarted() &&
+            EnvironmentEdgeManager.currentTime() - startTime < timeout) {
+          Thread.sleep(100);
+        }
+      } catch (InterruptedException e) {
+        throw (InterruptedIOException) new InterruptedIOException().initCause(e);
+      }
+      if (!tableNamespaceManager.isTableNamespaceManagerStarted()) {
+        throw new IOException("Table Namespace Manager not fully initialized, try again later");
+      }
     }
   }
{noformat}
This sits a little funny with me. Ideally, we'd have the caller do the sleeping
so that we're not blocking a thread inside of the Master (or worse, an RPC
handler). Your change here is definitely easier to implement, but I wonder how
hard it would be to keep throwing the exception and implement retry logic in
the callers (other methods in HMaster or the HBase client).
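To make the caller-side alternative concrete, here is a minimal sketch of what such retry logic could look like. {{NamespaceRetry}}, {{IOCall}} and {{callWithRetry}} are hypothetical names for illustration, not existing HBase APIs, and the sketch retries on any {{IOException}} rather than on a dedicated "not ready" exception type.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

/** Hypothetical caller-side retry helper; not an HBase API. */
final class NamespaceRetry {
  /** An operation that may fail with IOException while the namespace manager starts. */
  interface IOCall<T> {
    T call() throws IOException;
  }

  private NamespaceRetry() {}

  /**
   * Runs op up to maxAttempts times, sleeping pauseMs between failed attempts.
   * The sleep happens on the caller's thread, not on a Master thread or RPC handler.
   */
  static <T> T callWithRetry(IOCall<T> op, int maxAttempts, long pauseMs) throws IOException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.call();
      } catch (InterruptedIOException e) {
        throw e; // never retry an interrupted call
      } catch (IOException e) {
        last = e; // e.g. "Table Namespace Manager not ready yet, try again later"
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(pauseMs);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // preserve the interrupt status
          throw (InterruptedIOException) new InterruptedIOException().initCause(e);
        }
      }
    }
    throw last; // all attempts failed; surface the most recent error
  }
}
```

With something like this in the callers, {{checkNamespaceManagerReady()}} could keep failing fast and the wait budget would be the caller's decision.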
Unrelated: shouldn't {{tableNamespaceManager}} be volatile if we're checking it
across different threads? Or, make it final and use an {{AtomicReference}}?
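A rough sketch of the {{AtomicReference}} variant, for illustration only. {{ManagerHolder}} is a hypothetical stand-in, not code from the patch. The point is that readers take a single snapshot of the reference and then work with that local value.

```java
import java.io.IOException;
import java.util.Objects;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical holder for a lazily started manager; not code from the patch. */
final class ManagerHolder<M> {
  // The field itself is final; visibility across threads comes from AtomicReference.
  private final AtomicReference<M> ref = new AtomicReference<>();

  /** Publishes the manager once; returns false if one was already published. */
  boolean publish(M manager) {
    return ref.compareAndSet(null, Objects.requireNonNull(manager));
  }

  /** Reads the reference exactly once, so the null check and the use see the same value. */
  M getOrThrow() throws IOException {
    M manager = ref.get();
    if (manager == null) {
      throw new IOException("Table Namespace Manager not ready yet, try again later");
    }
    return manager;
  }
}
```

A plain {{volatile}} field gives the same visibility guarantee; the {{AtomicReference}} only adds the compare-and-set for one-time publication.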
{noformat}
@@ -93,7 +94,7 @@ public class TableNamespaceManager {
     long startTime = EnvironmentEdgeManager.currentTime();
     int timeout = conf.getInt(NS_INIT_TIMEOUT, DEFAULT_NS_INIT_TIMEOUT);
     while (!isTableAvailableAndInitialized(false)) {
-      if (EnvironmentEdgeManager.currentTime() - startTime + 100 > timeout) {
+      if (EnvironmentEdgeManager.currentTime() - startTime > timeout) {
         // We can't do anything if ns is not online.
         throw new IOException("Timedout " + timeout + "ms waiting for namespace table to " +
           "be assigned");
{noformat}
Do you know why we were previously padding the elapsed time by 100ms?
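For reference, the two conditions only disagree when the elapsed time is within 100ms of the timeout. The sketch below isolates that arithmetic; {{TimeoutCheck}} and its method names are hypothetical. One possible reading, and it is only an assumption since the hunk does not show the rest of the loop, is that the padding accounted for a 100ms sleep between polls.

```java
/** Hypothetical helper isolating the two timeout conditions from the hunk above. */
final class TimeoutCheck {
  // Assumed poll interval; the quoted hunk does not show the sleep itself.
  static final long PAD_MS = 100;

  private TimeoutCheck() {}

  /** Old condition: give up if the padded elapsed time exceeds the timeout. */
  static boolean oldTimedOut(long elapsedMs, long timeoutMs) {
    return elapsedMs + PAD_MS > timeoutMs;
  }

  /** New condition: give up only once the timeout has actually been exceeded. */
  static boolean newTimedOut(long elapsedMs, long timeoutMs) {
    return elapsedMs > timeoutMs;
  }
}
```

With a 30000ms timeout, an elapsed time of 29950ms trips the old check but not the new one, so the new code allows one more pass through the loop before giving up.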
{noformat}
diff --git hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
index f60be66..c75d4bc 100644
--- hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
+++ hbase-server/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
@@ -105,6 +105,7 @@ import org.apache.hadoop.hbase.security.HBaseKerberosUtils;
 import org.apache.hadoop.hbase.security.User;
 import org.apache.hadoop.hbase.security.visibility.VisibilityLabelsCache;
 import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.hadoop.hbase.util.EnvironmentEdgeManager;
 import org.apache.hadoop.hbase.util.FSTableDescriptors;
 import org.apache.hadoop.hbase.util.FSUtils;
 import org.apache.hadoop.hbase.util.JVMClusterUtil;
@@ -1459,6 +1460,7 @@ public class HBaseTestingUtility extends HBaseCommonTestingUtility {
         .setMaxVersions(numVersions);
       desc.addFamily(hcd);
     }
+    waitUntilTableNamespaceManagerStarted();
     getHBaseAdmin().createTable(desc, startKey, endKey, numRegions);
     // HBaseAdmin only waits for regions to appear in hbase:meta we should wait until they are assigned
     waitUntilAllRegionsAssigned(tableName);
@@ -1497,6 +1499,7 @@ public class HBaseTestingUtility extends HBaseCommonTestingUtility {
       hcd.setBloomFilterType(BloomType.NONE);
       htd.addFamily(hcd);
     }
+    waitUntilTableNamespaceManagerStarted();
     getHBaseAdmin().createTable(htd, splitKeys);
     // HBaseAdmin only waits for regions to appear in hbase:meta we should wait until they are
     // assigned
{noformat}
Could we do this once in {{MiniHBaseCluster startMiniHBaseCluster(int, int,
Class, Class, boolean, boolean)}} instead of scattering it across
HBaseTestingUtility?
Nice test additions!
> Starting namespace and quota services in master startup asynchronizely
> ----------------------------------------------------------------------
>
> Key: HBASE-16488
> URL: https://issues.apache.org/jira/browse/HBASE-16488
> Project: HBase
> Issue Type: Improvement
> Components: master
> Affects Versions: 2.0.0, 1.3.0, 1.0.3, 1.4.0, 1.1.5, 1.2.2
> Reporter: Stephen Yuan Jiang
> Assignee: Stephen Yuan Jiang
> Attachments: HBASE-16488.v1-branch-1.patch,
> HBASE-16488.v1-master.patch, HBASE-16488.v2-branch-1.patch,
> HBASE-16488.v2-branch-1.patch, HBASE-16488.v3-branch-1.patch,
> HBASE-16488.v3-branch-1.patch, HBASE-16488.v4-branch-1.patch,
> HBASE-16488.v5-branch-1.patch, HBASE-16488.v6-branch-1.patch
>
>
> From time to time, in internal IT tests and at customer sites, we see master
> initialization fail because the namespace table region takes a long time to be
> assigned (e.g. log splitting is slow or hangs, an RS is temporarily
> unavailable, or there is some unknown assignment issue). There have been
> earlier proposals to improve this situation, e.g.
> HBASE-13556 / HBASE-14190 (Assign system tables ahead of user region
> assignment), HBASE-13557 (Special WAL handling for system tables), or
> HBASE-14623 (Implement dedicated WAL for system tables).
> This JIRA proposes another way to solve the master initialization failure:
> the namespace service is only used by a handful of operations (e.g. create
> table / namespace DDL / get namespace API / some RS group DDL). Only the quota
> manager depends on it, and quota management is off by default. Therefore, the
> namespace service is not really needed for the master to be functional, so we
> could start the namespace service asynchronously without blocking master startup.
>