In my cluster (version 0.90.3), the root table could not be opened after one
region server became unresponsive because of a long GC pause.
The logs show:
// The master assigned the root table to server 82
2011-07-28 21:34:34,710 DEBUG
org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region
-ROOT-,,0.70236052 on 158-1-101-82,20020,1311885942386
// The host of 82 appeared to crash; the master finished splitting its logs and
// reassigned the root and meta regions. But the region server process had not
// actually exited, so the root location verification still passed.
// I think we should not verify the root/meta locations during shutdown-handler processing.
2011-07-28 22:19:53,746 DEBUG org.apache.hadoop.hbase.master.ServerManager:
Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown
handler to be executed, root=true, meta=true
2011-07-28 22:25:10,085 INFO
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting
completed in 316329 ms for
hdfs://158.1.101.82:9000/hbase/.logs/158-1-101-82,20020,1311885942386
2011-07-28 22:26:54,790 DEBUG org.apache.hadoop.hbase.master.HMaster: Not
running balancer because processing dead regionserver(s):
[158-1-101-82,20020,1311885942386]
2011-07-28 22:27:11,176 WARN org.apache.hadoop.hbase.master.CatalogJanitor:
Failed scan of catalog table
java.net.SocketTimeoutException: Call to 158-1-101-82/158.1.101.82:20020 failed
on socket timeout exception: java.net.SocketTimeoutException: 60000 millis
timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/158.1.101.82:57428
remote=158-1-101-82/158.1.101.82:20020]
    at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
    at $Proxy6.getRegionInfo(Unknown Source)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:426)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:273)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:333)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366)
    at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:255)
    at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:237)
    at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:139)
    at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:88)
    at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/158.1.101.82:57428 remote=158-1-101-82/158.1.101.82:20020]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
    at java.io.FilterInputStream.read(FilterInputStream.java:116)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:299)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at java.io.DataInputStream.readInt(DataInputStream.java:370)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477)
2011-07-28 22:28:30,577 DEBUG org.apache.hadoop.hbase.master.ServerManager:
Server REPORT rejected; currently processing 158-1-101-82,20020,1311885942386
as dead server
2011-07-28 22:28:37,591 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:20000-0x23171e103d10018 Creating (or updating) unassigned node for
1028785192 with OFFLINE state
2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
No previous transition plan was found (or we are ignoring an existing plan) for
.META.,,1.1028785192 so generated a random one; hri=.META.,,1.1028785192, src=,
dest=158-1-101-202,20020,1311878322145; 2 (online=2, exclude=null) available
servers
2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Assigning region .META.,,1.1028785192 to 158-1-101-202,20020,1311878322145
2011-07-28 22:28:37,733 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Handling transition=M_ZK_REGION_OFFLINE, server=158-1-101-82:20000,
region=1028785192/.META.
2011-07-28 22:28:37,766 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Handling transition=RS_ZK_REGION_OPENING,
server=158-1-101-202,20020,1311878322145, region=1028785192/.META.
The region server logs show:
2011-07-28 22:19:17,389 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested
for ufdr7,,1311890810267.c1b556627c511076bafbb1b802589cb6. because
regionserver20020.cacheFlusher; priority=4, compaction queue size=1
// The region server was blocked for a long time: its ZooKeeper sessions timed out
// after not hearing from the server for ~552 s.
2011-07-28 22:28:24,829 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 552455ms for sessionid
0x13171e103d7003e, closing socket connection and attempting reconnect
2011-07-28 22:28:24,829 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 552455ms for sessionid
0x23171e103d10020, closing socket connection and attempting reconnect
2011-07-28 22:28:25,186 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server /158.1.101.222:2181
2011-07-28 22:28:25,838 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server /158.1.101.82:2181
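The 60-second hang in the CatalogJanitor stack trace above is the generic JVM socket-timeout behaviour when the remote process still holds its TCP connection open but never answers, which is exactly what a GC-paused region server does. A minimal standalone sketch of that failure mode, not using any HBase code (the class and method names here are illustrative only, and the 200 ms timeout stands in for HBase's 60000 ms ipc timeout):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class TimeoutDemo {
    // Connects to a local server that accepts the connection but never writes
    // anything, mimicking a GC-paused region server, then blocks in read()
    // until the socket timeout fires.
    static String readWithTimeout(int timeoutMillis) throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort())) {
            server.accept();                    // connection established...
            client.setSoTimeout(timeoutMillis); // ...but the peer stays silent
            try {
                client.getInputStream().read(); // blocks for the full timeout
                return "read returned";
            } catch (SocketTimeoutException e) {
                return "SocketTimeoutException";
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readWithTimeout(200));
    }
}
```

Because the connection itself is healthy, the caller pays the full timeout on every verification attempt, which is why verifying root/meta against a not-yet-dead server stalls the master's catalog scan.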