I think this is HBASE-4168, whose fix was committed today. We run 0.90.4 patched with HBASE-4168.
Please try the patch.

On Aug 10, 2011, at 7:05 PM, Gaojinchao <[email protected]> wrote:

> In my cluster (version 0.90.3), the root table could not be opened when one
> region server crashed because of GC.
>
> The logs show:
>
> // The master assigned the root table to 82:
> 2011-07-28 21:34:34,710 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region -ROOT-,,0.70236052 on 158-1-101-82,20020,1311885942386
>
> // The host of 82 crashed. The master finished splitting its logs and
> // reassigned root and meta, but the region server process had not exited,
> // so the root verification passed.
> I think we should not verify root/meta in the shutdown handler processing.
>
> 2011-07-28 22:19:53,746 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=158-1-101-82,20020,1311885942386 to dead servers, submitted shutdown handler to be executed, root=true, meta=true
> 2011-07-28 22:25:10,085 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting completed in 316329 ms for hdfs://158.1.101.82:9000/hbase/.logs/158-1-101-82,20020,1311885942386
> 2011-07-28 22:26:54,790 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [158-1-101-82,20020,1311885942386]
> 2011-07-28 22:27:11,176 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
> java.net.SocketTimeoutException: Call to 158-1-101-82/158.1.101.82:20020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/158.1.101.82:57428 remote=158-1-101-82/158.1.101.82:20020]
>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>         at $Proxy6.getRegionInfo(Unknown Source)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:426)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:273)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:333)
>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366)
>         at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:255)
>         at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:237)
>         at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:139)
>         at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:88)
>         at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/158.1.101.82:57428 remote=158-1-101-82/158.1.101.82:20020]
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:299)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>         at java.io.DataInputStream.readInt(DataInputStream.java:370)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477)
> 2011-07-28 22:28:30,577 DEBUG org.apache.hadoop.hbase.master.ServerManager: Server REPORT rejected; currently processing 158-1-101-82,20020,1311885942386 as dead server
> 2011-07-28 22:28:37,591 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x23171e103d10018 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
> 2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for .META.,,1.1028785192 so generated a random one; hri=.META.,,1.1028785192, src=, dest=158-1-101-202,20020,1311878322145; 2 (online=2, exclude=null) available servers
> 2011-07-28 22:28:37,704 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region .META.,,1.1028785192 to 158-1-101-202,20020,1311878322145
> 2011-07-28 22:28:37,733 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=158-1-101-82:20000, region=1028785192/.META.
> 2011-07-28 22:28:37,766 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=158-1-101-202,20020,1311878322145, region=1028785192/.META.
>
> Region server logs:
> 2011-07-28 22:19:17,389 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for ufdr7,,1311890810267.c1b556627c511076bafbb1b802589cb6. because regionserver20020.cacheFlusher; priority=4, compaction queue size=1
>
> // Blocked for a long time (the GC pause); the ZooKeeper sessions then time out:
> 2011-07-28 22:28:24,829 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 552455ms for sessionid 0x13171e103d7003e, closing socket connection and attempting reconnect
> 2011-07-28 22:28:24,829 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 552455ms for sessionid 0x23171e103d10020, closing socket connection and attempting reconnect
> 2011-07-28 22:28:25,186 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /158.1.101.222:2181
> 2011-07-28 22:28:25,838 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server /158.1.101.82:2181
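For context on why the check first "passes" and later hangs: the master's verification is a blocking RPC (getRegionInfo) against the cached catalog location, and a GC-paused JVM still has its TCP sockets open, so the connect succeeds and the read just sits there for the full 60-second socket timeout instead of failing fast. Here is a minimal, self-contained sketch of that pattern. It is illustrative only, not the actual CatalogTracker code; the class name, host, and port are made up for the example:

    // Illustrative sketch only -- NOT the actual HBase code. It mimics the
    // pattern in the stack trace above: connect to the (GC-paused) region
    // server and do a blocking read, as the getRegionInfo RPC does.
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class VerifyRegionLocationSketch {
      /** Returns true only if the server answers within timeoutMs. */
      static boolean verifyRegionLocation(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
          // The paused JVM's socket is still open, so connect() succeeds...
          s.connect(new InetSocketAddress(host, port), timeoutMs);
          s.setSoTimeout(timeoutMs); // 60000 ms in the logs above
          DataInputStream in = new DataInputStream(s.getInputStream());
          in.readInt(); // ...and we block here until the timeout fires
          return true;
        } catch (IOException e) { // SocketTimeoutException after 60 s
          return false;
        }
      }

      public static void main(String[] args) {
        // Against a GC-paused server this sits for the full minute before
        // printing false; it never fails fast.
        System.out.println(verifyRegionLocation("158.1.101.82", 20020, 60000));
      }
    }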
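On the suggestion to skip the root/meta verification in shutdown handler processing, one way to picture it is below. This is a rough sketch with hypothetical names throughout (CatalogAssigner, DeadServerInfo, and so on; the real 0.90 handler is structured differently). The point is only that when the dead server is the one that carried the catalog regions, their cached location is known to be stale, so the handler can reassign directly rather than verify first:

    // Rough sketch of the proposal, hypothetical names throughout; not the
    // real shutdown handler.
    interface CatalogAssigner {
      void assignRoot();
      void assignMeta();
    }

    class DeadServerInfo {
      final boolean carryingRoot;
      final boolean carryingMeta;
      DeadServerInfo(boolean root, boolean meta) {
        this.carryingRoot = root;
        this.carryingMeta = meta;
      }
    }

    public class ShutdownHandlerSketch {
      private final CatalogAssigner assigner;

      ShutdownHandlerSketch(CatalogAssigner assigner) {
        this.assigner = assigner;
      }

      /** Reassign catalog regions without verifying their known-stale location. */
      void process(DeadServerInfo dead) {
        if (dead.carryingRoot) {
          // No verify RPC here: the cached -ROOT- location points at the dead
          // server, so verification can only block until the socket timeout.
          assigner.assignRoot();
        }
        if (dead.carryingMeta) {
          assigner.assignMeta();
        }
        // ...then split logs and reassign the remaining user regions...
      }
    }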
