Hi, Todd:
I tried it from scratch and it is still not working well. I am currently using Yahoo's Hadoop 0.20.9 and HBase 0.20.5, and I just have too much trouble with them. I am thinking of moving to a supported and blessed combination of Hadoop and HBase. What is the most stable combination of the two? When I send enough data to HBase, the region server times out while reading/writing data, and eventually the regionserver shuts itself down. I have spent too much time on this, so I am considering a packaged distribution.

Jinsong


--------------------------------------------------
From: "Todd Lipcon" <t...@cloudera.com>
Sent: Friday, May 21, 2010 11:14 PM
To: <hdfs-dev@hadoop.apache.org>
Subject: Re: Fw: hadoop data loss issue discovered. please fix!

Hi Jinsong,

I don't see any data loss here.

The sequence of events from the logs:

==> NN allocates block:
2010-05-18 21:21:29,731 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.allocateBlock: /hbase/.META./1028785192/info/656097411976846533.
blk_5636039758999247483_31304886

===> First DN reports it has received block
2010-05-18 21:21:29,913 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 10.110.8.63:50010 is added to
blk_5636039758999247483_31304886 size 441

===> Client calls completeFile
2010-05-18 21:21:29,913 INFO org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.completeFile: file
/hbase/.META./1028785192/info/656097411976846533 is closed by
DFSClient_-919320526

===> 2nd and 3rd DN have not yet heartbeated since receiving the block, so
replication count is low, and unnecessary replication is scheduled. This is
a known issue - I was actually meaning to file a JIRA about it this week.
2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
10.110.8.63:50010 to replicate blk_5636039758999247483_31304886 to
datanode(s) 10.110.8.86:50010 10.110.8.69:50010

===> Other DNs check in (within 4-5 seconds of file completion, which is
reasonable heartbeat time)
2010-05-18 21:21:33,156 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 10.110.8.69:50010 is added to
blk_5636039758999247483_31304886 size 441
2010-05-18 21:21:34,413 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 10.110.8.59:50010 is added to
blk_5636039758999247483_31304886 size 441

===> 8 seconds later the first replication goes through and cleanup of
excess replicas happens
2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 10.110.8.86:50010 is added to
blk_5636039758999247483_31304886 size 441
2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.chooseExcessReplicates: (10.110.8.63:50010,
blk_5636039758999247483_31304886) is added to recentInvalidateSets
2010-05-18 21:21:43,995 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask
10.110.8.63:50010 to delete  blk_5636039758999247483_31304886
blk_4349310048904429157_31304519

===> another 14 seconds later, the other replication goes through and
another excess is invalidated
2010-05-18 21:21:57,835 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 10.110.8.85:50010 is added to
blk_5636039758999247483_31304886 size 441
2010-05-18 21:21:57,835 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.chooseExcessReplicates: (10.110.8.69:50010,
blk_5636039758999247483_31304886) is added to recentInvalidateSets

===> about 5 minutes later, the regionserver performs a compaction and asks
the NN to delete this file
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet
of 10.110.8.63:50010
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet
of 10.110.8.69:50010
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet
of 10.110.8.59:50010
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet
of 10.110.8.86:50010
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet
of 10.110.8.85:50010
2010-05-18 21:26:39,389 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop,hadoop
     ip=/10.110.8.85 cmd=delete
src=/hbase/.META./1028785192/info/656097411976846533    dst=null
perm=null

As for the errors seen in the regionserver, the issue is that it called
open() before replication was done, and therefore only got one block
location. When the replica was removed there, the RS should have gone back
to the NN for more replicas. I'm not sure if the codebase you're running
includes HDFS-445. Judging from the CHANGES.txt, it appears not. So, it's
likely that you're hitting this bug where DFSClient wouldn't refetch block
locations from the NN for positional reads.
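
To make that concrete, here is a rough, simplified sketch in Java (made-up types and helper names, not the actual DFSClient code) of the difference between a positional read that only ever uses the block locations it cached at open() and one that goes back to the NN for fresh locations once every cached replica has failed:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PreadRetrySketch {

  // Hypothetical stand-ins, not real HDFS classes.
  interface LocationSource {
    List<Replica> fetchLocations(long blockId) throws IOException;  // e.g. ask the NN again
  }

  interface Replica {
    byte[] read(long offset, int len) throws IOException;
    String name();
  }

  static byte[] pread(long blockId, long offset, int len,
                      List<Replica> cachedLocations,
                      LocationSource namenode,
                      boolean refetchOnFailure) throws IOException {
    List<Replica> locations = cachedLocations;
    List<String> dead = new ArrayList<String>();
    for (int attempt = 0; attempt < 2; attempt++) {
      for (Replica r : locations) {
        if (dead.contains(r.name())) {
          continue;
        }
        try {
          return r.read(offset, len);               // success
        } catch (IOException e) {
          dead.add(r.name());                       // mark this replica dead, try the next
        }
      }
      if (!refetchOnFailure) {
        break;                                      // old behavior: give up once the cached locations are exhausted
      }
      locations = namenode.fetchLocations(blockId); // HDFS-445-style: refresh locations and retry
      dead.clear();
    }
    throw new IOException("Could not read block " + blockId + " from any replica");
  }
}

With the old behavior, the loop gives up as soon as every location it saw at open() is marked dead; since the RS only had the one cached location and that replica had just been invalidated, the pread fails outright instead of finding the new replicas.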

Thanks
-Todd


On Fri, May 21, 2010 at 3:32 PM, Jinsong Hu <jinsong...@hotmail.com> wrote:

Hi, Todd:
 I have cut all the logs for those 10 minutes from the 3 machines. If you grep for 5636039758999247483, you will see all the log entries from the regionserver, namenode, and datanode, and you can compare the time sequence. All of our machines' clocks are synced via NTP, so you can line the logs up side by side and see the exact sequence of events.
 The build I am using is from the Yahoo 0.20.9 distribution.

Jinsong

--------------------------------------------------
From: "Todd Lipcon" <t...@cloudera.com>
Sent: Friday, May 21, 2010 2:20 PM
To: <hdfs-dev@hadoop.apache.org>
Subject: Re: Fw: hadoop data loss issue discovered. please fix!


 Hi Jinsong,

Could you upload a tarball of the log files somewhere from each of the DNs and the RS involved? It's hard to trace through the logs in the email (the email added all kinds of wrapping, etc.).

-Todd

On Fri, May 21, 2010 at 2:17 PM, Jinsong Hu <jinsong...@hotmail.com>
wrote:

 Hi, There:
While using the Hadoop 0.20.9-yahoo distribution and HBase 0.20.4, I found that Hadoop loses blocks under certain circumstances and thus corrupts HBase tables.

I compared the namenode, datanode, and HBase regionserver logs and figured out the reason.



The regionserver 10.110.8.85 asks the namenode 10.110.8.83 to save a block; 10.110.8.84 gives multiple IPs, and the regionserver chooses 10.110.8.63 and saves the block there. After a while, the namenode asks for the block to be replicated to the 10.110.8.86 and 10.110.8.69 machines. A moment later, .86 and .69 receive the replica, but strangely, 10.110.8.59 and 10.110.8.85 also receive a replica of the block, even though they are not in the replication list.



Then chooseExcessReplicates asks to delete the excess replicas from .63 and .69, thinking there are too many replicas. Even though .63 held the original copy, the algorithm chooses which replica to delete based on the amount of free disk space. A moment later, addToInvalidates (not driven by chooseExcessReplicates) asks for the block to be deleted on .86, .85, and .59. I checked the code, and this can only happen if the block is considered corrupted.

In the end, this block does not exist anywhere in the cluster, and it is permanently lost.





namenode:

2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (10.110.8.63:50010, blk_5636039758999247483_31304886) is added to recentInvalidateSets
2010-05-18 21:21:43,995 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to delete  blk_5636039758999247483_31304886 blk_4349310048904429157_31304519


The block was initially added to 10.110.8.63, then replicated so that 10.110.8.63, .59, .69, .86 and .85 all held copies. Subsequently, the replication process and addToInvalidates removed all of them. The code review suggests the replica was treated as corrupt, and so all copies got deleted.





2010-05-18 21:21:29,913 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.63:50010 is added to blk_5636039758999247483_31304886 size 441
2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to replicate blk_5636039758999247483_31304886 to datanode(s) 10.110.8.86:50010 10.110.8.69:50010
2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (10.110.8.63:50010, blk_5636039758999247483_31304886) is added to recentInvalidateSets
2010-05-18 21:21:43,995 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to delete  blk_5636039758999247483_31304886 blk_4349310048904429157_31304519
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.63:50010
2010-05-18 21:26:45,953 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to delete  blk_-1838286221287242082_31305179 blk_8446762564182513417_31305143 blk_5636039758999247483_31304886 blk_4628640249731313760_31305046 blk_7460947863067370701_31270225 blk_-4468681536500281247_31270225 blk_8453517711101429609_31303917 blk_9126133835045521966_31303972 blk_4623110280826973929_31305203 blk_-2581238696314957800_31305033 blk_7461125351290749755_31305052


2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to replicate blk_5636039758999247483_31304886 to datanode(s) 10.110.8.86:50010 10.110.8.69:50010
2010-05-18 21:21:33,156 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.69:50010 is added to blk_5636039758999247483_31304886 size 441
2010-05-18 21:21:57,835 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.chooseExcessReplicates: (10.110.8.69:50010, blk_5636039758999247483_31304886) is added to recentInvalidateSets
2010-05-18 21:21:59,005 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.69:50010 to delete  blk_5636039758999247483_31304886
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.69:50010
2010-05-18 21:26:42,951 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.69:50010 to delete  blk_-2124965527858346013_31270213 blk_-5027506345849158498_31270213 blk_5636039758999247483_31304886 blk_9148821113904458973_31305189 blk_4850797749721229572_31305072 blk_252039065084461924_31305031 blk_-8351836728009062091_31305208 blk_-7576696059515014894_31305194 blk_-2900250119736465962_31270214 blk_471700613578524871_31304950 blk_-190744003190006044_31305064 blk_7265057386742001625_31305073


2010-05-18 21:21:31,987 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.63:50010 to replicate blk_5636039758999247483_31304886 to datanode(s) 10.110.8.86:50010 10.110.8.69:50010
2010-05-18 21:21:41,941 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.86:50010 is added to blk_5636039758999247483_31304886 size 441
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.86:50010
2010-05-18 21:26:42,951 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.86:50010 to delete  blk_-6242136662924452584_31259201 blk_5636039758999247483_31304886 blk_4850797749721229572_31305072 blk_252039065084461924_31305031 blk_-1317144678443645904_31305204 blk_6050185755706975664_31270230 blk_2671416971885801868_31304948 blk_-5582352089328547938_31305022 blk_-3115115738671914626_31270210


2010-05-18 21:21:34,413 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.59:50010 is added to blk_5636039758999247483_31304886 size 441
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.59:50010
2010-05-18 21:26:39,950 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.59:50010 to delete  blk_5636039758999247483_31304886 blk_-4528512156635399625_31305212 blk_1439789418382469336_31305158 blk_8860574934531794641_31270219 blk_-8358193301564392132_31305029


2010-05-18 21:21:57,835 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.110.8.85:50010 is added to blk_5636039758999247483_31304886 size 441
2010-05-18 21:26:39,388 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_5636039758999247483 is added to invalidSet of 10.110.8.85:50010
2010-05-18 21:26:39,950 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.110.8.85:50010 to delete  blk_-6242136662924452584_31259201 blk_5636039758999247483_31304886 blk_4628640249731313760_31305046 blk_4747588241975451642_31305123 blk_-6876078628884993825_31270230 blk_-4468681536500281247_31270225 blk_7325830193509411302_31270230 blk_8453517711101429609_31303917 blk_-6094734447689285387_31305127 blk_3353439739797003235_31305037 blk_-5027506345849158498_31270213 blk_1484161645992497144_31270225 blk_4464987648045469454_31305144 blk_7460947863067370701_31270225 blk_-1170815606945644545_31270230 blk_6050185755706975664_31270230 blk_-8358193301564392132_31305029 blk_2671416971885801868_31304948 blk_5593547375459437465_31286511 blk_-2581238696314957800_31305033 blk_4732635559915402193_31270230 blk_-2124965527858346013_31270213 blk_-5837992573431863412_31286612 blk_-432558447034944954_31270208 blk_-3407615138527189735_31305069 blk_8860574934531794641_31270219 blk_233110856487529716_31270229 blk_312750273180273303_31270228 blk_7461125351290749755_31305052 blk_-8902661185532055148_31304947 blk_-8555258258738129670_31270210 blk_252039065084461924_31305031 blk_9037118763503479133_31305120 blk_-8494656323754369174_31305105 blk_9126133835045521966_31303972 blk_-5582352089328547938_31305022 blk_-2900250119736465962_31270214 blk_-3115115738671914626_31270210 blk_7612090442234634555_31270225 blk_5876492007747505188_31270213 blk_471700613578524871_31304950 blk_-190744003190006044_31305064


datanode 10.110.8.63:

2010-05-18 21:21:46,058 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleting block blk_5636039758999247483_31304886 file /hadoop_data_dir/dfs/data/current/subdir23/blk_5636039758999247483


hbase region server 10.110.8.85:

DFSClient.java:

DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
InetSocketAddress targetAddr = NetUtils.createSocketAddr(chosenNode.getName());
return new DNAddrPair(chosenNode, targetAddr);



The DFSClient still picked 10.110.8.63, even though the namenode sent the command at 21:21:43,995 to delete the block, and the deletion was executed at 21:21:46,058. Why?
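
To make my question concrete, here is a simplified, made-up sketch (not the real DFSClient code) of what I understand a bestNode-style selection to be doing. It chooses only among the locations the client already holds, minus whatever is in deadNodes, so it cannot know that the namenode told .63 to drop its replica a couple of seconds earlier; the node only becomes ineligible after a read against it actually fails:

import java.util.List;
import java.util.Set;

// Hypothetical sketch only; the real bestNode() in DFSClient has more logic.
// The point is that the choice is made purely from the client-side cache of
// locations and the client-side deadNodes set.
public class BestNodeSketch {

  static String bestNode(List<String> cachedLocations, Set<String> deadNodes) {
    for (String node : cachedLocations) {
      if (!deadNodes.contains(node)) {
        // The first cached location not yet marked dead wins, even if the
        // namenode has since invalidated its replica on that datanode.
        return node;
      }
    }
    throw new IllegalStateException("No live nodes contain current block");
  }
}

Until the read against .63 actually fails and it is added to deadNodes, the client keeps choosing it.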









2010-05-18 21:21:46,188 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.110.8.63:50010 for file /hbase/.META./1028785192/info/656097411976846533 for block 5636039758999247483:java.io.IOException: Got error in response to OP_READ_BLOCK for file /hbase/.META./1028785192/info/656097411976846533 for block 5636039758999247483


--
Todd Lipcon
Software Engineer, Cloudera




--
Todd Lipcon
Software Engineer, Cloudera
