> This is definitely a curious problem.
> It's data corruption.

The file is tab-separated, so I created a quick Perl pipe to print out the
number of tabs on a given line:
  -bash-3.2$ hadoop fs -cat /user/hive/warehouse/ushb/2010-10-25/data-2010-10-25 | perl -pe 's/[^\t\n]//g' | perl -pe 's/\t/-/g' | sort | uniq -c

The STDOUT was slightly disturbing:

        1 --
  1552318 -------

The STDERR more so:

  11/05/04 11:07:49 INFO hdfs.DFSClient: No node available for block: blk_-1511269407958713809_10494 file=/user/hive/warehouse/ushb/2010-10-25/data-2010-10-25
  11/05/04 11:07:49 INFO hdfs.DFSClient: Could not obtain block blk_-1511269407958713809_10494 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
  11/05/04 11:07:52 INFO hdfs.DFSClient: No node available for block: blk_-1511269407958713809_10494 file=/user/hive/warehouse/ushb/2010-10-25/data-2010-10-25
  11/05/04 11:07:52 INFO hdfs.DFSClient: Could not obtain block blk_-1511269407958713809_10494 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
  11/05/04 11:07:58 WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_-1511269407958713809_10494 file=/user/hive/warehouse/ushb/2010-10-25/data-2010-10-25
          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1977)
          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1784)
          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1932)
          at java.io.DataInputStream.read(DataInputStream.java:83)
  (...etc)
  cat: Could not obtain block: blk_-1511269407958713809_10494 file=/user/hive/warehouse/ushb/2010-10-25/data-2010-10-25

--
Tim Ellis
Riot Games
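P.S. For reference, the same per-line tab histogram can be produced with awk
instead of the two Perl passes. This is a hypothetical equivalent of the pipe
above, run against inline sample input since the HDFS file itself isn't
reproducible here:

```shell
# Hypothetical awk equivalent of the Perl pipe: with tab as the field
# separator, NF - 1 is the number of tabs on each line; sort | uniq -c
# then summarizes how many lines have each tab count.
# (the printf sample stands in for `hadoop fs -cat <file>`)
printf 'a\tb\tc\td\nw\tx\ty\tz\nbroken\tline\n' \
  | awk -F'\t' '{ print NF - 1 }' \
  | sort | uniq -c
```

On this sample it reports one line with 1 tab and two lines with 3 tabs; a
healthy tab-separated file should show a single tab count for every line.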