Misleading "No LZO codec found, cannot run." exception when using external 
table and LZO / DeprecatedLzoTextInputFormat
-----------------------------------------------------------------------------------------------------------------------

                 Key: HIVE-2395
                 URL: https://issues.apache.org/jira/browse/HIVE-2395
             Project: Hive
          Issue Type: Bug
          Components: Serializers/Deserializers
    Affects Versions: 0.7.1
         Environment: Cloudera 3u1 with https://github.com/kevinweil/hadoop-lzo 
or https://github.com/kevinweil/elephant-bird
            Reporter: Vitaliy Fuks


We have a {{/tables/}} directory containing .lzo files with our data, 
compressed using lzop.

We {{CREATE EXTERNAL TABLE}} on top of this directory, using {{STORED AS 
INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"}}.

.lzo files need to be run through LzoIndexer to be splittable. When this is 
done, a .lzo.index file is created for every .lzo file, so we end up with:

{noformat}
/tables/ourdata_2011-08-19.lzo
/tables/ourdata_2011-08-19.lzo.index
/tables/ourdata_2011-08-18.lzo
/tables/ourdata_2011-08-18.lzo.index
..etc
{noformat}

The issue is that org.apache.hadoop.hive.ql.io.CombineHiveRecordReader calls 
getRecordReader() for the .lzo.index files as well as the .lzo files. This 
throws a confusing exception:

{noformat}
Caused by: java.io.IOException: No LZO codec found, cannot run.
        at com.hadoop.mapred.DeprecatedLzoLineRecordReader.<init>(DeprecatedLzoLineRecordReader.java:53)
        at com.hadoop.mapred.DeprecatedLzoTextInputFormat.getRecordReader(DeprecatedLzoTextInputFormat.java:128)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
{noformat}

More precisely, it dies on the second invocation of getRecordReader(). Here 
is some System.out.println() output:

{noformat}
DeprecatedLzoTextInputFormat.getRecordReader(): split=/tables/ourdata_2011-08-19.lzo:0+616479
DeprecatedLzoTextInputFormat.getRecordReader(): split=/tables/ourdata_2011-08-19.lzo.index:0+64
{noformat}

Per the stack trace above, the DeprecatedLzoLineRecordReader constructor 
(invoked from DeprecatedLzoTextInputFormat.getRecordReader()) contains the 
following code, which throws the exception and kills the query, since no 
codec is registered for .lzo.index files:

{noformat}
    final CompressionCodec codec = codecFactory.getCodec(file);
    if (codec == null) {
      throw new IOException("No LZO codec found, cannot run.");
    }
{noformat}
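
To illustrate why the lookup comes back null for the index file, here is a 
minimal stand-alone sketch. It assumes codecs are resolved by file-name 
suffix, which is how Hadoop's CompressionCodecFactory.getCodec() works. The 
class SuffixCodecLookup, its findCodec() method and the registry contents are 
hypothetical stand-ins for illustration, not Hadoop or hadoop-lzo code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Toy stand-in for suffix-based codec lookup (hypothetical, not Hadoop's class). */
class SuffixCodecLookup {
    // Hypothetical registry: file-name suffix -> codec name (plain strings only).
    private static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".lzo", "LzopCodec");
        CODECS.put(".gz", "GzipCodec");
    }

    /** Returns the codec whose suffix ends the file name, or empty if none matches. */
    static Optional<String> findCodec(String fileName) {
        for (Map.Entry<String, String> entry : CODECS.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] files = {
            "/tables/ourdata_2011-08-19.lzo",
            "/tables/ourdata_2011-08-19.lzo.index"
        };
        for (String file : files) {
            // The index file ends in ".index", so no registered suffix matches it.
            System.out.println(file + " -> " + findCodec(file).orElse("no codec"));
        }
    }
}
```

Under that assumption, a name ending in .lzo.index matches no registered 
suffix, so the null check fires and the message blames a missing LZO codec 
even though the codec itself is installed.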

As I understand it, Hive currently considers all files within a directory to 
be part of a table. There is an open patch, HIVE-951, which would allow a 
quick workaround for this problem.

Does it make sense to add some hooks so that CombineHiveRecordReader or its 
parents are more aware of what files should be considered instead of blindly 
trying to read everything?

Any suggestions for a quick workaround to make it skip .index files?
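
The kind of filtering being asked about could look roughly like the sketch 
below. It is plain Java with no Hive or Hadoop dependency: the class 
LzoIndexSkipper and its methods are hypothetical names for illustration, and 
where such a check would be hooked into Hive's input path is exactly the open 
question, so no wiring is shown.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper that drops LZO index files from a directory listing. */
class LzoIndexSkipper {
    // Suffix of the index files produced by LzoIndexer, as listed in this report.
    static final String INDEX_SUFFIX = ".lzo.index";

    /** True for any path that is not an LZO index file. */
    static boolean isDataFile(String path) {
        return !path.endsWith(INDEX_SUFFIX);
    }

    /** Keeps only data files, preserving the original order. */
    static List<String> dataFilesOnly(List<String> paths) {
        return paths.stream()
                .filter(LzoIndexSkipper::isDataFile)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
                "/tables/ourdata_2011-08-19.lzo",
                "/tables/ourdata_2011-08-19.lzo.index",
                "/tables/ourdata_2011-08-18.lzo",
                "/tables/ourdata_2011-08-18.lzo.index");
        // Only the .lzo data files should be handed to a record reader.
        dataFilesOnly(listing).forEach(System.out::println);
    }
}
```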


        
