[ https://issues.apache.org/jira/browse/HIVE-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089612#comment-13089612 ]
Raghu Angadi commented on HIVE-2395:
------------------------------------

> .lzo files require that an LzoIndexer is run on them.

This is not a requirement. You need the index file only if you want to split large lzo files. You could just remove the index files as a quick workaround (in which case you might as well use just TextInputFormat).

> Misleading "No LZO codec found, cannot run." exception when using external table and LZO / DeprecatedLzoTextInputFormat
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2395
>                 URL: https://issues.apache.org/jira/browse/HIVE-2395
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.7.1
>         Environment: Cloudera 3u1 with https://github.com/kevinweil/hadoop-lzo or https://github.com/kevinweil/elephant-bird
>            Reporter: Vitaliy Fuks
>
> We have a {{/tables/}} directory containing .lzo files with our data, compressed using lzop.
> We {{CREATE EXTERNAL TABLE}} on top of this directory, using {{STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"}}.
> .lzo files require that an LzoIndexer is run on them. When this is done, a .lzo.index file is created for every .lzo file, so we end up with:
> {noformat}
> /tables/ourdata_2011-08-19.lzo
> /tables/ourdata_2011-08-19.lzo.index
> /tables/ourdata_2011-08-18.lzo
> /tables/ourdata_2011-08-18.lzo.index
> ..etc
> {noformat}
> The issue is that org.apache.hadoop.hive.ql.io.CombineHiveRecordReader is attempting to getRecordReader() for .lzo.index files. This throws a pretty confusing exception:
> {noformat}
> Caused by: java.io.IOException: No LZO codec found, cannot run.
>         at com.hadoop.mapred.DeprecatedLzoLineRecordReader.<init>(DeprecatedLzoLineRecordReader.java:53)
>         at com.hadoop.mapred.DeprecatedLzoTextInputFormat.getRecordReader(DeprecatedLzoTextInputFormat.java:128)
>         at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
> {noformat}
> More precisely, it dies on the second invocation of getRecordReader(). Here is some System.out.println() output:
> {noformat}
> DeprecatedLzoTextInputFormat.getRecordReader(): split=/tables/ourdata_2011-08-19.lzo:0+616479
> DeprecatedLzoTextInputFormat.getRecordReader(): split=/tables/ourdata_2011-08-19.lzo.index:0+64
> {noformat}
> DeprecatedLzoTextInputFormat contains the following code, which causes the ultimate exception and death of the query, as it obviously doesn't have a codec to read .lzo.index files.
> {noformat}
> final CompressionCodec codec = codecFactory.getCodec(file);
> if (codec == null) {
>   throw new IOException("No LZO codec found, cannot run.");
> }
> {noformat}
> So I understand that the way things are right now is that Hive considers all files within a directory to be part of a table. There is an open patch, HIVE-951, which would allow a quick workaround for this problem.
> Does it make sense to add some hooks so that CombineHiveRecordReader or its parents are more aware of what files should be considered instead of blindly trying to read everything?
> Any suggestions for a quick workaround to make it skip .index files?
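
To make the "skip .index files" idea concrete, here is a minimal sketch in plain Java of the kind of check a hook could apply before asking for a record reader. The class and method names (LzoIndexSkipper, isDataFile, dataFilesOnly) are hypothetical and are not part of Hive or hadoop-lzo. The sketch works on path strings only, so it runs without Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not an API of Hive or hadoop-lzo. */
final class LzoIndexSkipper {
    // Index files sit next to their data file as <name>.lzo.index,
    // as in the directory listing quoted in the issue.
    static final String INDEX_SUFFIX = ".lzo.index";

    private LzoIndexSkipper() {}

    /** Returns true if the path should be read as table data. */
    static boolean isDataFile(String path) {
        return !path.endsWith(INDEX_SUFFIX);
    }

    /** Returns the input paths with index files removed, order preserved. */
    static List<String> dataFilesOnly(List<String> paths) {
        List<String> kept = new ArrayList<>();
        for (String path : paths) {
            if (isDataFile(path)) {
                kept.add(path);
            }
        }
        return kept;
    }
}
```

In a Hadoop job the same predicate could presumably live inside an org.apache.hadoop.fs.PathFilter implementation, whose accept(Path) method returns a boolean. Whether the combine input path in Hive 0.7.1 would honor such a filter is not established in this thread, so treat that part as an assumption.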