Prasanth Jayachandran created HIVE-11592:
--------------------------------------------

             Summary: ORC metadata section can sometimes exceed protobuf message size limit
                 Key: HIVE-11592
                 URL: https://issues.apache.org/jira/browse/HIVE-11592
             Project: Hive
          Issue Type: Bug
    Affects Versions: 1.3.0, 2.0.0
            Reporter: Prasanth Jayachandran
            Assignee: Prasanth Jayachandran


If a file has many small stripes and many columns, the overhead of storing 
metadata (column statistics) can exceed the default protobuf message size limit 
of 64MB. Reading such a file throws the following exception:
{code}
Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large.  May be malicious.  Use CodedInputStream.setSizeLimit() to increase the size limit.
        at com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:110)
        at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:755)
        at com.google.protobuf.CodedInputStream.readRawBytes(CodedInputStream.java:811)
        at com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:329)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1331)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1281)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1374)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1369)
        at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:4887)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:4803)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4990)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4985)
        at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.<init>(OrcProto.java:12925)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.<init>(OrcProto.java:12872)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12961)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12956)
        at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.<init>(OrcProto.java:13599)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.<init>(OrcProto.java:13546)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13635)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13630)
        at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
        at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
        at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
        at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.parseFrom(OrcProto.java:13746)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl$MetaInfoObjExtractor.<init>(ReaderImpl.java:468)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:314)
        at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228)
        at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:67)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{code}
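
To give a rough feel for how stripe count and column count combine to cross the 64MB limit, here is a back-of-the-envelope sketch. The class name, the method names, and the figure of 20 bytes of statistics per column per stripe are illustrative assumptions, not measurements from a real ORC file.

```java
public class OrcMetadataEstimate {
    // Default protobuf message size limit named in the report: 64MB.
    static final long PROTOBUF_DEFAULT_LIMIT = 64L * 1024 * 1024;

    // Rough estimate: one statistics entry per column per stripe.
    // bytesPerColumnStats is a hypothetical average, not an ORC constant.
    static long estimateMetadataBytes(long stripes, long columns, long bytesPerColumnStats) {
        return Math.multiplyExact(Math.multiplyExact(stripes, columns), bytesPerColumnStats);
    }

    static boolean exceedsDefaultLimit(long stripes, long columns, long bytesPerColumnStats) {
        return estimateMetadataBytes(stripes, columns, bytesPerColumnStats) > PROTOBUF_DEFAULT_LIMIT;
    }

    public static void main(String[] args) {
        // Hypothetical file: 20,000 small stripes, 200 columns, ~20 bytes of stats each.
        long estimate = estimateMetadataBytes(20_000, 200, 20);
        System.out.println("estimated metadata bytes: " + estimate);
        System.out.println("exceeds 64MB default: " + exceedsDefaultLimit(20_000, 200, 20));
    }
}
```

Under these assumed numbers the estimate is 80,000,000 bytes, which is above the 67,108,864-byte default, while the same schema with only 100 stripes stays far below it.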

The only solution for this is to programmatically increase the CodedInputStream 
size limit. We should make this configurable via Hive config so that the ORC 
file is readable. 
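
One way the configurable limit could be resolved is sketched below, using only the JDK. The property key `example.orc.protobuf.size.limit`, the class name, and the clamping behaviour are all assumptions made for illustration; this is not an existing Hive setting or the eventual patch. The resolved value is what a reader would pass to `CodedInputStream.setSizeLimit()` (the method named in the exception message) before parsing the metadata section.

```java
import java.util.Properties;

public class ProtobufLimitConfig {
    // Hypothetical property key, used only for this sketch.
    static final String LIMIT_KEY = "example.orc.protobuf.size.limit";
    // Default protobuf message size limit named in the report: 64MB.
    static final int DEFAULT_LIMIT = 64 * 1024 * 1024;

    // Returns the limit in bytes, falling back to the 64MB default when unset.
    // The result is clamped to Integer.MAX_VALUE because setSizeLimit takes an int.
    static int resolveLimit(Properties conf) {
        String raw = conf.getProperty(LIMIT_KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_LIMIT;
        }
        long parsed = Long.parseLong(raw.trim());
        if (parsed <= 0) {
            throw new IllegalArgumentException(LIMIT_KEY + " must be positive: " + raw);
        }
        return (int) Math.min(parsed, (long) Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(LIMIT_KEY, String.valueOf(1024L * 1024 * 1024)); // 1GB
        // A reader would then call setSizeLimit(resolveLimit(conf)) on its
        // CodedInputStream before parsing the ORC metadata.
        System.out.println("resolved limit: " + resolveLimit(conf));
    }
}
```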



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
