[ https://issues.apache.org/jira/browse/HIVE-11592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasanth Jayachandran updated HIVE-11592:
-----------------------------------------
    Description: 
If a file has too many small stripes and many columns, the overhead of storing
the per-stripe metadata (column statistics) can exceed the default protobuf
message size limit of 64MB. Reading such a file throws the following exception:
{code}
Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large.  May be malicious.  Use CodedInputStream.setSizeLimit() to increase the size limit.
        at com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:110)
        at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:755)
        at com.google.protobuf.CodedInputStream.readRawBytes(CodedInputStream.java:811)
        at com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:329)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1331)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1281)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1374)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1369)
        at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:4887)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:4803)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4990)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4985)
        at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.<init>(OrcProto.java:12925)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.<init>(OrcProto.java:12872)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12961)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12956)
        at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.<init>(OrcProto.java:13599)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.<init>(OrcProto.java:13546)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13635)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13630)
        at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
        at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
        at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
        at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
        at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.parseFrom(OrcProto.java:13746)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl$MetaInfoObjExtractor.<init>(ReaderImpl.java:468)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:314)
        at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228)
        at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:67)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{code}
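
For a rough sense of scale (illustrative numbers, not measurements from a real
file): at roughly 100 bytes of serialized statistics per column per stripe, a
file with 1,000 columns and 1,000 small stripes already carries about 100MB of
stripe statistics, well past the 64MB default limit.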

The only workaround is to programmatically raise the CodedInputStream size
limit via CodedInputStream.setSizeLimit(). We should make this limit
configurable via a Hive config property so that such ORC files remain
readable. Alternatively, we can keep increasing the limit and retrying until
parsing succeeds.
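
A minimal sketch of the retry approach is below. It is illustrative only, not
the actual patch: the MetadataParseSketch class, the StreamSupplier interface
that re-opens the metadata byte range per attempt, and the starting/maximum
limits are all assumptions; in a real fix the starting limit would come from
the proposed Hive config property.
{code}
import java.io.IOException;
import java.io.InputStream;

import com.google.protobuf.CodedInputStream;
import com.google.protobuf.InvalidProtocolBufferException;

import org.apache.hadoop.hive.ql.io.orc.OrcProto;

public class MetadataParseSketch {
  /** Hypothetical supplier that re-opens the metadata byte range of the ORC
   *  file; a CodedInputStream cannot be rewound after a failed parse. */
  interface StreamSupplier {
    InputStream open() throws IOException;
  }

  static OrcProto.Metadata parseMetadataWithRetry(StreamSupplier metadata)
      throws IOException {
    int limit = 64 << 20;            // protobuf's default limit: 64MB
    final int maxLimit = 1 << 30;    // placeholder upper bound: 1GB
    while (true) {
      try (InputStream raw = metadata.open()) {
        CodedInputStream in = CodedInputStream.newInstance(raw);
        in.setSizeLimit(limit);      // raise the per-message size cap
        return OrcProto.Metadata.parseFrom(in);
      } catch (InvalidProtocolBufferException e) {
        if (limit >= maxLimit) {
          throw e;                   // give up instead of looping forever
        }
        limit <<= 1;                 // double the limit and retry
      }
    }
  }
}
{code}
A config-driven variant would simply seed the initial limit from the new Hive
property instead of hardcoding 64MB, avoiding the retries entirely when the
operator already knows the required size.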


> ORC metadata section can sometimes exceed protobuf message size limit
> ---------------------------------------------------------------------
>
>                 Key: HIVE-11592
>                 URL: https://issues.apache.org/jira/browse/HIVE-11592
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.3.0, 2.0.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
