[ https://issues.apache.org/jira/browse/HIVE-28026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810378#comment-17810378 ]
Raghav Aggarwal commented on HIVE-28026:
----------------------------------------

Stack trace:
{code:java}
Caused by: com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either that the input has been truncated or that an embedded message misreported its own length.
    at com.google.protobuf.InvalidProtocolBufferException.truncatedMessage(InvalidProtocolBufferException.java:115)
    at com.google.protobuf.CodedInputStream$StreamDecoder.pushLimit(CodedInputStream.java:2715)
    at com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(CodedInputStream.java:2407)
    at org.apache.hadoop.hive.ql.hooks.proto.HiveHookEvents$HiveHookEventProto.<init>(HiveHookEvents.java:1142)
    at org.apache.hadoop.hive.ql.hooks.proto.HiveHookEvents$HiveHookEventProto.<init>(HiveHookEvents.java:1018)
    at org.apache.hadoop.hive.ql.hooks.proto.HiveHookEvents$HiveHookEventProto$1.parsePartialFrom(HiveHookEvents.java:3391)
    at org.apache.hadoop.hive.ql.hooks.proto.HiveHookEvents$HiveHookEventProto$1.parsePartialFrom(HiveHookEvents.java:3385)
    at com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(CodedInputStream.java:2409)
    at org.apache.tez.dag.history.logging.proto.ProtoMessageWritable.readFields(ProtoMessageWritable.java:100)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
    at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:2374)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2347)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:109)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:84)
    at org.apache.hadoop.hive.ql.io.protobuf.ProtobufMessageInputFormat$1.next(ProtobufMessageInputFormat.java:124)
    at org.apache.hadoop.hive.ql.io.protobuf.ProtobufMessageInputFormat$1.next(ProtobufMessageInputFormat.java:84)
    at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:360)
    ... 24 more
{code}

> Reading proto data more than 2GB from multiple splits fails
> -----------------------------------------------------------
>
>                 Key: HIVE-28026
>                 URL: https://issues.apache.org/jira/browse/HIVE-28026
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.0.0-beta-1
>         Environment: 
>            Reporter: Raghav Aggarwal
>            Assignee: Raghav Aggarwal
>            Priority: Major
>
> {*}Query{*}: select * from _<table_name>_
>
> {*}Explanation{*}:
> On running the above-mentioned query on a Hive proto table, multiple Tez
> containers will be spawned to process the data. In a container, if there are
> multiple HDFS splits and the combined size of the decompressed data is more
> than 2GB, then the query fails with the following error:
>
> {code:java}
> "While parsing a protocol message, the input ended unexpectedly in the middle
> of a field. This could mean either that the input has been truncated or that
> an embedded message misreported its own length." {code}
>
> This is happening because of this line in
> [CodedInputStream|https://github.com/protocolbuffers/protobuf/blob/54489e95e01882407f356f83c9074415e561db00/java/core/src/main/java/com/google/protobuf/CodedInputStream.java#L2712C7-L2712C16]:
> {code:java}
> byteLimit += totalBytesRetired + pos; {code}
> _byteLimit_ overflows (integer overflow) because _totalBytesRetired_ retains
> a count of all the bytes read so far: the CodedInputStream is initialized
> only once per container
> ([https://github.com/apache/hive/blob/564d7e54d2360488611da39d0e5f027a2d574fc1/ql/src/java/org/apache/tez/dag/history/logging/proto/ProtoMessageWritable.java#L96]).
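>
> To make the wrap-around concrete, here is a standalone sketch with
> illustrative numbers (the variable names mirror the protobuf source; the
> values are made up, not taken from a real run):
>
> {code:java}
> // Illustrative only: reproduces the int arithmetic of the line above
> // with hypothetical values for a container that has already read ~2GB.
> public class PushLimitOverflowDemo {
>     public static void main(String[] args) {
>         int totalBytesRetired = Integer.MAX_VALUE - 1024; // bytes retired across earlier splits
>         int pos = 512;                                    // offset in the current buffer
>         int byteLimit = 4096;                             // declared size of the next message
>
>         byteLimit += totalBytesRetired + pos;             // int arithmetic wraps around
>         System.out.println(byteLimit);                    // prints -2147480065
>
>         // With the limit corrupted to a negative value, the stream's limit
>         // bookkeeping breaks down and parsing fails with the
>         // "input ended unexpectedly in the middle of a field" error.
>     }
> }
> {code}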
>
> This is different from the issue reproduced in
> [https://github.com/zabetak/protobuf-large-message], as there a single proto
> data file is more than 2GB, whereas in my case there are multiple files that
> together total more than 2GB.
>
> *Limitation:*
> This fix will still not resolve the issue mentioned in
> [https://github.com/protocolbuffers/protobuf/issues/11729].
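>
> For context, a minimal sketch of the kind of change that stops the
> cross-split accumulation: reset the stream's size counter after each record
> so _totalBytesRetired_ never approaches 2GB. _resetSizeCounter()_ is a public
> CodedInputStream method, but the reader loop below is hypothetical and only
> illustrates the idea, not the actual patch; and, per the limitation above,
> no reset can help when a single message itself exceeds 2GB.
>
> {code:java}
> import com.google.protobuf.CodedInputStream;
> import java.io.IOException;
> import java.io.InputStream;
>
> // Hypothetical reader loop: one CodedInputStream reused across many
> // length-prefixed records (much as ProtoMessageWritable reuses its
> // stream across splits), with the retired-byte counter reset per record.
> public class ResetPerRecordSketch {
>     public static void readAll(InputStream raw) throws IOException {
>         CodedInputStream cin = CodedInputStream.newInstance(raw);
>         cin.setSizeLimit(Integer.MAX_VALUE);   // per-message cap stays at 2GB
>         while (!cin.isAtEnd()) {
>             int len = cin.readRawVarint32();   // length prefix of the next record
>             int oldLimit = cin.pushLimit(len); // safe: the counter was just reset
>             cin.skipRawBytes(len);             // stand-in for actual message parsing
>             cin.popLimit(oldLimit);
>             cin.resetSizeCounter();            // zero out totalBytesRetired
>         }
>     }
> }
> {code}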