Tried the binary thing, but since Hive Streaming in HDP 2.6 doesn't support binary column types, that's not going to work. See HIVE-18613.
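
For what it's worth, the one-off SQL version of the reload-to-binary suggestion would look roughly like the sketch below (table and column names are made up, and it's untested against a table that already OOMs on read). It only converts data that has already landed; the NiFi Hive Streaming ingest path would still have to write the column as a string.

-- Hypothetical conversion of the XML column from STRING to BINARY so ORC
-- stops carrying huge string min/max statistics in the file footer.
CREATE TABLE xml_docs_bin (
  id      BIGINT,
  xml_doc BINARY
)
CLUSTERED BY (id) INTO 16 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- encode() returns binary; reading the source table may still need extra heap
-- because the existing string statistics have to be parsed one more time.
INSERT INTO TABLE xml_docs_bin
SELECT id, encode(xml_doc, 'UTF-8') FROM xml_docs;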
Thanks
Shawn Weeks

________________________________
From: Shawn Weeks <swe...@weeksconsulting.us>
Sent: Monday, September 17, 2018 12:28:25 PM
To: user@hive.apache.org
Subject: Re: Hive Compaction OOM

2018-09-17 11:20:26,404 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
    at com.google.protobuf.CodedInputStream.readRawBytes(CodedInputStream.java:864)
    at com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:329)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1331)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1281)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1374)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1369)
    at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:4897)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:4813)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:5005)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:5000)
    at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$Footer.<init>(OrcProto.java:15836)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$Footer.<init>(OrcProto.java:15744)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$Footer$1.parsePartialFrom(OrcProto.java:15886)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$Footer$1.parsePartialFrom(OrcProto.java:15881)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$Footer.parseFrom(OrcProto.java:16247)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractFooter(ReaderImpl.java:459)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractFileTail(ReaderImpl.java:438)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:319)
    at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:241)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:480)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRawReader(OrcInputFormat.java:1546)
    at org.apache.hadoop.hive.ql.txn.compactor.CompactorMR$CompactorMap.map(CompactorMR.java:655)
    at org.apache.hadoop.hive.ql.txn.compactor.CompactorMR$CompactorMap.map(CompactorMR.java:633)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)

________________________________
From: Owen O'Malley <owen.omal...@gmail.com>
Sent: Monday, September 17, 2018 11:28:43 AM
To: user@hive.apache.org
Subject: Re: Hive Compaction OOM

Shawn,
   Can you provide the stack trace that you get with the OOM?

Thanks,
   Owen

On Mon, Sep 17, 2018 at 9:27 AM Prasanth Jayachandran <pjayachand...@hortonworks.com> wrote:

Hi Shawn

You might be running into issues related to huge protobuf objects from huge string columns.
Without https://issues.apache.org/jira/plugins/servlet/mobile#issue/ORC-203 there isn’t an option other than providing sufficiently large memory. If you can reload the data with a binary type, that should help avoid this issue.

Thanks
Prasanth

On Mon, Sep 17, 2018 at 9:10 AM -0700, "Shawn Weeks" <swe...@weeksconsulting.us> wrote:

Let me start off by saying I've backed myself into a corner and would rather not reprocess the data if possible. I have a Hive transactional table in Hive 1.2.1 that was loaded via NiFi Hive Streaming with a fairly large string column containing XML documents. Awful, I know, and I'm working on changing how the data gets loaded. But I've got this table with so many deltas that Hive compaction runs out of memory, and any queries on the table run out of memory too. Any ideas on how I might get the data out of the table and split it into more buckets or something?

Thanks
Shawn Weeks
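
A rough, untested sketch of the two options discussed above: giving this one table's compaction job more memory, or copying the rows into a transactional table with more buckets. Table names, the bucket count, and the memory figures are placeholders, and the "compactor.*" per-table overrides are only honored by Hive builds that support them.

-- (1) Raise memory for just this table's compaction job (assumes the
--     compactor.* table-property overrides are available in this Hive build).
ALTER TABLE xml_docs SET TBLPROPERTIES (
  'compactor.mapreduce.map.memory.mb' = '8192',
  'compactor.mapreduce.map.java.opts' = '-Xmx7168m'
);
ALTER TABLE xml_docs COMPACT 'major';

-- (2) Copy the data into a new transactional table with more buckets so each
--     compaction/read task only has to handle a slice of the XML column.
--     The INSERT still has to read the existing deltas, so it may need the
--     same heap bump to get through.
CREATE TABLE xml_docs_split (
  id      BIGINT,
  xml_doc STRING
)
CLUSTERED BY (id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO TABLE xml_docs_split
SELECT id, xml_doc FROM xml_docs;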