hive.compactor.max.num.delta lets you control how many deltas are opened at once. By default it's 500, which may be too much. So the compactor will use this to do exactly what Owen is suggesting. The current implementation will do everything sequentially, but that's better than an OOM.

Eugene
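[Editor's note: a minimal sketch, not from the thread, of lowering the knob Eugene mentions. In practice it would normally go in hive-site.xml on the host running the metastore/compactor; the value 50 is an arbitrary illustration, not a tested recommendation.]

import org.apache.hadoop.hive.conf.HiveConf;

public class CompactorDeltaLimitExample {
    public static void main(String[] args) {
        // HiveConf extends Hadoop's Configuration, so plain setInt/getInt work.
        HiveConf conf = new HiveConf();

        // Default is 500 deltas opened per compaction pass; lowering it makes the
        // compactor merge deltas in smaller batches and keeps fewer ORC readers
        // (and their bloated footers) in memory at once.
        conf.setInt("hive.compactor.max.num.delta", 50);

        System.out.println("hive.compactor.max.num.delta = "
                + conf.getInt("hive.compactor.max.num.delta", 500));
    }
}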
From: Owen O'Malley <owen.omal...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Monday, September 17, 2018 at 2:00 PM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Re: Hive Compaction OOM

Ok, if you are against the wall, I'd suggest looking at the CompactorMR class, which is the class that the Metastore uses to launch the compactor jobs. You'll need to write code to call it with Table, StorageDescriptor, and ValidTxnList to do the minor compaction on a set of transactions.

For example, if you can read 5 delta files at once and you have 25, you could group them into sets (see the sketch below the quoted messages):

delta_100_100 .. delta_104_104 -> delta_100_104
...
delta_120_120 .. delta_124_124 -> delta_120_124

and then merge together the combined delta files:

delta_100_104 ... delta_120_124 -> delta_100_124

But you'd need to find the limit of how many transactions you can merge at once by testing.

On the plus side, the ACID v2 layout doesn't have this problem since it only needs to read a single ORC file at a time, and as Prasanth pointed out we have fixed the statistics so that long strings are truncated before they are written into the file footer in ORC-203.

.. Owen

On Mon, Sep 17, 2018 at 1:44 PM Shawn Weeks <swe...@weeksconsulting.us> wrote:

I've already tried giving the compactor 256+ gigabytes of memory. All that changes is how long it takes to run out of memory.

Thanks
Shawn Weeks

________________________________
From: Owen O'Malley <owen.omal...@gmail.com>
Sent: Monday, September 17, 2018 3:37:09 PM
To: user@hive.apache.org
Subject: Re: Hive Compaction OOM

How many files is it trying to merge at once? By far the easiest thing to do will be to give the compactor job more heap to work with.

In theory you could do multiple rounds of minor compaction to get around the problem. Unfortunately, the tool isn't designed to do that and I'm worried that without automation you would be risking your data.

.. Owen

On Mon, Sep 17, 2018 at 1:14 PM Shawn Weeks <swe...@weeksconsulting.us> wrote:

Tried the Binary thing, but since Hive Streaming in HDP 2.6 doesn't support Binary column types that's not going to work. See HIVE-18613.

Thanks
Shawn Weeks
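[Editor's note: a rough sketch of the batching Owen describes above. It is hypothetical helper code, not from the thread, and it only plans which single-transaction deltas would be grouped per round; it does not call CompactorMR, whose exact invocation differs between Hive versions.]

import java.util.ArrayList;
import java.util.List;

public class DeltaBatchPlanner {

    // Split the transaction range [firstTxn, lastTxn] into batches of at most
    // maxAtOnce deltas, matching the delta_N_N naming used in the thread.
    static List<long[]> planBatches(long firstTxn, long lastTxn, int maxAtOnce) {
        List<long[]> batches = new ArrayList<>();
        for (long start = firstTxn; start <= lastTxn; start += maxAtOnce) {
            long end = Math.min(start + maxAtOnce - 1, lastTxn);
            batches.add(new long[] { start, end });
        }
        return batches;
    }

    public static void main(String[] args) {
        // Owen's example: 25 deltas (txns 100..124), at most 5 readable at once.
        for (long[] b : planBatches(100, 124, 5)) {
            System.out.printf("delta_%d_%d .. delta_%d_%d -> delta_%d_%d%n",
                    b[0], b[0], b[1], b[1], b[0], b[1]);
        }
        // A second pass would then merge the combined deltas:
        // delta_100_104 ... delta_120_124 -> delta_100_124
    }
}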
________________________________
From: Shawn Weeks <swe...@weeksconsulting.us>
Sent: Monday, September 17, 2018 12:28:25 PM
To: user@hive.apache.org
Subject: Re: Hive Compaction OOM

2018-09-17 11:20:26,404 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
    at com.google.protobuf.CodedInputStream.readRawBytes(CodedInputStream.java:864)
    at com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:329)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1331)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1281)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1374)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1369)
    at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:4897)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.<init>(OrcProto.java:4813)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:5005)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:5000)
    at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$Footer.<init>(OrcProto.java:15836)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$Footer.<init>(OrcProto.java:15744)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$Footer$1.parsePartialFrom(OrcProto.java:15886)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$Footer$1.parsePartialFrom(OrcProto.java:15881)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
    at org.apache.hadoop.hive.ql.io.orc.OrcProto$Footer.parseFrom(OrcProto.java:16247)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractFooter(ReaderImpl.java:459)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractFileTail(ReaderImpl.java:438)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:319)
    at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:241)
    at org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.<init>(OrcRawRecordMerger.java:480)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRawReader(OrcInputFormat.java:1546)
    at org.apache.hadoop.hive.ql.txn.compactor.CompactorMR$CompactorMap.map(CompactorMR.java:655)
    at org.apache.hadoop.hive.ql.txn.compactor.CompactorMR$CompactorMap.map(CompactorMR.java:633)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)

________________________________
From: Owen O'Malley <owen.omal...@gmail.com>
Sent: Monday, September 17, 2018 11:28:43 AM
To: user@hive.apache.org
Subject: Re: Hive Compaction OOM

Shawn,
   Can you provide the stack trace that you get with the OOM?

Thanks,
   Owen

On Mon, Sep 17, 2018 at 9:27 AM Prasanth Jayachandran <pjayachand...@hortonworks.com> wrote:

Hi Shawn,

You might be running into issues related to huge protobuf objects from huge string columns.
Without https://issues.apache.org/jira/plugins/servlet/mobile#issue/ORC-203 there isn't an option other than providing sufficiently large memory. If you can reload the data with a binary type, that should help avoid this issue.

Thanks
Prasanth

On Mon, Sep 17, 2018 at 9:10 AM -0700, "Shawn Weeks" <swe...@weeksconsulting.us> wrote:

Let me start off by saying I've backed myself into a corner and would rather not reprocess the data if possible. I have a Hive transactional table in Hive 1.2.1 that was loaded via NiFi Hive Streaming with a fairly large String column containing XML documents. Awful, I know, and I'm working on changing how the data gets loaded. But I've got this table with so many deltas that Hive compaction runs out of memory and any queries on the table run out of memory. Any ideas on how I might get the data out of the table and split it into more buckets or something?

Thanks
Shawn Weeks
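[Editor's note: a hypothetical sketch, not from the thread, of Prasanth's "reload with a binary type" suggestion, assuming a plain SELECT over the table still succeeds (Shawn reports some queries also OOM, so this may only work after compaction or with a larger client heap). The connection URL and the table/column names xml_events, event_id, and xml_doc are made up for illustration.]

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ReloadXmlAsBinaryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port, and credentials are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver2-host:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {

            // Copy into a non-transactional ORC staging table with the large XML
            // column cast to BINARY. ORC does not store min/max value statistics
            // for BINARY columns, so the file footers stay small; the staging
            // table can then be re-bucketed or reloaded as needed.
            stmt.execute("CREATE TABLE xml_events_staging STORED AS ORC AS "
                    + "SELECT event_id, CAST(xml_doc AS BINARY) AS xml_doc "
                    + "FROM xml_events");
        }
    }
}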