Prefacing this with your Hive version would help, because the version matters here (e.g. TEZ-3709 doesn't apply to hive-1.2).
> I’m trying to simply change the format of a very large partitioned table
> from Json to ORC. I’m finding that it is unexpectedly resource intensive,
> primarily due to a shuffle phase with the partition key. I end up running
> out of disk space in what looks like a spill to disk in the reducers.

The "shuffle phase with the partition key" sounds like you have the sorted
dynamic partition optimization enabled, which is necessary to avoid OOMs in
the writers, due to the split-generation complications described below.

> However, the partitioning scheme is identical on both the source and the
> destination so my expectation is a map only job that simply rencodes each
> file.

That would've been nice to have, except that split generation in the MR
InputFormats uses the locality of the files and splits a single partition
into multiple splits, then recombines them by hostname - so the splits
aren't aligned along partition boundaries.

> However, can anyone advise a more resource and time efficient approach?

If you don't have enough scratch space to store the same data 2x (at a
minimum - the shuffle merge keeps a complete spill around for every 100
inputs), it might help to do this as separate jobs (i.e. relaunch the AMs)
so that you can delete all the scratch storage between the partitions.

The usual chunk size I use is around 30TB per insert (which corresponds to
7 years of data in my warehouse tests). I have for-loop scripts which go
over the data and generate chunked insert scripts; they are fairly trivial
to adapt to a different use-case.

The scratch-space issue is actually tied to some assumptions in this
codepath (dating all the way back to 2007), which optimizes the shuffle for
spinning disks by spilling and merging to cut down on IOPS. I hope to
rewrite it entirely with something like Apache Crail (to take advantage of
NVRAM+RDMA) once there's no longer a need for compatibility with spinning
disks.

However, most of the next set of optimizations requires a closer look at
the task counters, the cluster size and the total data size.

Cheers,
Gopal
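P.S. For reference, this is roughly the set of knobs in play for the
conversion - a minimal sketch only; src_json, dst_orc and the dt partition
column are made-up names, the SET lines are the actual settings:

    -- made-up table/column names; assumes both tables are partitioned
    -- by dt and share the same column layout
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    -- this is the setting that introduces the shuffle on the partition
    -- key: it trades a reduce phase for bounded writer memory
    SET hive.optimize.sort.dynamic.partition=true;

    INSERT OVERWRITE TABLE dst_orc PARTITION (dt)
    SELECT * FROM src_json;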
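And this is the general shape of the for-loop scripts mentioned above - a
sketch under assumptions, not my actual script: it takes dt to be a string
'YYYY-MM-DD' partition column, one year per ~30TB chunk, and invents the
table names and year range:

    #!/bin/bash
    # Emit one insert script per chunk, then run each as a separate job
    # so the AM is relaunched and the scratch storage can be reclaimed
    # between chunks.
    for y in $(seq 2007 2017); do
      cat > insert_${y}.sql <<EOF
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.optimize.sort.dynamic.partition=true;
    INSERT OVERWRITE TABLE dst_orc PARTITION (dt)
    SELECT * FROM src_json
    WHERE dt >= '${y}-01-01' AND dt < '$((y + 1))-01-01';
    EOF
      hive -f insert_${y}.sql
    done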