Prefacing this with your Hive version would help, because the version matters here (e.g. TEZ-3709 doesn't apply to hive-1.2).
> I’m trying to simply change the format of a very large partitioned table
> from Json to ORC. I’m finding that it is unexpectedly resource intensive,
> primarily due to a shuffle phase with the partition key. I end up running
> out of disk space in what looks like a spill to disk in the reducers.

The "shuffle phase with the partition key" sounds like you have the sorted
dynamic partition optimization enabled, which is necessary to avoid OOMs in
the writers, due to the split-generation complications described below.

> However, the partitioning scheme is identical on both the source and the
> destination so my expectation is a map only job that simply rencodes each
> file.

That would've been nice to have, except that split generation in the MR
InputFormats uses the locality of the files and splits a single partition
into multiple splits, then recombines them by hostname - so the splits
aren't aligned along partition boundaries.

> However, can anyone advise a more resource and time efficient approach?

If you don't have enough scratch space to store the same data 2x (at a
minimum - the shuffle merge keeps a complete spill around for every 100
inputs), it might help to do this as separate jobs (i.e. relaunch the AMs)
so that you can delete all the scratch storage between the partitions.

The usual chunk size I use is around 30TB per insert (which corresponds to
7 years of data in my warehouse tests). I have for-loop scripts which go
over the data and generate chunked insert scripts; they are fairly trivial
to adapt to a different use-case.

The scratch-space issue is actually tied to some assumptions in this
codepath (dating all the way back to 2007), which optimizes the shuffle for
spinning disks by spilling and merging to cut down on IOPS. I hope to
rewrite it entirely with something like Apache Crail (to take advantage of
NVRAM+RDMA) once there's no longer a need for compatibility with spinning
disks.

However, most of the next set of optimizations requires a closer look at
the task counters, the cluster size and the total data size.

Cheers,
Gopal
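P.S. For reference, this is roughly the set of knobs in play for the
conversion - a minimal sketch only; src_json, dst_orc and the dt partition
column are made-up names, the SET lines are the actual settings:

    -- made-up table/column names; assumes both tables are partitioned
    -- by dt and share the same column layout
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;
    -- this is the setting that introduces the shuffle on the partition
    -- key: it trades a reduce phase for bounded writer memory
    SET hive.optimize.sort.dynamic.partition=true;

    INSERT OVERWRITE TABLE dst_orc PARTITION (dt)
    SELECT * FROM src_json;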
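And this is the general shape of the for-loop scripts mentioned above - a
sketch under assumptions, not my actual script: it takes dt to be a string
'YYYY-MM-DD' partition column, one year per ~30TB chunk, and invents the
table names and year range:

    #!/bin/bash
    # Emit one insert script per chunk, then run each as a separate job
    # so the AM is relaunched and the scratch storage can be reclaimed
    # between chunks.
    for y in $(seq 2007 2017); do
      cat > insert_${y}.sql <<EOF
    SET hive.exec.dynamic.partition.mode=nonstrict;
    SET hive.optimize.sort.dynamic.partition=true;
    INSERT OVERWRITE TABLE dst_orc PARTITION (dt)
    SELECT * FROM src_json
    WHERE dt >= '${y}-01-01' AND dt < '$((y + 1))-01-01';
    EOF
      hive -f insert_${y}.sql
    done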