Hi, I’m trying to simply change the format of a very large partitioned table from Json to ORC. I’m finding that it is unexpectedly resource intensive, primarily due to a shuffle phase with the partition key. I end up running out of disk space in what looks like a spill to disk in the reducers. However, the partitioning scheme is identical on both the source and the destination so my expectation is a map only job that simply rencodes each file.
I’m using INSERT OVERWRITE TABLE with dynamic partitioning. I suspect I could resolve my issue by allocating more storage to the task nodes. However, can anyone advise a more resource and time efficient approach? Cheers, Elliot.