Optimal approach for changing file format of a partitioned table

Elliot West Sat, 04 Aug 2018 04:28:23 -0700

Hi,

I’m trying to simply change the format of a very large partitioned table
from Json to ORC. I’m finding that it is unexpectedly resource intensive,
primarily due to a shuffle phase with the partition key. I end up running
out of disk space in what looks like a spill to disk in the reducers.
However, the partitioning scheme is identical on both the source and the
destination so my expectation is a map only job that simply rencodes each
file.


I’m using INSERT OVERWRITE TABLE with dynamic partitioning. I suspect I
could resolve my issue by allocating more storage to the task nodes.
However, can anyone advise a more resource and time efficient approach?

Cheers,

Elliot.

Optimal approach for changing file format of a partitioned table

Reply via email to