Hi,

I’m trying to simply change the format of a very large partitioned table
from Json to ORC. I’m finding that it is unexpectedly resource intensive,
primarily due to a shuffle phase with the partition key. I end up running
out of disk space in what looks like a spill to disk in the reducers.
However, the partitioning scheme is identical on both the source and the
destination so my expectation is a map only job that simply rencodes each
file.

I’m using INSERT OVERWRITE TABLE with dynamic partitioning. I suspect I
could resolve my issue by allocating more storage to the task nodes.
However, can anyone advise a more resource and time efficient approach?

Cheers,

Elliot.

Reply via email to