Hi, I am using Hive 2.1.0 on Amazon EMR. In my 'insert overwrite' job, whose source and target tables are both in S3, I notice that two copies of the result are created in the temp directory on S3.
First, the output of the query is written by the MR job to a temp directory in S3 (e.g. ext-10000). Then the MR job completes, but the Hive client still doesn't terminate. Instead, I see the entire temp directory being copied again within S3, file by file, into another directory (e.g. tmp-ext-10000).

Is this a known issue? In my case, the query reads about 0.5 terabytes of data, performs an aggregation, and writes back to S3. The second copy is very slow and usually fails with a NoHttpResponseException from S3.

Let me know if this is a known issue, whether there are workarounds, or whether there are config options to avoid the two copies.

Thanks,
pala
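For context, the job is essentially of this shape (the table and column names below are placeholders, not my actual schema):

```sql
-- Hypothetical minimal reproduction: both tables are EXTERNAL tables
-- whose LOCATION points at S3. The aggregation forces a full MR job,
-- after which I see the extra S3-to-S3 copy of the temp output.
INSERT OVERWRITE TABLE s3_target
SELECT key_col, SUM(value_col)
FROM s3_source
GROUP BY key_col;
```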