Hi, I am using Hive 2.1.0 on Amazon EMR. In my 'insert overwrite' job, whose source and target tables are both in S3, I notice that two copies of the result are created in the temp directory on S3.
First, the output of the query is written by the MR job to a temp directory in S3 (e.g. ext-10000). Then the MR job completes, but the Hive client still doesn't terminate. Instead, I see the entire temp directory being copied again within S3, file by file, into another directory (e.g. tmp-ext-10000).

Is this a known issue? In my case, the query reads about 0.5 terabytes of data, performs an aggregation, and writes back to S3. The second copy is very slow and usually fails with a NoHttpResponseException from S3.

Let me know if this is a known issue, whether there are workarounds, or whether there are config options to avoid the two copies.

Thanks,
pala
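For context, the job is essentially of this shape (the table and column names below are placeholders, not my actual schema):

```sql
-- Hypothetical minimal reproduction: both tables are EXTERNAL tables
-- whose LOCATION points at S3. The aggregation forces a full MR job,
-- after which I see the extra S3-to-S3 copy of the temp output.
INSERT OVERWRITE TABLE s3_target
SELECT key_col, SUM(value_col)
FROM s3_source
GROUP BY key_col;
```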