Sahil Takiar created HIVE-15216:
-----------------------------------
Summary: Files on S3 are deleted one by one in INSERT OVERWRITE
queries
Key: HIVE-15216
URL: https://issues.apache.org/jira/browse/HIVE-15216
Project: Hive
Issue Type: Sub-task
Components: Hive
Reporter: Sahil Takiar
When running {{INSERT OVERWRITE}} queries, the files to be overwritten are deleted one by one. The reason is that, by default, {{hive.exec.stagingdir}} is located inside the target table directory.

Ideally Hive would just delete the entire table directory, but it can't do that since the staging data is also inside that directory. Instead it deletes each file one by one, which is very slow on S3 because every delete is a separate request.
There are a few ways to fix this:
1: Move the staging directory outside the table location. This can be done by setting {{hive.exec.stagingdir}} to a different location when running on S3 (a manual workaround is sketched after this list). It would be nice if users didn't have to set this explicitly when running on S3 and things just worked out of the box. My understanding is that {{hive.exec.stagingdir}} was only added to support HDFS encryption zones. Since S3 doesn't have encryption zones, there should be no problem with using the value of {{hive.exec.scratchdir}} to store all intermediate data instead.
2: Multi-thread the delete operations (a sketch of what this could look like follows below).
3: See if the {{S3AFileSystem}} can expose some type of bulk delete operation, since S3 supports deleting multiple keys in a single request (sketch below).
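For option 1, a possible manual workaround today (assuming {{hive.exec.stagingdir}} can be overridden at the session level; the path is just an example) is to point the staging directory at a location outside the table before running the query, e.g. {{SET hive.exec.stagingdir=/tmp/hive/.hive-staging;}}. The proposal here is to make that the default behaviour for S3 so users don't need the manual step.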
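For option 2, here is a minimal sketch of what a parallel delete could look like, using the Hadoop {{FileSystem}} API and a plain thread pool. {{ParallelDelete}}, {{deleteChildren}} and the thread count are illustrative, not existing Hive code:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative helper, not existing Hive code: delete the children of a
// directory in parallel instead of one at a time.
public class ParallelDelete {
  public static void deleteChildren(Path dir, Configuration conf, int threads)
      throws IOException, InterruptedException {
    FileSystem fs = dir.getFileSystem(conf);
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (FileStatus stat : fs.listStatus(dir)) {
        final Path p = stat.getPath();
        // Each child (file or sub-directory) is deleted on its own thread.
        futures.add(pool.submit(() -> fs.delete(p, true)));
      }
      for (Future<?> f : futures) {
        try {
          f.get();  // surface any failed delete
        } catch (ExecutionException e) {
          throw new IOException("Parallel delete failed", e.getCause());
        }
      }
    } finally {
      pool.shutdown();
    }
  }
}
{code}

Even though S3's per-request latency stays the same, issuing the deletes concurrently should cut the wall-clock time roughly in proportion to the number of threads.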
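For option 3, S3's multi-object delete API removes up to 1000 keys in a single request. Here is a sketch of what a bulk delete could look like directly against the AWS SDK for Java; {{BulkDeleteSketch}} and {{deleteKeys}} are hypothetical, and {{S3AFileSystem}} does not currently expose anything like this publicly:

{code:java}
import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;
import com.amazonaws.services.s3.model.DeleteObjectsResult;

// Illustrative sketch, not S3AFileSystem code: S3's multi-object delete
// removes up to 1000 keys per request instead of one round trip per key.
public class BulkDeleteSketch {
  public static int deleteKeys(AmazonS3 s3, String bucket, List<String> keys) {
    // Callers with more than 1000 keys would need to batch the requests.
    DeleteObjectsRequest request = new DeleteObjectsRequest(bucket)
        .withKeys(keys.toArray(new String[0]));
    DeleteObjectsResult result = s3.deleteObjects(request);
    return result.getDeletedObjects().size();
  }
}
{code}

If S3A exposed something along these lines, Hive could hand it the full list of files to overwrite instead of issuing one delete per file.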
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)