[ 
https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054595#comment-13054595
 ] 

Siying Dong commented on HIVE-2201:
-----------------------------------

Yongqiang:
1. As I commented previously, "According to Hairong Kuang, Hadoop's behavior for 
creating a new file is that it will automatically create its parent directory 
if it doesn't exist. In that case, I removed the directory check and create 
part when writing to a new file."
2. I tested the code. I ran the whole regression test suite and tested several 
cases manually on the cluster. I also tried killing some tasks manually.
3. I'll see whether there are any other dependencies, so that I can remove the 
old one. Having two overloaded calls is the convention we have in the file. All 
other similar calls have one function taking a Path and one taking a String. 
4. The tree traversal logic is copied from localizeMRTmpFilesImpl(). The first 
loop goes through every operator tree. The second loop does a Breadth-First 
Search of the operator tree to check for any FileSinkOperator.
5. OK. I'll make the change. My understanding is that only FileSinkOperator and 
the BlockMerge file sink have the problem, and the second one is going to get 
some large changes in HIVE-2035. Also, the BlockMerge file sink suffers from the 
problem less: it runs faster, so it has less chance of leaving incomplete results.
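
For reference, the two-loop traversal described in item 4 can be sketched 
roughly as below. This is an illustrative Python model, not the code from the 
patch or from localizeMRTmpFilesImpl(): the Operator and FileSinkOperator 
classes here are hypothetical stand-ins, and the "seen" set is an added 
assumption to cope with operators that are reachable from more than one root.

```python
from collections import deque


class Operator:
    """Hypothetical stand-in for an operator node (not Hive's real class)."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []


class FileSinkOperator(Operator):
    """Hypothetical marker subclass for the operators we are looking for."""


def find_file_sinks(top_operators):
    """Return every FileSinkOperator reachable from the given roots."""
    found = []
    seen = set()  # assumption: guard against visiting a shared operator twice
    # First loop: go through every operator tree.
    for root in top_operators:
        queue = deque([root])
        # Second loop: breadth-first search within that tree.
        while queue:
            op = queue.popleft()
            if id(op) in seen:
                continue
            seen.add(id(op))
            if isinstance(op, FileSinkOperator):
                found.append(op)
            queue.extend(op.children)
    return found
```

Calling find_file_sinks() with the list of top-level operators returns the 
file sinks in breadth-first order, each one reported once.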

> reduce name node calls in hive by creating temporary directories
> ----------------------------------------------------------------
>
>                 Key: HIVE-2201
>                 URL: https://issues.apache.org/jira/browse/HIVE-2201
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Namit Jain
>            Assignee: Siying Dong
>         Attachments: HIVE-2201.1.patch, HIVE-2201.2.patch, HIVE-2201.3.patch
>
>
> Currently, in Hive, when a file gets written by a FileSinkOperator,
> the sequence of operations is as follows:
> 1. In tmp directory tmp1, create a tmp file _tmp_1
> 2. At the end of the operator, move
> /tmp1/_tmp_1 to /tmp1/1
> 3. Move directory /tmp1 to /tmp2
> 4. For all files in /tmp2, remove all files starting with _tmp and
> duplicate files.
> Due to speculative execution, a lot of temporary files are created
> in /tmp1 (or /tmp2). This leads to a lot of name node calls,
> especially for large queries.
> The protocol above can be modified slightly:
> 1. In tmp directory tmp1, create a tmp file _tmp_1
> 2. At the end of the operator, move
> /tmp1/_tmp_1 to /tmp2/1
> 3. Move directory /tmp2 to /tmp3
> 4. For all files in /tmp3, remove all duplicate files.
> This should reduce the number of tmp files.
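
To make the difference between the two protocols in the description concrete, 
here is a small in-memory model. It is only a sketch under stated assumptions: 
directories are modelled as Python sets of file names, no HDFS or Hive API is 
involved, and the "<task>_<attempt>" naming of finished files is hypothetical, 
chosen so that duplicates from speculative attempts can be recognized.

```python
def run_attempts(attempts, work_dir, done_dir):
    """Simulate task attempts.

    attempts: list of (task_id, attempt_id, finished) tuples.
    Each attempt creates a _tmp file in work_dir; only finished attempts
    rename it to its final name in done_dir. Killed attempts leave the
    _tmp file behind.
    """
    for task, attempt, finished in attempts:
        tmp_name = f"_tmp_{task}_{attempt}"
        work_dir.add(tmp_name)
        if finished:
            work_dir.remove(tmp_name)
            done_dir.add(f"{task}_{attempt}")  # hypothetical naming scheme


def files_to_clean(final_dir):
    """List the files the final cleanup step would have to remove."""
    tmp_files = sorted(f for f in final_dir if f.startswith("_tmp"))
    seen_tasks, duplicates = set(), []
    for name in sorted(final_dir - set(tmp_files)):
        task = name.split("_")[0]
        if task in seen_tasks:
            duplicates.append(name)  # a second finished attempt of a task
        else:
            seen_tasks.add(task)
    return tmp_files + duplicates


# Task 1 finished twice (speculative execution); task 2 had one killed attempt.
attempts = [("1", 0, True), ("1", 1, True), ("2", 0, False), ("2", 1, True)]

# Current protocol: finished files stay in /tmp1, then /tmp1 is moved to /tmp2.
tmp1 = set()
run_attempts(attempts, work_dir=tmp1, done_dir=tmp1)
current_cleanup = files_to_clean(tmp1)

# Proposed protocol: finished files go to /tmp2, then /tmp2 is moved to /tmp3.
tmp1, tmp2 = set(), set()
run_attempts(attempts, work_dir=tmp1, done_dir=tmp2)
proposed_cleanup = files_to_clean(tmp2)

print(current_cleanup)   # ['_tmp_2_0', '1_1']
print(proposed_cleanup)  # ['1_1']
```

In this model the leftover _tmp file of the killed attempt never reaches the 
directory that is moved and cleaned under the proposed protocol, so the final 
pass only has to deal with duplicates.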

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
