[jira] [Commented] (HIVE-6134) Merging small files based on file size only works for CTAS queries

Xuefu Zhang (JIRA) Fri, 03 Jan 2014 13:05:16 -0800

    [ https://issues.apache.org/jira/browse/HIVE-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861903#comment-13861903 ]


Xuefu Zhang commented on HIVE-6134:
-----------------------------------

[~ericchu30] I guess my comments above were a little off topic. I thought 
the problem you mentioned was about too many small files for a table (which my 
comments above were mostly about), but now I realize that the problem is about a 
query producing too many small files. Thanks for the clarification.

The two problems are different yet seemingly related. I'm wondering whether 
problem #2 (too many small files from a query) is caused by problem #1 
(too many small files for a table). I cannot imagine such a case (besides too 
many mappers or reducers), but I would appreciate it if you could share yours.

If the answer is yes, then the proposal that I outlined above may prevent 
problem #2 from happening. If not, then it may make sense to have both. For 
information only, HIVE-439, which originally introduced the merge feature, 
seems to target only small files from mappers, without mentioning whether this 
is for query results or table files. However, the comments did mention 
movetask, which may be related to the code you saw.

For the Hue issue you mentioned, I'd think that getting rid of the small files 
one way or the other seems reasonable.  

> Merging small files based on file size only works for CTAS queries
> ------------------------------------------------------------------
>
>                 Key: HIVE-6134
>                 URL: https://issues.apache.org/jira/browse/HIVE-6134
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.10.0, 0.11.0, 0.12.0
>            Reporter: Eric Chu
>
> According to the documentation, if we set hive.merge.mapfiles to true, Hive 
> will launch an additional MR job to merge the small output files at the end 
> of a map-only job when the average output file size is smaller than 
> hive.merge.smallfiles.avgsize. Similarly, by setting hive.merge.mapredfiles 
> to true, Hive will merge the output files of a map-reduce job. 
> My expectation is that this is true for all MR queries. However, my 
> observation is that this is only true for CTAS queries. In 
> GenMRFileSink1.java, HIVEMERGEMAPFILES and HIVEMERGEMAPREDFILES are only used 
> if ((ctx.getMvTask() != null) && (!ctx.getMvTask().isEmpty())). So, for a 
> regular SELECT query that doesn't have move tasks, these properties are not 
> used.
> Is my understanding correct and if so, what's the reasoning behind the logic 
> of not supporting this for regular SELECT queries? It seems to me that this 
> should be supported for regular SELECT queries as well. One scenario where 
> this hits us hard is when users try to download the result in HUE, and HUE 
> times out b/c there are thousands of output files. The workaround is to 
> re-run the query as CTAS, but it's a significant time sink.
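
To make the reported behavior concrete, here is a minimal sketch of the two conditions described in the issue: the documented rule (merge flag on and average output file size below hive.merge.smallfiles.avgsize) and the extra move-task gate the reporter observed. This is not Hive's code. The class, method, and parameter names are hypothetical, and the threshold passed in is just an example value.

```java
import java.util.List;

// Illustrative sketch only; NOT taken from GenMRFileSink1.java.
// All names below are hypothetical and exist only for this example.
final class MergeDecisionSketch {

    // Documented rule: merge when the flag for the job type is enabled and the
    // average output file size is below the configured threshold.
    static boolean documentedRule(boolean mapOnlyJob,
                                  boolean mergeMapFiles,     // stands in for hive.merge.mapfiles
                                  boolean mergeMapRedFiles,  // stands in for hive.merge.mapredfiles
                                  List<Long> outputFileSizes,
                                  long avgSizeThreshold) {   // stands in for hive.merge.smallfiles.avgsize
        boolean enabled = mapOnlyJob ? mergeMapFiles : mergeMapRedFiles;
        if (!enabled || outputFileSizes.isEmpty()) {
            return false;
        }
        long total = 0;
        for (long size : outputFileSizes) {
            total += size;
        }
        double average = (double) total / outputFileSizes.size();
        return average < avgSizeThreshold;
    }

    // Behavior reported in the issue: the merge flags are consulted only when
    // the query has move tasks, so a plain SELECT never reaches the rule above.
    static boolean reportedBehavior(boolean hasMoveTasks,
                                    boolean mapOnlyJob,
                                    boolean mergeMapFiles,
                                    boolean mergeMapRedFiles,
                                    List<Long> outputFileSizes,
                                    long avgSizeThreshold) {
        return hasMoveTasks
                && documentedRule(mapOnlyJob, mergeMapFiles, mergeMapRedFiles,
                                  outputFileSizes, avgSizeThreshold);
    }
}
```

Under this sketch, a map-only SELECT with thousands of tiny output files satisfies the documented rule but, having no move tasks, is never merged, which matches the gap the reporter describes.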



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
