[jira] [Commented] (HIVE-6134) Merging small files based on file size only works for CTAS queries

Eric Chu (JIRA) Sat, 04 Jan 2014 17:29:15 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862462#comment-13862462
 ]


Eric Chu commented on HIVE-6134:
--------------------------------

[~xuefu.w...@kodak.com] We notice that the problem occurs when a query produces 
too many output files; however, this happens because the table itself has too 
many (but not necessarily small) files. Most of the queries that have this problem are 
regular SELECT FROM WHERE queries (no GROUP BY) that don't have reducers. Some 
of our tables have hundreds of GBs per partition; the biggest one has TBs of 
data per partition. It's not uncommon to see queries with thousands or tens of 
thousands of mappers, but no reducers. 

We are looking at other ways to mitigate this problem. What you suggest - 
merging files in a partition - is certainly something we are considering. 
Meanwhile, I want to consider supporting these properties for queries without a 
move task. Specifically, what are the reasons that we didn't support these 
properties for queries without a move task? And if we want to do so, what 
considerations should we make? We'd be willing to work on this, but we probably 
will need some guidance from domain experts. Thanks!

> Merging small files based on file size only works for CTAS queries
> ------------------------------------------------------------------
>
>                 Key: HIVE-6134
>                 URL: https://issues.apache.org/jira/browse/HIVE-6134
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.10.0, 0.11.0, 0.12.0
>            Reporter: Eric Chu
>
> According to the documentation, if we set hive.merge.mapfiles to true, Hive 
> will launch an additional MR job to merge the small output files at the end 
> of a map-only job when the average output file size is smaller than 
> hive.merge.smallfiles.avgsize. Similarly, by setting hive.merge.mapredfiles 
> to true, Hive will merge the output files of a map-reduce job. 
> My expectation is that this is true for all MR queries. However, my 
> observation is that this is only true for CTAS queries. In 
> GenMRFileSink1.java, HIVEMERGEMAPFILES and HIVEMERGEMAPREDFILES are only used 
> if ((ctx.getMvTask() != null) && (!ctx.getMvTask().isEmpty())). So, for a 
> regular SELECT query that doesn't have move tasks, these properties are not 
> used.
> Is my understanding correct and if so, what's the reasoning behind the logic 
> of not supporting this for regular SELECT queries? It seems to me that this 
> should be supported for regular SELECT queries as well. One scenario where 
> this hits us hard is when users try to download the result in HUE, and HUE 
> times out b/c there are thousands of output files. The workaround is to 
> re-run the query as CTAS, but it's a significant time sink.
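
To make the condition described in the issue concrete, here is a minimal Java 
sketch of the gating logic as the description reports it. It is not Hive's 
code: the class MergeDecisionSketch, its methods, and the example file sizes 
and threshold are all hypothetical, and the real check in GenMRFileSink1.java 
involves more state than is modeled here.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical sketch of the behavior described in HIVE-6134.
 * None of these names exist in Hive; they only illustrate the reported logic.
 */
public class MergeDecisionSketch {

    /** True when the average output file size is below the threshold
     *  (the role hive.merge.smallfiles.avgsize plays in the description). */
    public static boolean averageBelowThreshold(List<Long> fileSizes, long avgSizeThreshold) {
        if (fileSizes.isEmpty()) {
            return false; // nothing to merge
        }
        long total = 0;
        for (long size : fileSizes) {
            total += size;
        }
        return total / fileSizes.size() < avgSizeThreshold;
    }

    /**
     * Reported behavior: the merge flag is consulted only when the query has
     * a non-empty list of move tasks (modeled here as a boolean), so a plain
     * SELECT without a move task never merges regardless of the flag.
     */
    public static boolean wouldMerge(boolean mergeEnabled, boolean hasMoveTask,
                                     List<Long> fileSizes, long avgSizeThreshold) {
        return hasMoveTask && mergeEnabled
                && averageBelowThreshold(fileSizes, avgSizeThreshold);
    }

    public static void main(String[] args) {
        // Example values only: three small output files and an assumed threshold.
        List<Long> sizes = Arrays.asList(1_000_000L, 2_000_000L, 3_000_000L);
        long threshold = 16_000_000L;

        // CTAS-like query (has a move task): merge is triggered.
        System.out.println(wouldMerge(true, true, sizes, threshold));  // prints "true"
        // Plain SELECT (no move task): the flag is never consulted.
        System.out.println(wouldMerge(true, false, sizes, threshold)); // prints "false"
    }
}
```

In this sketch, supporting the properties for queries without a move task 
would amount to dropping the hasMoveTask term from wouldMerge; whether that is 
safe in Hive itself is exactly the open question in this issue.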



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
