Eric Chu created HIVE-6134:
------------------------------

             Summary: Merging small files based on file size only works for CTAS queries
                 Key: HIVE-6134
                 URL: https://issues.apache.org/jira/browse/HIVE-6134
             Project: Hive
          Issue Type: Bug
    Affects Versions: 0.12.0, 0.11.0, 0.10.0, 0.8.0
            Reporter: Eric Chu


According to the documentation, if we set hive.merge.mapfiles to true, Hive 
will launch an additional MR job to merge the small output files at the end of 
a map-only job when the average output file size is smaller than 
hive.merge.smallfiles.avgsize. Similarly, by setting hive.merge.mapredfiles to 
true, Hive will merge the output files of a map-reduce job. 

I expected this to hold for all MR queries. However, I observe it only for 
CTAS queries. In GenMRFileSink1.java, HIVEMERGEMAPFILES and 
HIVEMERGEMAPREDFILES are only consulted if ((ctx.getMvTask() != null) && 
(!ctx.getMvTask().isEmpty())). So, for a regular SELECT query that has no move 
tasks, these properties are never used.
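
To make the gating concrete, here is a minimal sketch of the decision as I 
understand it. It is not Hive's actual code: the class MergeDecisionSketch, its 
method names, and the hasMoveTask flag are hypothetical stand-ins for the 
ctx.getMvTask() check, and the threshold value is only illustrative.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the merge decision described above; not Hive source code.
public class MergeDecisionSketch {

    /** Average size in bytes of the output files; 0 when the list is empty. */
    public static long averageSize(List<Long> fileSizes) {
        if (fileSizes.isEmpty()) {
            return 0L;
        }
        long total = 0L;
        for (long size : fileSizes) {
            total += size;
        }
        return total / fileSizes.size();
    }

    /**
     * mergeEnabled stands in for hive.merge.mapfiles / hive.merge.mapredfiles,
     * hasMoveTask for the non-empty ctx.getMvTask() check, and
     * avgSizeThreshold for hive.merge.smallfiles.avgsize.
     */
    public static boolean shouldMerge(boolean mergeEnabled, boolean hasMoveTask,
                                      List<Long> fileSizes, long avgSizeThreshold) {
        // Without a move task the merge settings are never consulted.
        if (!mergeEnabled || !hasMoveTask || fileSizes.isEmpty()) {
            return false;
        }
        return averageSize(fileSizes) < avgSizeThreshold;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(1_000_000L, 2_000_000L, 3_000_000L);
        long threshold = 16_000_000L; // illustrative threshold in bytes

        // CTAS-like case: a move task exists, so small files get merged.
        System.out.println(shouldMerge(true, true, sizes, threshold));
        // Plain SELECT case: no move task, so the settings have no effect.
        System.out.println(shouldMerge(true, false, sizes, threshold));
    }
}
```

Under these assumptions the two calls in main differ only in the move-task flag, 
which is exactly the difference between the CTAS and the plain SELECT case.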

Is my understanding correct, and if so, what is the reasoning behind not 
supporting this for regular SELECT queries? It seems to me that it should be 
supported for them as well. One scenario where this hits us hard is when users 
try to download the result in HUE, and HUE times out because there are 
thousands of output files. The workaround is to re-run the query as CTAS, but 
that is a significant time sink.





--
This message was sent by Atlassian JIRA
(v6.1.5#6160)