[ 
https://issues.apache.org/jira/browse/HIVE-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566207#comment-14566207
 ] 

Lefty Leverenz commented on HIVE-7810:
--------------------------------------

Adding TODOC15 (which means TODOC1.1.0).

Besides documenting *hive.merge.sparkfiles* in Configuration Properties, usage 
notes should be included in the HoS doc.  Also see HIVE-8043, Support merging 
small files.
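
For the usage notes, a minimal sketch of the session settings might look like the following. It assumes Hive on Spark is already configured and that the size-related merge properties are left at their defaults; it is an illustration, not a tested recipe:

{noformat}
-- run queries on the Spark execution engine
set hive.execution.engine=spark;
-- ask Hive to merge small output files at the end of a Spark job
set hive.merge.sparkfiles=true;
{noformat}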

* [Hive on Spark: Getting Started | 
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started]
* [Configuration Properties -- Spark | 
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Spark]
with cross-references to and from:
** [hive.merge.mapfiles | 
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.merge.mapfiles]
** [hive.merge.mapredfiles | 
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.merge.mapredfiles]
** and maybe [hive.optimize.union.remove | 
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.optimize.union.remove]
 (see following question)

Does *hive.merge.sparkfiles* affect *hive.optimize.union.remove* like 
*hive.merge.mapfiles* and *hive.merge.mapredfiles*?

bq.  The merge is triggered if either of hive.merge.mapfiles or 
hive.merge.mapredfiles is set to true. If the user has set hive.merge.mapfiles 
to true and hive.merge.mapredfiles to false, the idea was that the number of 
reducers are few, so the number of files anyway is small. However, with this 
optimization, we are increasing the number of files possibly by a big margin. 
So, we merge aggressively.
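
As an illustration only, the case described in that passage corresponds to settings along these lines (the combination is a reading of the quoted text, not a tested configuration; hive.mapred.supports.subdirectories is included because the issue description sets it):

{noformat}
set hive.optimize.union.remove=true;
set hive.mapred.supports.subdirectories=true;
-- merge enabled for map-only jobs, disabled for map-reduce jobs
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=false;
{noformat}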


> Insert overwrite table query has strange behavior when set 
> hive.optimize.union.remove=true [Spark Branch]
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-7810
>                 URL: https://issues.apache.org/jira/browse/HIVE-7810
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Na Yang
>            Assignee: Na Yang
>              Labels: TODOC-SPARK, TODOC15
>             Fix For: 1.1.0
>
>         Attachments: HIVE-7810.1-spark.patch
>
>
> An insert overwrite table query behaves unexpectedly with the following settings:
> set hive.optimize.union.remove=true;
> set hive.mapred.supports.subdirectories=true;
> set hive.merge.mapfiles=true;
> set hive.merge.mapredfiles=true;
> We expect the following two sets of queries to return the same data, 
> but they do not. 
> 1)
> {noformat}
> insert overwrite table outputTbl1
> SELECT * FROM
> (
> select key, 1 as values from inputTbl1
> union all
> select * FROM (
>   SELECT key, count(1) as values from inputTbl1 group by key
>   UNION ALL
>   SELECT key, 2 as values from inputTbl1
> ) a
> )b;
> select * from outputTbl1 order by key, values;
> {noformat}
> Below is the query result:
> {noformat}
> 1     1
> 1     2
> 2     1
> 2     2
> 3     1
> 3     2
> 7     1
> 7     2
> 8     2
> 8     2
> 8     2
> {noformat}
> 2) 
> {noformat}
> SELECT * FROM
> (
> select key, 1 as values from inputTbl1
> union all
> select * FROM (
>   SELECT key, count(1) as values from inputTbl1 group by key
>   UNION ALL
>   SELECT key, 2 as values from inputTbl1
> ) a
> )b order by key, values;
> {noformat}
> Below is the query result:
> {noformat}
> 1     1
> 1     1
> 1     2
> 2     1
> 2     1
> 2     2
> 3     1
> 3     1
> 3     2
> 7     1
> 7     1
> 7     2
> 8     1
> 8     1
> 8     2
> 8     2
> 8     2
> {noformat}
> Some rows are missing from the first result (11 rows versus 17 in the second). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
