[ 
https://issues.apache.org/jira/browse/HIVE-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chengxiang Li updated HIVE-10550:
---------------------------------
    Attachment: HIVE-10550.3-spark.patch

After discussing with Xuefu, this patch only caches RDDs that multiple other 
RDDs in the JobGraph depend on. MapInput caching is excluded from this patch, 
as it is very complex and yields little performance gain.
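
As a rough illustration of the selection rule described above (not the code in 
the attached patch), the following sketch picks the nodes of a dependency 
graph that more than one other node reads from. The dictionary layout, the 
function name rdds_to_cache, and the node names are all hypothetical; the 
real patch works on Hive's own job graph classes.

```python
from collections import defaultdict


def rdds_to_cache(dependencies):
    """Return the nodes that more than one other node depends on.

    `dependencies` maps each node name to the list of nodes it reads from.
    This is a hypothetical stand-in for the edges of the job graph.
    """
    dependents = defaultdict(set)
    for node, parents in dependencies.items():
        for parent in parents:
            dependents[parent].add(node)
    # Only nodes with two or more distinct dependents are worth caching.
    return sorted(p for p, children in dependents.items() if len(children) > 1)


# Hypothetical self-join plan: both filter branches read the same aggregate.
job_graph = {
    "agg": ["scan"],
    "filter_a": ["agg"],
    "filter_b": ["agg"],
    "join": ["filter_a", "filter_b"],
}
print(rdds_to_cache(job_graph))
```

In this toy graph only "agg" is selected, because "scan" has a single 
dependent and so gains nothing from being cached.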

> Dynamic RDD caching optimization for HoS.[Spark Branch]
> -------------------------------------------------------
>
>                 Key: HIVE-10550
>                 URL: https://issues.apache.org/jira/browse/HIVE-10550
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Chengxiang Li
>            Assignee: Chengxiang Li
>         Attachments: HIVE-10550.1-spark.patch, HIVE-10550.1.patch, 
> HIVE-10550.2-spark.patch, HIVE-10550.3-spark.patch
>
>
> A Hive query may scan the same table multiple times, for example in a 
> self-join or self-union, or when the same subquery is shared; [TPC-DS 
> Q39|https://github.com/hortonworks/hive-testbench/blob/hive14/sample-queries-tpcds/query39.sql]
>  is an example. Spark supports caching RDD data: it keeps the computed RDD 
> data in memory and reads it directly from memory the next time it is 
> needed. This avoids the cost of recomputing the RDD (and all of its 
> dependencies) at the cost of more memory usage. By analyzing the query 
> context, we should be able to determine which parts of a query can be 
> shared, so that the generated Spark job can reuse the cached RDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
