[ https://issues.apache.org/jira/browse/HIVE-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336431#comment-16336431 ]
liyunzhang commented on HIVE-8436:
----------------------------------

[~csun]: thanks for the reply.
{quote}
without the copying function, the RDD cache will cache *references*
{quote}
I have not found this in the Spark [documentation|https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#rdd-persistence], i.e. that the RDD cache stores references rather than values. If you have time, can you provide a link that explains it?
{code:java}
private static class CopyFunction
    implements PairFunction<Tuple2<WritableComparable, Writable>, WritableComparable, Writable> {

  private transient Configuration conf;

  @Override
  public Tuple2<WritableComparable, Writable> call(Tuple2<WritableComparable, Writable> tuple)
      throws Exception {
    if (conf == null) {
      conf = new Configuration();
    }
    return new Tuple2<WritableComparable, Writable>(tuple._1(),
        WritableUtils.clone(tuple._2(), conf));
  }
}
{code}
{{WritableUtils.clone(tuple._2(), conf)}} is used to clone {{tuple._2()}} into a new object, which means {{tuple._2()}} must be an instance of a class that can be cloned this way. For the {{Text}} type this works. For the ORC and Parquet formats it does not, because of [HIVE-18289|https://issues.apache.org/jira/browse/HIVE-18289]: {{OrcStruct}} has no empty constructor, so the {{ReflectionUtils.newInstance}} call inside the clone fails for ORC, and the Parquet format fails for a similar reason. Is there any way to solve this? (See the sketch at the end of this message.)

> Modify SparkWork to split works with multiple child works [Spark Branch]
> ------------------------------------------------------------------------
>
>                 Key: HIVE-8436
>                 URL: https://issues.apache.org/jira/browse/HIVE-8436
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Chao Sun
>            Priority: Major
>             Fix For: 1.1.0
>
>         Attachments: HIVE-8436.1-spark.patch, HIVE-8436.10-spark.patch, HIVE-8436.11-spark.patch, HIVE-8436.2-spark.patch, HIVE-8436.3-spark.patch, HIVE-8436.4-spark.patch, HIVE-8436.5-spark.patch, HIVE-8436.6-spark.patch, HIVE-8436.7-spark.patch, HIVE-8436.8-spark.patch, HIVE-8436.9-spark.patch
>
>
> Based on the design doc, we need to split the operator tree of a work in SparkWork if the work is connected to multiple child works. The splitting of the operator tree is performed by cloning the original work and removing unwanted branches in the operator tree. Please refer to the design doc for details.
> This process should be done right before we generate SparkPlan. We should have a utility method that takes the original SparkWork and returns a modified SparkWork.
> This process should also keep the information about the original work and its clones. Such information will be needed during SparkPlan generation (HIVE-8437).
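To make the two points above concrete, here is a small sketch. It is not code from any patch: the class and method names ({{CacheCopySketch}}, {{cacheWithCopy}}, {{cloneValue}}) are made up, {{CopyFunction}} is assumed to be visible here rather than a private inner class, and {{cloneValue}} is only my paraphrase of what {{WritableUtils.clone}} does internally, which is the step where {{OrcStruct}} breaks because {{ReflectionUtils.newInstance}} needs a no-argument constructor.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.storage.StorageLevel;

public class CacheCopySketch {

  // Hadoop RecordReaders reuse the same key/value objects for every record, so an
  // RDD read from a Hadoop InputFormat hands out references to objects that are
  // overwritten as the reader advances. If those references went into the cache
  // as-is, every cached element could end up pointing at the same reused object.
  static JavaPairRDD<WritableComparable, Writable> cacheWithCopy(
      JavaPairRDD<WritableComparable, Writable> input) {
    return input
        .mapToPair(new CopyFunction())        // the CopyFunction quoted above (assumed accessible)
        .persist(StorageLevel.MEMORY_ONLY()); // the cache now holds independent copies
  }

  // Rough paraphrase of what WritableUtils.clone(value, conf) does internally.
  static Writable cloneValue(Writable value, Configuration conf) {
    // Step 1: create an empty instance of the same class. This is the step that
    // fails for OrcStruct (and, per the comment above, a Parquet wrapper fails
    // similarly): ReflectionUtils.newInstance requires a no-argument constructor,
    // which OrcStruct does not provide.
    Writable copy = ReflectionUtils.newInstance(value.getClass(), conf);
    // Step 2 (elided here): WritableUtils.clone then serializes the original and
    // deserializes it into the new instance, producing a deep copy.
    return copy;
  }
}
{code}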