Hi Liang-Chi,

Yes, the split logic is needed in compute(), and the preferred locations can be derived from the customized Partition class.
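A minimal sketch of that approach (the class names SplitPartition and SplitRDD are hypothetical; this assumes Spark's developer API for RDD, Partition, and NarrowDependency, and is a sketch rather than a tuned implementation):

```scala
import scala.reflect.ClassTag

import org.apache.spark.{NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical Partition class: each child partition remembers its parent
// partition and which half of it (0 = first half, 1 = second half) to emit.
class SplitPartition(override val index: Int,
                     val parent: Partition,
                     val half: Int) extends Partition

// Hypothetical RDD that doubles the number of partitions of `prev` while
// keeping a narrow dependency (children 2i and 2i+1 both read parent i).
class SplitRDD[T: ClassTag](prev: RDD[T])
  extends RDD[T](prev.context, Seq(new NarrowDependency[T](prev) {
    override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
  })) {

  override def getPartitions: Array[Partition] =
    prev.partitions.flatMap { p =>
      Seq(new SplitPartition(2 * p.index, p, 0),
          new SplitPartition(2 * p.index + 1, p, 1))
    }

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val sp = split.asInstanceOf[SplitPartition]
    // Buffer the parent partition so its midpoint is known; this buffering
    // is the main efficiency cost of the approach.
    val elems = firstParent[T].iterator(sp.parent, context).toArray
    val mid = elems.length / 2
    if (sp.half == 0) elems.iterator.take(mid) else elems.iterator.drop(mid)
  }

  // Both children advertise their parent's preferred locations, which is
  // what keeps the two halves on the same node as the parent partition.
  override def getPreferredLocations(split: Partition): Seq[String] =
    prev.preferredLocations(split.asInstanceOf[SplitPartition].parent)
}
```

Note that compute() holds one whole parent partition in memory to find its midpoint; if that is too costly, an alternative is to record the parent's element count in the Partition object so each half can be streamed.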
Thanks for your help!

Cheers,
Fei

On Mon, Jan 16, 2017 at 3:00 AM, Liang-Chi Hsieh <vii...@gmail.com> wrote:

> Hi Fei,
>
> I think it should work, but you may need to add some logic in compute() to
> decide which half of the parent partition to output, and you need to
> return the correct preferred locations for the partitions sharing the same
> parent partition.
>
> Fei Hu wrote:
>
> > Hi Liang-Chi,
> >
> > Yes, you are right. I implemented the following solution for this
> > problem, and it works, but I am not sure whether it is efficient:
> >
> > I double the partitions of the parent RDD, and then use the new
> > partitions and the parent RDD to construct the target RDD. In the
> > compute() function of the target RDD, I use the input partition to find
> > the corresponding parent partition, and return half of the elements of
> > that parent partition as the output of the compute function.
> >
> > Thanks,
> > Fei
> >
> > On Sun, Jan 15, 2017 at 11:01 PM, Liang-Chi Hsieh <viirya@...> wrote:
> >
> >> Hi,
> >>
> >> When calling `coalesce` with `shuffle = false`, it is going to produce
> >> at most min(numPartitions, previous RDD's number of partitions)
> >> partitions. So I think it can't be used to double the number of
> >> partitions.
> >>
> >> Anastasios Zouzias wrote:
> >>
> >> > Hi Fei,
> >> >
> >> > Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?
> >> >
> >> > https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395
> >> >
> >> > coalesce is mostly used for reducing the number of partitions before
> >> > writing to HDFS, but it might still be a narrow dependency (satisfying
> >> > your requirements) if you increase the # of partitions.
> >> >
> >> > Best,
> >> > Anastasios
> >> >
> >> > On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hufei68@...> wrote:
> >> >
> >> >> Dear all,
> >> >>
> >> >> I want to equally divide an RDD partition into two partitions. That
> >> >> means the first half of the elements in the partition will create a
> >> >> new partition, and the second half of the elements will generate
> >> >> another new partition. But the two new partitions are required to be
> >> >> on the same node as their parent partition, which helps achieve high
> >> >> data locality.
> >> >>
> >> >> Does anyone know how to implement this, or have any hints for it?
> >> >>
> >> >> Thanks in advance,
> >> >> Fei
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
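As a footnote to the thread: stripped of the Spark machinery, the half-selection inside compute() is just slicing a buffered partition in two. A dependency-free sketch (the helper name halfOf is made up for illustration):

```scala
// Hypothetical helper: return the requested half (0 = first, 1 = second)
// of a parent partition's elements. With an odd count, integer division
// puts the extra element in the second half.
def halfOf[T](parentElems: Seq[T], half: Int): Seq[T] = {
  val mid = parentElems.length / 2
  if (half == 0) parentElems.take(mid) else parentElems.drop(mid)
}

// The two halves together always rebuild the parent partition, e.g.
// halfOf(Seq(1, 2, 3, 4, 5), 0) == Seq(1, 2)
// halfOf(Seq(1, 2, 3, 4, 5), 1) == Seq(3, 4, 5)
```

Because both halves are computed from the same buffered parent, each parent partition ends up materialized by two tasks unless the result is cached, which is another reason the efficiency question raised above is a fair one.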