Hi Liang-Chi,

Yes, the split logic is needed in compute(). The preferred locations can be
derived from the customized Partition class.
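For anyone following along, here is a minimal sketch of that approach. It is only an illustration, not a tested implementation: `SplitRDD` and `HalfPartition` are hypothetical names, and it assumes each parent partition fits in memory so it can be materialized inside compute():

```scala
import scala.reflect.ClassTag

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition class: each new partition remembers its parent
// partition and which half (0 = first half, 1 = second half) it emits.
case class HalfPartition(index: Int, parent: Partition, half: Int) extends Partition

class SplitRDD[T: ClassTag](prev: RDD[T]) extends RDD[T](prev) {

  // Double the partitions: parent partition i becomes partitions 2i and 2i+1.
  override def getPartitions: Array[Partition] =
    firstParent[T].partitions.flatMap { p =>
      Seq(HalfPartition(2 * p.index, p, half = 0),
          HalfPartition(2 * p.index + 1, p, half = 1))
    }

  // Materialize the parent partition and emit only the requested half.
  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val hp = split.asInstanceOf[HalfPartition]
    val elems = firstParent[T].iterator(hp.parent, context).toArray
    val mid = elems.length / 2
    if (hp.half == 0) elems.iterator.take(mid) else elems.iterator.drop(mid)
  }

  // Both halves prefer the same nodes as their parent partition,
  // which preserves data locality.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    firstParent[T].preferredLocations(split.asInstanceOf[HalfPartition].parent)
}
```

One caveat: because each half re-runs the parent iterator, every parent partition is computed twice unless the parent RDD is cached.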

Thanks for your help!

Cheers,
Fei


On Mon, Jan 16, 2017 at 3:00 AM, Liang-Chi Hsieh <vii...@gmail.com> wrote:

>
> Hi Fei,
>
> I think it should work. But you may need to add some logic in compute() to
> decide which half of the parent partition to output. And you need to
> return the correct preferred locations for the partitions sharing the same
> parent partition.
>
>
> Fei Hu wrote
> > Hi Liang-Chi,
> >
> > Yes, you are right. I implemented the following solution for this
> > problem, and it works, but I am not sure whether it is efficient:
> >
> > I doubled the partitions of the parent RDD, and then used the new
> > partitions and the parent RDD to construct the target RDD. In the
> > compute() function of the target RDD, I use the input partition to get
> > the corresponding parent partition, and take half of the elements in the
> > parent partition as the output of the compute function.
> >
> > Thanks,
> > Fei
> >
> > On Sun, Jan 15, 2017 at 11:01 PM, Liang-Chi Hsieh <viirya@> wrote:
> >
> >>
> >> Hi,
> >>
> >> When calling `coalesce` with `shuffle = false`, it is going to produce
> >> at most min(numPartitions, previous RDD's number of partitions)
> >> partitions. So I think it can't be used to double the number of
> >> partitions.
> >>
> >>
> >> Anastasios Zouzias wrote
> >> > Hi Fei,
> >> >
> >> > Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?
> >> >
> >> > https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395
> >> >
> >> > coalesce is mostly used for reducing the number of partitions before
> >> > writing to HDFS, but it might still be a narrow dependency (satisfying
> >> > your requirements) if you increase the # of partitions.
> >> >
> >> > Best,
> >> > Anastasios
> >> >
> >> > On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hufei68@> wrote:
> >> >
> >> >> Dear all,
> >> >>
> >> >> I want to equally divide an RDD partition into two partitions. That
> >> >> is, the first half of the elements in the partition will form one
> >> >> new partition, and the second half will form another new partition.
> >> >> But the two new partitions are required to be on the same node as
> >> >> their parent partition, which helps achieve high data locality.
> >> >>
> >> >> Does anyone know how to implement this, or have any hints for it?
> >> >>
> >> >> Thanks in advance,
> >> >> Fei
> >> >>
> >> >>
> >> >
> >> >
> >> > --
> >> > -- Anastasios Zouzias <azo@.ibm>
> >>
> >>
> >>
> >>
> >>
> >> -----
> >> Liang-Chi Hsieh | @viirya
> >> Spark Technology Center
> >> http://www.spark.tc/
> >> --
> >> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Equally-split-a-RDD-partition-into-two-partition-at-the-same-node-tp20597p20608.html
> >> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: dev-unsubscribe@.apache
>
>
>
>
>
