Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Pradeep Gollakota
on, I don’t think >> RDD number of partitions will be increased. >> >> >> >> Thanks, >> >> Jasbir >> >> >> >> *From:* Fei Hu [mailto:hufe...@gmail.com] >> *Sent:* Sunday, January 15, 2017 10:10 PM >> *To:* zouz...@cs.toronto.edu >

Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Fei Hu
> >>> >>> >>> *From:* Fei Hu [mailto:hufe...@gmail.com] >>> *Sent:* Sunday, January 15, 2017 10:10 PM >>> *To:* zouz...@cs.toronto.edu >>> *Cc:* user @spark ; dev@spark.apache.org >>> *Subject:* Re: Equally split a RDD partition into two p

Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Fei Hu
Hi Liang-Chi, Yes, the split logic is needed in compute(). The preferred locations can be derived from the customized Partition class. Thanks for your help! Cheers, Fei On Mon, Jan 16, 2017 at 3:00 AM, Liang-Chi Hsieh wrote: > > Hi Fei, > > I think it should work. But you may need to add some

Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Liang-Chi Hsieh
Hi Fei, I think it should work, but you may need to add some logic in compute() to decide which half of the parent partition to output. You also need to return the correct preferred locations for the partitions that share the same parent partition. Fei Hu wrote > Hi Liang-Chi, > > Yes, you a
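The two points above (choosing a half inside compute() and reusing the parent's preferred locations) can be modelled without Spark. The following is only a sketch in plain Python, not the code from this thread: the function names, the even/odd convention for child indices, and the host names in the usage note are all assumptions made for illustration.

```python
def parent_index(child_index):
    # Assumed convention: children 2p and 2p+1 both come from parent p.
    return child_index // 2


def split_in_half(parent_rows, child_index):
    """Return the half of a parent partition that one child should output.

    Even child indices take the first half, odd ones the second half.
    Assumption: the parent partition fits in memory, so it can be
    materialised to find its midpoint.
    """
    rows = list(parent_rows)
    mid = (len(rows) + 1) // 2  # the first half gets the extra row if the size is odd
    return rows[:mid] if child_index % 2 == 0 else rows[mid:]


def preferred_locations(child_index, parent_hosts):
    # Both children inherit the hosts of their shared parent partition,
    # which is what keeps the two halves on the same node.
    return parent_hosts[parent_index(child_index)]
```

For example, with two parent partitions located on the hypothetical hosts `[["node-a"], ["node-b"]]`, children 2 and 3 both map to parent 1 and therefore both report `["node-b"]`. In a real custom RDD the same decisions would live in its compute() and preferred-location methods.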

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
don’t think > RDD number of partitions will be increased. > > > > Thanks, > > Jasbir > > > > *From:* Fei Hu [mailto:hufe...@gmail.com] > *Sent:* Sunday, January 15, 2017 10:10 PM > *To:* zouz...@cs.toronto.edu > *Cc:* user @spark ; dev@spark.apache.org > *Su

RE: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread jasbir.sing
: zouz...@cs.toronto.edu Cc: user @spark ; dev@spark.apache.org Subject: Re: Equally split a RDD partition into two partition at the same node Hi Anastasios, Thanks for your reply. If I just double numPartitions, how does coalesce(numPartitions: Int, shuffle: Boolean = false

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Liang-Chi, Yes, you are right. I implemented the following solution for this problem, and it works, but I am not sure whether it is efficient: I double the partitions of the parent RDD, and then use the new partitions and the parent RDD to construct the target RDD. In the compute() function of the target

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Liang-Chi Hsieh
Hi, When `coalesce` is called with `shuffle = false`, it produces at most min(numPartitions, previous RDD's number of partitions) partitions. So I think it can't be used to double the number of partitions. Anastasios Zouzias wrote > Hi Fei, > > Have you tried coalesce(numPartitions: Int, shuffle:
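The bound stated above is simple arithmetic, shown here as a small Python sketch. The helper name is hypothetical and this is not Spark code; it only restates the min(...) rule from the message for the `shuffle = false` case.

```python
def max_partitions_without_shuffle(requested, parent_count):
    # Upper bound described in the message: with shuffle = false,
    # coalesce cannot yield more partitions than the parent RDD has.
    return min(requested, parent_count)
```

Applied to the 24-partition RDD discussed in this thread, requesting 48 partitions is still capped at 24, which is why this route cannot double the partition count.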

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Anastasios, Thanks for your information. I will look into the CoalescedRDD code. Thanks, Fei On Sun, Jan 15, 2017 at 12:21 PM, Anastasios Zouzias wrote: > Hi Fei, > > I looked at the code of CoalescedRDD and probably what I suggested will > not work. > > Speaking of which, CoalescedRDD is p

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Anastasios Zouzias
Hi Fei, I looked at the code of CoalescedRDD and probably what I suggested will not work. Speaking of which, CoalescedRDD is private[spark]. If this was not the case, you could set balanceSlack to 1, and get what you requested, see https://github.com/apache/spark/blob/branch-1.6/core/src/main/sc

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Anastasios, Thanks for your reply. If I just double numPartitions, how does coalesce(numPartitions: Int, shuffle: Boolean = false) keep the data locality? Do I need to define my own Partitioner? Thanks, Fei On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias wrote: > Hi

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Rishi, Thanks for your reply! The RDD has 24 partitions, and the cluster has a master node + 24 computing nodes (12 cores per node). Each node will have one partition, and I want to split each partition into two sub-partitions on the same node to improve the parallelism and achieve high data locali

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Anastasios Zouzias
Hi Fei, Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)? https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395 coalesce is mostly used for reducing the number of partitions before writing to HDFS, but it might still be a nar