Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Fei Hu
>>> *From:* Fei Hu [mailto:hufe...@gmail.com]
>>> *Sent:* Sunday, January 15, 2017 10:10 PM
>>> *To:* zouz...@cs.toronto.edu
>>> *Cc:* user @spark; d...@spark.apache.org
>>> *Subject:* Re: Equally split a RDD partition into two p

Re: Equally split a RDD partition into two partition at the same node

2017-01-16 Thread Pradeep Gollakota
>> on, I don’t think RDD number of partitions will be increased.
>>
>> Thanks,
>> Jasbir
>>
>> *From:* Fei Hu [mailto:hufe...@gmail.com]
>> *Sent:* Sunday, January 15, 2017 10:10 PM
>> *To:* zouz...@cs.toronto.edu

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
> don’t think RDD number of partitions will be increased.
>
> Thanks,
> Jasbir
>
> *From:* Fei Hu [mailto:hufe...@gmail.com]
> *Sent:* Sunday, January 15, 2017 10:10 PM
> *To:* zouz...@cs.toronto.edu
> *Cc:* user @spark; d...@spark.apache.org

RE: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread jasbir.sing
: zouz...@cs.toronto.edu
Cc: user @spark; d...@spark.apache.org
Subject: Re: Equally split a RDD partition into two partition at the same node

Hi Anastasios,

Thanks for your reply. If I just double numPartitions, how does coalesce(numPartitions: Int, shuffle: Boolean = false

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Anastasios,

Thanks for your information. I will look into the CoalescedRDD code.

Thanks,
Fei

On Sun, Jan 15, 2017 at 12:21 PM, Anastasios Zouzias wrote:
> Hi Fei,
>
> I looked at the code of CoalescedRDD and probably what I suggested will
> not work.
>
> Speaking of which, CoalescedRDD is p

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Anastasios Zouzias
Hi Fei,

I looked at the code of CoalescedRDD and probably what I suggested will not work.

Speaking of which, CoalescedRDD is private[spark]. If this were not the case, you could set balanceSlack to 1 and get what you requested; see https://github.com/apache/spark/blob/branch-1.6/core/src/main/sc

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Anastasios,

Thanks for your reply. If I just double numPartitions, how does coalesce(numPartitions: Int, shuffle: Boolean = false) preserve data locality? Do I need to define my own Partitioner?

Thanks,
Fei

On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias wrote:
> Hi

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Rishi,

Thanks for your reply! The RDD has 24 partitions, and the cluster has a master node plus 24 computing nodes (12 cores per node). Each node holds one partition, and I want to split each partition into two sub-partitions on the same node to improve parallelism and achieve high data locali

Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Anastasios Zouzias
Hi Fei,

Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?

https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395

coalesce is mostly used for reducing the number of partitions before writing to HDFS, but it might still be a nar

Re: Equally split a RDD partition into two partition at the same node

2017-01-14 Thread Rishi Yadav
Can you provide some more details:

1. How many partitions does the RDD have?
2. How big is the cluster?

On Sat, Jan 14, 2017 at 3:59 PM Fei Hu wrote:
> Dear all,
>
> I want to equally divide an RDD partition into two partitions. That means,
> the first half of elements in the partition will create a new
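
For readers skimming the thread, the split asked about in the original question can be modelled in a few lines of plain Python. This is only a sketch of the intended semantics, not Spark code and not anything proposed in the thread: the function name `split_in_place` and the node labels are hypothetical, and each partition is represented as a `(node, elements)` pair, where `node` stands in for the partition's preferred location.

```python
from typing import List, Tuple

# A partition is modelled as (node, elements). "node" is an assumed stand-in
# for the preferred location that Spark tracks for each partition.
Partition = Tuple[str, List[int]]


def split_in_place(partitions: List[Partition]) -> List[Partition]:
    """Split every partition into two halves that keep the parent's node."""
    result: List[Partition] = []
    for node, elements in partitions:
        # With an odd element count, the first half receives the extra element.
        mid = (len(elements) + 1) // 2
        result.append((node, elements[:mid]))  # first half, same node
        result.append((node, elements[mid:]))  # second half, same node
    return result


if __name__ == "__main__":
    rdd = [("node-1", [1, 2, 3, 4]), ("node-2", [5, 6, 7])]
    for node, elements in split_in_place(rdd):
        print(node, elements)
```

In Spark itself, one plausible route to the same effect would be a custom RDD subclass whose getPartitions returns two child partitions per parent and whose getPreferredLocations delegates to the parent partition. That is an assumption about how it could be done, not something confirmed in the thread.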