Hi Liang-Chi,
Yes, the split logic is needed in compute(). The preferred locations can be
derived from the customized Partition class.
Thanks for your help!
Cheers,
Fei
On Mon, Jan 16, 2017 at 3:00 AM, Liang-Chi Hsieh wrote:
Hi Fei,
I think it should work. But you may need to add some logic in compute() to
decide which half of the parent partition to output. And you need to get the
correct preferred locations for the partitions sharing the same parent
partition.
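Liang-Chi's two requirements (a child partition that knows its parent and its half, and preferred locations inherited from the parent) can be sketched in plain Scala. This is a toy model, not Spark API: `Partition` here is a stand-in for `org.apache.spark.Partition`, and `SplitPartition`, `splitPartitions`, and `preferredLocations` are hypothetical names for illustration.

```scala
// Stand-in for org.apache.spark.Partition; a real implementation would
// extend the Spark trait instead of this local one.
trait Partition extends Serializable { def index: Int }

case class ParentPartition(index: Int) extends Partition

// One half of a parent partition. Two children share one parent:
// children 2*p and 2*p + 1 both point at parent p.
case class SplitPartition(index: Int, parent: ParentPartition, firstHalf: Boolean)
    extends Partition

// Build the doubled partition array from the parent's partitions.
def splitPartitions(parents: Array[ParentPartition]): Array[SplitPartition] =
  parents.flatMap { p =>
    Array(SplitPartition(2 * p.index, p, firstHalf = true),
          SplitPartition(2 * p.index + 1, p, firstHalf = false))
  }

// Preferred locations: each child simply inherits its parent's locations,
// so both halves stay on the node that already holds the parent's data.
def preferredLocations(child: SplitPartition,
                       parentLocs: Int => Seq[String]): Seq[String] =
  parentLocs(child.parent.index)
```

Because children 2*p and 2*p + 1 map back to parent p, an overridden getPreferredLocations can just delegate to the parent RDD and both halves remain node-local.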
> ...I don’t think RDD number of partitions will be increased.
>
> Thanks,
>
> Jasbir
>
> *From:* Fei Hu [mailto:hufe...@gmail.com]
> *Sent:* Sunday, January 15, 2017 10:10 PM
> *To:* zouz...@cs.toronto.edu
> *Cc:* user @spark ; dev@spark.apache.org
> *Su
*To:* zouz...@cs.toronto.edu
*Cc:* user @spark ; dev@spark.apache.org
*Subject:* Re: Equally split a RDD partition into two partition at the same node
Hi Anastasios,
Thanks for your reply. If I just increase the numPartitions to be twice larger,
how does coalesce(numPartitions: Int, shuffle: Boolean = false) keep the data
locality?
Hi Liang-Chi,
Yes, you are right. I implemented the following solution for this problem,
and it works, but I am not sure whether it is efficient:
I double the partitions of the parent RDD, and then use the new partitions
and the parent RDD to construct the target RDD. In the compute() function of
the target
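The compute() body Fei describes might look like the following sketch. One assumption to flag: it splits by even/odd element position rather than a contiguous midpoint, because an iterator's length is not known until it is consumed; `parentIter` and `computeHalf` are illustrative stand-ins, where `parentIter` plays the role of `firstParent[T].iterator(split.parent, context)` in a real RDD subclass.

```scala
// Sketch of a compute() body for a child partition that takes one
// "half" of its parent partition's data.
def computeHalf[T](parentIter: Iterator[T], firstHalf: Boolean): Iterator[T] =
  parentIter.zipWithIndex.collect {
    // Elements at even positions go to the first child, odd positions to
    // the second, so the two children split the parent roughly equally
    // without buffering the whole partition in memory.
    case (elem, i) if (i % 2 == 0) == firstHalf => elem
  }
```

For an odd-sized parent partition the first child simply gets one extra element.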
Hi,
When calling `coalesce` with `shuffle = false`, it will produce at most
min(numPartitions, the previous RDD's number of partitions) partitions. So I
think it can't be used to double the number of partitions.
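As a quick sanity check of that rule (a toy function modelling the count, not Spark code; `coalescedCount` is a made-up name):

```scala
// coalesce(n, shuffle = false) only merges existing partitions; it never
// splits them, so the result has at most the parent's partition count.
def coalescedCount(requested: Int, parentPartitions: Int): Int =
  math.min(requested, parentPartitions)
```

So asking a 24-partition RDD for 48 partitions without a shuffle still yields 24.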
Anastasios Zouzias wrote:
> Hi Fei,
>
> Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?
Hi Anastasios,
Thanks for your information. I will look into the CoalescedRDD code.
Thanks,
Fei
On Sun, Jan 15, 2017 at 12:21 PM, Anastasios Zouzias wrote:
Hi Fei,
I looked at the code of CoalescedRDD, and what I suggested probably will not
work.
Speaking of which, CoalescedRDD is private[spark]. If this were not the case,
you could set balanceSlack to 1 and get what you requested; see
https://github.com/apache/spark/blob/branch-1.6/core/src/main/sc
Hi Anastasios,
Thanks for your reply. If I just increase the numPartitions to be twice
larger, how does coalesce(numPartitions: Int, shuffle: Boolean = false) keep
the data locality? Do I need to define my own Partitioner?
Thanks,
Fei
On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias wrote:
Hi Rishi,
Thanks for your reply! The RDD has 24 partitions, and the cluster has a
master node plus 24 computing nodes (12 cores per node). Each node holds one
partition, and I want to split each partition into two sub-partitions on the
same node to improve parallelism and achieve high data locality.
Hi Fei,
Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?
https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395
coalesce is mostly used for reducing the number of partitions before
writing to HDFS, but it might still be a narrow dependency.
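To illustrate the narrow-dependency shape of a no-shuffle coalesce, here is a toy model (`coalesceGroups` is a made-up name; Spark's real DefaultPartitionCoalescer also weighs locality and a balance-slack factor, which this deliberately ignores):

```scala
// Model of a no-shuffle coalesce: contiguous parent partitions are grouped
// into the requested number of buckets, each child reading several parents
// in place -- a narrow dependency, with no data repartitioned across nodes.
def coalesceGroups(parentPartitions: Int, numPartitions: Int): Seq[Seq[Int]] = {
  val n = math.min(numPartitions, parentPartitions) // never more than parents
  (0 until parentPartitions)
    .groupBy(p => p * n / parentPartitions) // contiguous, near-equal buckets
    .toSeq.sortBy(_._1)
    .map(_._2.toSeq)
}
```

For example, coalescing 6 parent partitions down to 2 children groups parents 0-2 under the first child and 3-5 under the second, so each child task reads whole parent partitions rather than shuffled records.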