Re: Equally split a RDD partition into two partition at the same node

Fei Hu Sun, 15 Jan 2017 21:43:12 -0800

Hi Jasbir,

Yes, you are right. Do you have any idea about my question?


Thanks,
Fei

On Mon, Jan 16, 2017 at 12:37 AM, <[email protected]> wrote:

> Hi,
>
>
>
> Coalesce is used to decrease the number of partitions. If you give a
> numPartitions value greater than the current number of partitions, I don’t
> think the number of partitions of the RDD will be increased.
>
>
>
> Thanks,
>
> Jasbir
>
>
>
> *From:* Fei Hu [mailto:[email protected]]
> *Sent:* Sunday, January 15, 2017 10:10 PM
> *To:* [email protected]
> *Cc:* user @spark <[email protected]>; [email protected]
> *Subject:* Re: Equally split a RDD partition into two partition at the
> same node
>
>
>
> Hi Anastasios,
>
>
>
> Thanks for your reply. If I just double numPartitions, how does
> coalesce(numPartitions: Int, shuffle: Boolean = false) keep the data
> locality? Do I need to define my own Partitioner?
>
>
>
> Thanks,
>
> Fei
>
>
>
> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias <[email protected]>
> wrote:
>
> Hi Fei,
>
>
>
> Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?
>
>
>
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395
>
>
>
> coalesce is mostly used for reducing the number of partitions before
> writing to HDFS, but it might still be a narrow dependency (satisfying your
> requirements) if you increase the # of partitions.
>
>
>
> Best,
>
> Anastasios
>
>
>
> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <[email protected]> wrote:
>
> Dear all,
>
>
>
> I want to divide an RDD partition equally into two partitions. That is,
> the first half of the elements in the partition will form one new
> partition, and the second half will form another. The two new partitions
> must be on the same node as their parent partition, to preserve data
> locality.
>
>
>
> Does anyone know how to implement this, or have any hints?
>
>
>
> Thanks in advance,
>
> Fei
>
>
>
>
>
>
>
> --
>
> -- Anastasios Zouzias
>
>
>
>
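
For reference, one way to approach the original question is a small custom RDD
with a narrow dependency on its parent, so that each parent partition i feeds
child partitions 2i and 2i+1. The sketch below is only an illustration under
some assumptions: the names HalfDependency, HalfPartition, SplitInTwoRDD and
SplitDemo are hypothetical and not part of Spark, the parent is assumed to be
cached (otherwise each parent partition is computed twice), and each parent
partition is assumed to fit in memory because it is materialized to find the
midpoint. The preferred locations are a scheduling hint, not a guarantee.

```scala
import scala.reflect.ClassTag
import org.apache.spark.{NarrowDependency, Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper: child partitions 2i and 2i+1 depend only on parent partition i.
class HalfDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
}

// Hypothetical helper: one half of a parent partition.
class HalfPartition(val index: Int, val parent: Partition, val secondHalf: Boolean)
  extends Partition

// Hypothetical RDD that splits every parent partition into two halves.
class SplitInTwoRDD[T: ClassTag](prev: RDD[T])
  extends RDD[T](prev.context, Seq(new HalfDependency[T](prev))) {

  override protected def getPartitions: Array[Partition] =
    prev.partitions.flatMap { p =>
      Seq[Partition](
        new HalfPartition(2 * p.index, p, secondHalf = false),
        new HalfPartition(2 * p.index + 1, p, secondHalf = true))
    }

  // Ask the scheduler to place each half where its parent partition prefers to run.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    prev.preferredLocations(split.asInstanceOf[HalfPartition].parent)

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val half = split.asInstanceOf[HalfPartition]
    // Materializes the whole parent partition in order to locate the midpoint.
    val elems = prev.iterator(half.parent, context).toArray
    val (first, second) = elems.splitAt((elems.length + 1) / 2)
    if (half.secondHalf) second.iterator else first.iterator
  }
}

object SplitDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("split-demo")
    val sc = new SparkContext(conf)
    try {
      val parent = sc.parallelize(1 to 8, 2).cache()
      val split = new SplitInTwoRDD(parent)
      println(s"parent partitions: ${parent.getNumPartitions}")
      println(s"split partitions:  ${split.getNumPartitions}")
      split.glom().collect().foreach(a => println(a.mkString(",")))
    } finally {
      sc.stop()
    }
  }
}
```

Because the dependency is narrow, no shuffle is involved, and the scheduler can
also take the cached parent blocks into account when placing the tasks for the
two halves.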
