Re: Convert each partition of RDD to Dataframe

Enrico Minack Thu, 27 Feb 2020 07:22:17 -0800

Manjunath,

You can define your DataFrame in parallel in a multi-threaded driver.


Enrico

Am 27.02.20 um 15:50 schrieb Manjunath Shetty H:

Hi Enrico,

In that case how to make effective use of all nodes in the cluster ?.

And also whats your opinion on the below

  * Create 10 Dataframes sequentially in Driver program and
    transform/write to hdfs one after the other
  * Or the current approach mentioned in the previous mail

What will be the performance implications ?

Regards
Manjunath

------------------------------------------------------------------------
*From:* Enrico Minack <m...@enrico.minack.dev>
*Sent:* Thursday, February 27, 2020 7:57 PM
*To:* user@spark.apache.org <user@spark.apache.org>
*Subject:* Re: Convert each partition of RDD to Dataframe
Hi Manjunath,

why not creating 10 DataFrames loading the different tables in thefirst place?


Enrico


Am 27.02.20 um 14:53 schrieb Manjunath Shetty H:

Hi Vinodh,
Thanks for the quick response. Didn't got what you meant exactly, anyreference or snippet will be helpful.
To explain the problem more,

  * I have 10 partitions , each partition loads the data from
    different table and different SQL shard.
  * Most of the partitions will have different schema.
  * Before persisting the data i want to do some column level
    manipulation using data frame.
So thats why i want to create 10 (based on partitions ) dataframesthat maps to 10 different table/shard from a RDD.
Regards
Manjunath
------------------------------------------------------------------------
*From:* Charles vinodh <mig.flan...@gmail.com><mailto:mig.flan...@gmail.com>
*Sent:* Thursday, February 27, 2020 7:04 PM
*To:* manjunathshe...@live.com <mailto:manjunathshe...@live.com><manjunathshe...@live.com> <mailto:manjunathshe...@live.com>
*Cc:* user <user@spark.apache.org> <mailto:user@spark.apache.org>
*Subject:* Re: Convert each partition of RDD to Dataframe
Just split the single rdd into multiple individual rdds using afilter operation and then convert each individual rdds to it'srespective dataframe..
On Thu, Feb 27, 2020, 7:29 AM Manjunath Shetty H<manjunathshe...@live.com <mailto:manjunathshe...@live.com>> wrote:
    Hello All,

    In spark i am creating the custom partitions with Custom RDD,
    each partition will have different schema. Now in the
    transformation step we need to get the schema and run some
    Dataframe SQL queries per partition, because each partition data
    has different schema.

    How to get the Dataframe's per partition of a RDD?.

    As of now i am doing|foreachPartition|on RDD and
    converting|Iterable<Row>|to|List|and converting that to
    Dataframe. But the problem is converting|Iterable|to|List|will
    bring all the data to memory and it might crash the process.

    Is there any known way to do this ? or is there any way to handle
    Custom Partitions in|Dataframes|instead of using|RDD|?

    I am using Spark version|1.6.2|.

    Any pointers would be helpful. Thanks in advance

Re: Convert each partition of RDD to Dataframe

Reply via email to