Hi

There is a set of functions that can be used with the construct
over (partition by col order by col).

You can search for rank and window functions in the Spark documentation.
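
For example, a minimal sketch (assuming a DataFrame df with the Group_Id
and Points columns from your example below):

from pyspark.sql import Window
from pyspark.sql import functions as F

# df is assumed to be the DataFrame described in the question below
w = Window.partitionBy("Group_Id").orderBy("Points")
ranked = df.withColumn("rank", F.rank().over(w))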

On Mon, 1 Oct 2018 at 5:29 am, Riccardo Ferrari <ferra...@gmail.com> wrote:

> Hi Dimitris,
>
> I believe the methods partitionBy
> <https://spark.apache.org/docs/2.2.0/api/python/pyspark.html#pyspark.RDD.partitionBy>
> and mapPartitions
> <https://spark.apache.org/docs/2.2.0/api/python/pyspark.html#pyspark.RDD.mapPartitions>
> are specific to RDDs while you're talking about DataFrames
> <https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame>.
> I guess you have a few options, including:
> 1. Use the DataFrame.rdd
> <https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.rdd>
> call and process the returned RDD. Please note that the return type for
> this call is an RDD of Row.
> 2. Use the groupBy
> <https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy>
> from DataFrames and start from there; this may involve defining a UDF or
> leveraging the existing GroupedData
> <https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.GroupedData>
> functions. A rough sketch of both options follows this list.
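>
> An untested sketch of both options, assuming your Group_Id/Id/Points
> schema and a hypothetical per-group function process_group:
>
> from pyspark.sql import functions as F
>
> # Option 1: drop to the RDD API; each element of df.rdd is a Row
> def process_group(rows):
>     # rows is an iterable of Row objects sharing one Group_Id
>     return len(list(rows))  # e.g. count the points in the group
>
> per_group = (df.rdd
>              .keyBy(lambda row: row.Group_Id)  # (Group_Id, Row) pairs
>              .groupByKey()                     # (Group_Id, iterable of Rows)
>              .mapValues(process_group))
>
> # Option 2: stay in the DataFrame API with built-in GroupedData functions
> counts = df.groupBy("Group_Id").agg(F.count("Points").alias("n_points"))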
>
> It really depends on your use-case and your performance requirements.
> HTH
>
> On Sun, Sep 30, 2018 at 8:31 PM dimitris plakas <dimitrisp...@gmail.com>
> wrote:
>
>> Hello everyone,
>>
>> I am trying to split a dataframe into partitions, and I want to apply a
>> custom function on every partition. More precisely, I have a dataframe
>> like the one below:
>>
>> Group_Id | Id  | Points
>> 1        | id1 | Point1
>> 2        | id2 | Point2
>>
>> I want to have a partition for every Group_Id and to apply a function
>> defined by me on every partition.
>> I have tried partitionBy('Group_Id').mapPartitions(), but I receive an
>> error.
>> Could you please advise me how to do it?
>>
--
Best Regards,
Ayan Guha
