Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

Neil Chang Sat, 23 Jul 2016 21:42:57 -0700

One use case for dapply is running linear regression on many small
partitions.
I think R can do that with parallelism too, but I heard dapply is faster.
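
As a rough illustration of that use case, here is a plain-Python sketch
(not SparkR, so it runs without a cluster) that fits an ordinary
least-squares line independently on each partition. The fit_line helper
and the sample partitions are made up for illustration; with dapply the
function would receive each partition as an R data.frame instead.

```python
from statistics import mean


def fit_line(rows):
    """Ordinary least squares for y = slope * x + intercept on one partition."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical data: each inner list stands in for one small partition.
partitions = [
    [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],  # points on y = 2x + 1
    [(0.0, 4.0), (2.0, 2.0), (4.0, 0.0)],  # points on y = -x + 4
]

# One independent model per partition; no partition needs to see another's rows.
models = [fit_line(part) for part in partitions]
```

The function only ever sees the rows of one partition, so it does not
need to know how many partitions exist or how the data was split.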


On Friday, July 22, 2016, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:

> I haven't used SparkR/R before, only the Scala/Python APIs, so I don't
> know for sure.
>
> I am guessing that if things are in a DataFrame, they were either read
> from some disk source (S3/HDFS/file/etc.) or created with parallelize.
> In the first case, Spark will for the most part choose a reasonable
> number of partitions, while for parallelize I think it depends on what
> your min parallelism is set to.
>
> From a brief Google search, it looks like dapply is an analogue of
> mapPartitions. The usual reason to use this is that your map operation
> has some expensive initialization step. For example, if you need to open
> a connection to a database, it's better to reuse that connection for all
> of a partition's elements than to create it for each element.
>
> What are you trying to accomplish with dapply?
>
> On Fri, Jul 22, 2016 at 8:05 PM, Neil Chang <iam...@gmail.com> wrote:
>
>> Thanks Pedro,
>>   so to use SparkR dapply on a SparkDataFrame, don't we need to
>> partition the DataFrame first? The example in the docs doesn't seem to
>> do this. Without knowing how it is partitioned, how can one write the
>> function to process each partition?
>>
>> Neil
>>
>> On Fri, Jul 22, 2016 at 5:56 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>>
>>> This should work, and I don't think it triggers any actions:
>>>
>>> df.rdd.partitions.length
>>>
>>> On Fri, Jul 22, 2016 at 2:20 PM, Neil Chang <iam...@gmail.com> wrote:
>>>
>>>> It seems no function does this in the Spark 2.0 preview?
>>>>
>>>
>>>
>>>
>>> --
>>> Pedro Rodriguez
>>> PhD Student in Distributed Machine Learning | CU Boulder
>>> UC Berkeley AMPLab Alumni
>>>
>>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>>> Github: github.com/EntilZha | LinkedIn:
>>> https://www.linkedin.com/in/pedrorodriguezscience
>>>
>>>
>>
>
>
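
The mapPartitions point in the quoted reply (reuse one expensive resource
per partition instead of creating it per element) can be sketched in plain
Python. This is only an imitation of the pattern, not Spark code:
FakeConnection, map_per_element and map_per_partition are hypothetical
names, and the nested lists stand in for partitions.

```python
class FakeConnection:
    """Stand-in for an expensive resource such as a database connection."""

    opened = 0  # class-level counter of how many connections were created

    def __init__(self):
        FakeConnection.opened += 1

    def lookup(self, key):
        return key * 10


def map_per_element(partitions):
    # Plain map: a new connection is opened for every single element.
    return [[FakeConnection().lookup(x) for x in part] for part in partitions]


def map_per_partition(partitions):
    # mapPartitions style: one connection per partition, reused for its elements.
    results = []
    for part in partitions:
        conn = FakeConnection()
        results.append([conn.lookup(x) for x in part])
    return results


partitions = [[1, 2, 3], [4, 5]]

FakeConnection.opened = 0
per_element = map_per_element(partitions)
opened_per_element = FakeConnection.opened  # one connection per element

FakeConnection.opened = 0
per_partition = map_per_partition(partitions)
opened_per_partition = FakeConnection.opened  # one connection per partition
```

Both variants return the same values; only the number of connections
opened differs, which is where the savings would come from.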
