One use case for dapply is fitting a linear regression on each of many small partitions. I think plain R can do that in parallel too, but I have heard dapply is faster.
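
To make the idea concrete, here is a rough sketch of "fit one small regression per partition" in plain Python. It is not SparkR or Spark code: the partitions are ordinary lists, and `fit_line` and `apply_per_partition` are hypothetical helper names I made up to stand in for the function you would hand to dapply. It assumes each partition holds at least two (x, y) points with differing x values.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]   # one (x, y) observation
Model = Tuple[float, float]   # (intercept, slope)


def fit_line(points: Sequence[Point]) -> Model:
    """Ordinary least squares for y = intercept + slope * x on one partition."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def apply_per_partition(
    partitions: Sequence[Sequence[Point]],
    func: Callable[[Sequence[Point]], Model],
) -> List[Model]:
    """Hypothetical stand-in for dapply: call func once per partition.

    The calls run sequentially here; a cluster would run them in parallel.
    """
    return [func(part) for part in partitions]


# Two toy partitions (assumed data, chosen so the fits are exact).
partitions = [
    [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],  # lies on y = 1 + 2x
    [(0.0, 4.0), (2.0, 3.0), (4.0, 2.0)],  # lies on y = 4 - 0.5x
]

models = apply_per_partition(partitions, fit_line)
for i, (intercept, slope) in enumerate(models):
    print(f"partition {i}: intercept={intercept:.2f}, slope={slope:.2f}")
```

The function only ever sees the rows of one partition, so it does not need to know how the data was split, only that every partition it receives is large enough to fit.
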

On Friday, July 22, 2016, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:

> I haven't used SparkR/R before, only Scala/Python APIs so I don't know for
> sure.
>
> I am guessing if things are in a DataFrame they were read either from some
> disk source (S3/HDFS/file/etc) or they were created from parallelize. If
> you are using the first, Spark will for the most part choose a reasonable
> number of partitions while for parallelize I think it depends on what your
> min parallelism is set to.
>
> In my brief google it looks like dapply is an analogue of mapPartitions.
> Usually the reason to use this is if your map operation has some expensive
> initialization function. For example, you need to open a connection to a
> database so its better to re-use that connection for one partition's
> elements than create it for each element.
>
> What are you trying to accomplish with dapply?
>
> On Fri, Jul 22, 2016 at 8:05 PM, Neil Chang <iam...@gmail.com> wrote:
>
>> Thanks Pedro,
>> so to use sparkR dapply on SparkDataFrame, don't we need partition the
>> DataFrame first? the example in doc doesn't seem to do this.
>> Without knowing how it partitioned, how can one write the function to
>> process each partition?
>>
>> Neil
>>
>> On Fri, Jul 22, 2016 at 5:56 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>>
>>> This should work and I don't think triggers any actions:
>>>
>>> df.rdd.partitions.length
>>>
>>> On Fri, Jul 22, 2016 at 2:20 PM, Neil Chang <iam...@gmail.com> wrote:
>>>
>>>> Seems no function does this in Spark 2.0 preview?
>>>
>>> --
>>> Pedro Rodriguez
>>> PhD Student in Distributed Machine Learning | CU Boulder
>>> UC Berkeley AMPLab Alumni
>>>
>>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>>> Github: github.com/EntilZha | LinkedIn:
>>> https://www.linkedin.com/in/pedrorodriguezscience