You could use a different file format and the Dataset or DataFrame API instead of an RDD.
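
A minimal sketch of the DataFrame route, assuming Spark 2.x or later (where `SparkSession` and `spark.read.csv` are available) and a headerless CSV, so that Spark names the columns `_c0`, `_c1`, and so on. The function names, the application name, and the path `patients.csv` are placeholders for illustration, not anything from the original job:

```python
def distinct_age_groups(spark, csv_path):
    """Return the distinct values of the second CSV column as a DataFrame."""
    # Without a header or an explicit schema, Spark names columns _c0, _c1, ...
    # so _c1 corresponds to x[1] in the RDD version.
    df = spark.read.csv(csv_path)
    return df.select("_c1").distinct()


def main():
    # Imported here so the helper above can be defined without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("distinct-age-groups").getOrCreate()
    # "patients.csv" is a placeholder path, not the original poster's file.
    distinct_age_groups(spark, "patients.csv").show()
    spark.stop()
```

Calling `main()` on a machine with PySpark installed runs the query. Whether this is faster than the RDD version on a given dataset would need to be measured; writing the data once to a columnar format such as Parquet and reading only the needed column is another thing to try.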

> On 14 Apr 2016, at 23:21, Bibudh Lahiri <[email protected]> wrote:
> 
> Hi,
>     As part of a larger program, I am extracting the distinct values of some 
> columns of an RDD with 100 million records and 4 columns. I am running Spark 
> in standalone cluster mode on my laptop (2.3 GHz Intel Core i7, 10 GB 1333 
> MHz DDR3 RAM) with all the 8 cores given to a single worker. So my statement 
> is something like this:
> 
> age_groups = patients_rdd.map(lambda x: x.split(",")).map(lambda x: x[1]).distinct()
> 
>    It is taking about 3.8 minutes. It is spawning 89 tasks when dealing with 
> this RDD because (I guess) the block size is 32 MB, and the entire file is 
> 2.8 GB, so there are 2.8*1024/32 = 89 blocks. The ~4 minute time means it is 
> processing about 50k records per second per core/task.
> 
>    Does this performance look typical or is there room for improvement?
> 
> Thanks
>             Bibudh
> 
>    
> 
> -- 
> Bibudh Lahiri
> Data Scientist, Impetus Technologies
> 5300 Stevens Creek Blvd
> San Jose, CA 95129
> http://knowthynumbers.blogspot.com/
>  
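
As a rough sanity check on the figures in the quoted message, the block count and per-core throughput can be recomputed from the numbers given there (2.8 GB file, 32 MB blocks, 100 million records, 3.8 minutes, 8 cores). The variable names are only for illustration:

```python
# Figures taken from the quoted message.
file_size_mb = 2.8 * 1024      # 2.8 GB expressed in MB
block_size_mb = 32
records = 100_000_000
elapsed_s = 3.8 * 60           # 3.8 minutes in seconds
cores = 8

# Number of 32 MB blocks the file spans (fractional, before rounding).
blocks = file_size_mb / block_size_mb

# Average records handled per second by each core.
per_core = records / elapsed_s / cores

print(f"blocks: {blocks:.1f}")
print(f"records per second per core: {per_core:.0f}")
```

This gives roughly 89.6 blocks and about 55k records per second per core, which is consistent with the 89 tasks and the "about 50k" estimate in the message if the file is slightly under 2.8 GB.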
