You could use a different file format, and the Dataset or DataFrame API instead of an RDD.
> On 14 Apr 2016, at 23:21, Bibudh Lahiri <[email protected]> wrote:
>
> Hi,
> As part of a larger program, I am extracting the distinct values of some
> columns of an RDD with 100 million records and 4 columns. I am running Spark
> in standalone cluster mode on my laptop (2.3 GHz Intel Core i7, 10 GB 1333
> MHz DDR3 RAM) with all the 8 cores given to a single worker. So my statement
> is something like this:
>
> age_groups = patients_rdd.map(lambda x: x.split(",")).map(lambda x: x[1]).distinct()
>
> It is taking about 3.8 minutes. It is spawning 89 tasks when dealing with
> this RDD because (I guess) the block size is 32 MB, and the entire file is
> 2.8 GB, so there are 2.8*1024/32 = 89 blocks. The ~4 minute time means it is
> processing about 50k records per second per core/task.
>
> Does this performance look typical or is there room for improvement?
>
> Thanks
> Bibudh
>
> --
> Bibudh Lahiri
> Data Scientist, Impetus Technologies
> 5300 Stevens Creek Blvd
> San Jose, CA 95129
> http://knowthynumbers.blogspot.com/
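For reference, the back-of-the-envelope numbers in the quoted message can be reproduced with a few lines of plain Python. This only redoes the poster's arithmetic from the figures given (2.8 GB file, 32 MB blocks, 100 million records, 3.8 minutes, 8 cores); the function names are made up for illustration, and the real number of tasks depends on how Spark splits the input.

```python
def estimated_blocks(file_size_gb, block_size_mb):
    """File size divided by block size, as in the quoted estimate."""
    return file_size_gb * 1024 / block_size_mb


def records_per_second_per_core(records, elapsed_minutes, cores):
    """Average throughput per core, assuming all cores are busy throughout."""
    return records / (elapsed_minutes * 60) / cores


if __name__ == "__main__":
    blocks = estimated_blocks(2.8, 32)
    rate = records_per_second_per_core(100_000_000, 3.8, 8)
    print(f"estimated blocks: {blocks:.1f}")
    print(f"records per second per core: {rate:,.0f}")
```

The block estimate works out to about 89.6, which matches the 89 tasks observed, and the throughput is a little under 55k records per second per core, in line with the "about 50k" figure.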
