Hi,

As part of a larger program, I am extracting the distinct values of some columns of an RDD with 100 million records and 4 columns. I am running Spark in standalone cluster mode on my laptop (2.3 GHz Intel Core i7, 10 GB 1333 MHz DDR3 RAM), with all 8 cores given to a single worker. So my statement is something like this:
age_groups = patients_rdd.map(lambda x: x.split(",")).map(lambda x: x[1]).distinct()

It is taking about 3.8 minutes. Spark spawns 89 tasks for this RDD because (I guess) the block size is 32 MB and the entire file is 2.8 GB, so there are 2.8*1024/32 ≈ 89 blocks. At ~4 minutes, that works out to about 50k records per second per core/task. Does this performance look typical, or is there room for improvement?

Thanks
Bibudh

--
Bibudh Lahiri
Data Scientist, Impetus Technologies
5300 Stevens Creek Blvd
San Jose, CA 95129
http://knowthynumbers.blogspot.com/
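P.S. For reference, here is a minimal, self-contained sketch of how I am running and timing this. The file path, the app name, and the count() action used to force evaluation of the otherwise lazy pipeline are placeholders rather than my exact code; I have also fused the two maps into one, which I believe should not change the physical plan, since Spark pipelines narrow transformations within a single stage anyway:

    import time
    from pyspark import SparkContext

    sc = SparkContext(appName="distinct-age-groups")  # app name is a placeholder

    # ~2.8 GB, 4-column CSV; "patients.csv" stands in for my actual path
    patients_rdd = sc.textFile("patients.csv")
    print(patients_rdd.getNumPartitions())  # prints 89 for me (one partition per 32 MB block)

    start = time.time()
    # Split each line on commas and keep only the second column, then dedupe.
    # distinct() is lazy, so count() is used here to force the computation.
    age_groups = patients_rdd.map(lambda line: line.split(",")[1]).distinct()
    print(age_groups.count())
    print("elapsed: %.1f s" % (time.time() - start))

    sc.stop()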