Hi,

As part of a larger program, I am extracting the distinct values of some columns of an RDD with 100 million records and 4 columns. I am running Spark in standalone cluster mode on my laptop (2.3 GHz Intel Core i7, 10 GB 1333 MHz DDR3 RAM), with all 8 cores given to a single worker. So my statement is something like this:
age_groups = patients_rdd.map(lambda x: x.split(",")).map(lambda x: x[1]).distinct()

It is taking about 3.8 minutes. Spark spawns 89 tasks for this RDD because (I guess) the block size is 32 MB and the entire file is 2.8 GB, so there are 2.8*1024/32 ≈ 89 blocks. At ~4 minutes, that works out to about 50k records per second per core/task. Does this performance look typical, or is there room for improvement?

Thanks
Bibudh

--
Bibudh Lahiri
Data Scientist, Impetus Technologies
5300 Stevens Creek Blvd
San Jose, CA 95129
http://knowthynumbers.blogspot.com/
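P.S. For reference, here is a minimal, self-contained sketch of how I am running and timing this. The file path, the app name, and the count() action used to force evaluation of the otherwise lazy pipeline are placeholders rather than my exact code; I have also fused the two maps into one, which I believe should not change the physical plan, since Spark pipelines narrow transformations within a single stage anyway:

    import time
    from pyspark import SparkContext

    sc = SparkContext(appName="distinct-age-groups")  # app name is a placeholder

    # ~2.8 GB, 4-column CSV; "patients.csv" stands in for my actual path
    patients_rdd = sc.textFile("patients.csv")
    print(patients_rdd.getNumPartitions())  # prints 89 for me (one partition per 32 MB block)

    start = time.time()
    # Split each line on commas and keep only the second column, then dedupe.
    # distinct() is lazy, so count() is used here to force the computation.
    age_groups = patients_rdd.map(lambda line: line.split(",")[1]).distinct()
    print(age_groups.count())
    print("elapsed: %.1f s" % (time.time() - start))

    sc.stop()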