I have a SchemaRDD with 100 records in 1 partition; we'll call this baseline. I have a second SchemaRDD with 11 records in 1 partition; we'll call this daily. I join the two tables with a fairly basic query:

JavaSchemaRDD results = sqlContext.sql(
    "SELECT id, action, daily.epoch, daily.version FROM baseline, daily "
  + "WHERE key = id AND action = 'u' AND daily.epoch > baseline.epoch").cache();
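For reference, here is a minimal, self-contained version of what I'm running. The bean classes and the sample rows below are simplified stand-ins for my real schemas (including which columns live in which table), but the query and the master URL match my actual job:

import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;

public class JoinRepro {
    // Stand-in bean for the baseline table: id, action, epoch.
    public static class Baseline implements Serializable {
        private String id; private String action; private long epoch;
        public Baseline(String id, String action, long epoch) {
            this.id = id; this.action = action; this.epoch = epoch;
        }
        public String getId() { return id; }
        public String getAction() { return action; }
        public long getEpoch() { return epoch; }
    }

    // Stand-in bean for the daily table: key, version, epoch.
    public static class Daily implements Serializable {
        private String key; private long version; private long epoch;
        public Daily(String key, long version, long epoch) {
            this.key = key; this.version = version; this.epoch = epoch;
        }
        public String getKey() { return key; }
        public long getVersion() { return version; }
        public long getEpoch() { return epoch; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("join-repro").setMaster("local[1]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaSQLContext sqlContext = new JavaSQLContext(sc);

        // One record per table here; my real data has 100 and 11 records,
        // each in a single partition (hence the explicit numSlices of 1).
        JavaSchemaRDD baseline = sqlContext.applySchema(
            sc.parallelize(Arrays.asList(new Baseline("a", "u", 1L)), 1), Baseline.class);
        baseline.registerTempTable("baseline");

        JavaSchemaRDD daily = sqlContext.applySchema(
            sc.parallelize(Arrays.asList(new Daily("a", 2L, 5L)), 1), Daily.class);
        daily.registerTempTable("daily");

        JavaSchemaRDD results = sqlContext.sql(
            "SELECT id, action, daily.epoch, daily.version FROM baseline, daily "
          + "WHERE key = id AND action = 'u' AND daily.epoch > baseline.epoch").cache();

        // This is where I observe the surprising partition count.
        System.out.println("records    = " + results.count());
        System.out.println("partitions = " + results.partitions().size());

        sc.stop();
    }
}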
I get a new SchemaRDD, results, with only 6 records, yet the RDD has 200 partitions, and when the job runs I can see that 200 tasks were used to perform the join. Does this make sense? I'm not currently doing anything special with partitioning (such as hash partitioning). And even if 200 tasks really were required, shouldn't some of the resulting empty partitions have been 'deleted', given that the result holds only 6 records? I'm using Apache Spark 1.1 and running in local mode (local[1]). Any insight would be appreciated. Thanks. Darin.