I have a SchemaRDD with 100 records in 1 partition. We'll call this baseline.
I have a SchemaRDD with 11 records in 1 partition. We'll call this daily.
After a fairly basic join of these two tables
JavaSchemaRDD results = sqlContext.sql("SELECT id, action, daily.epoch,
daily.version FROM baseline, daily WHERE key=id AND action='u' AND daily.epoch
> baseline.epoch").cache();
I get a new SchemaRDD results with only 6 records (and the RDD has 200
partitions). When the job runs, I can see that 200 tasks were used to do this
join. Does this make sense? I'm currently not doing anything special along the
lines of partitioning (such as hash). Even if 200 tasks would have been
required, since the result is only 6 (shouldn't some of these empty partitions
been 'deleted').
I'm using Apache Spark 1.1 and I'm running this in local mode (localhost[1]).
Any insight would be appreciated.
Thanks.
Darin.