I have a SchemaRDD with 100 records in 1 partition; we'll call this baseline. I have a second SchemaRDD with 11 records in 1 partition; we'll call this daily. I join the two tables with a fairly basic query:

JavaSchemaRDD results = sqlContext.sql(
    "SELECT id, action, daily.epoch, daily.version FROM baseline, daily "
  + "WHERE key = id AND action = 'u' AND daily.epoch > baseline.epoch").cache();
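For reference, here is a minimal, self-contained version of what I'm running. The bean classes and the sample rows below are simplified stand-ins for my real schemas (including which columns live in which table), but the query and the master URL match my actual job:

import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.JavaSQLContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;

public class JoinRepro {
    // Stand-in bean for the baseline table: id, action, epoch.
    public static class Baseline implements Serializable {
        private String id; private String action; private long epoch;
        public Baseline(String id, String action, long epoch) {
            this.id = id; this.action = action; this.epoch = epoch;
        }
        public String getId() { return id; }
        public String getAction() { return action; }
        public long getEpoch() { return epoch; }
    }

    // Stand-in bean for the daily table: key, version, epoch.
    public static class Daily implements Serializable {
        private String key; private long version; private long epoch;
        public Daily(String key, long version, long epoch) {
            this.key = key; this.version = version; this.epoch = epoch;
        }
        public String getKey() { return key; }
        public long getVersion() { return version; }
        public long getEpoch() { return epoch; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("join-repro").setMaster("local[1]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaSQLContext sqlContext = new JavaSQLContext(sc);

        // One record per table here; my real data has 100 and 11 records,
        // each in a single partition (hence the explicit numSlices of 1).
        JavaSchemaRDD baseline = sqlContext.applySchema(
            sc.parallelize(Arrays.asList(new Baseline("a", "u", 1L)), 1), Baseline.class);
        baseline.registerTempTable("baseline");

        JavaSchemaRDD daily = sqlContext.applySchema(
            sc.parallelize(Arrays.asList(new Daily("a", 2L, 5L)), 1), Daily.class);
        daily.registerTempTable("daily");

        JavaSchemaRDD results = sqlContext.sql(
            "SELECT id, action, daily.epoch, daily.version FROM baseline, daily "
          + "WHERE key = id AND action = 'u' AND daily.epoch > baseline.epoch").cache();

        // This is where I observe the surprising partition count.
        System.out.println("records    = " + results.count());
        System.out.println("partitions = " + results.partitions().size());

        sc.stop();
    }
}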
I get a new SchemaRDD, results, with only 6 records, yet the RDD has 200 partitions, and when the job runs I can see that 200 tasks were used to perform the join. Does this make sense? I'm not currently doing anything special with partitioning (such as hash partitioning). And even if 200 tasks really were required, shouldn't some of the resulting empty partitions have been 'deleted', given that the result holds only 6 records? I'm using Apache Spark 1.1 and running in local mode (local[1]). Any insight would be appreciated. Thanks. Darin.