Consider the following left outer join:

potentialDailyModificationsRDD = reducedDailyPairRDD
    .leftOuterJoin(baselinePairRDD)
    .partitionBy(new HashPartitioner(1024))
    .persist(StorageLevel.MEMORY_AND_DISK_SER());
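In case it helps, here is a self-contained toy version of what I'm doing (simplified key/value types, a local master, and tiny hard-coded inputs; the real RDDs come from earlier stages of the job). It builds the same join/partition/persist chain and prints the .count() values I'm quoting below:

// Toy stand-in for the pipeline above (hypothetical types and toy data).
import java.util.Arrays;

import com.google.common.base.Optional;   // Spark 1.2's Java leftOuterJoin wraps the right side in Guava's Optional
import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

public class LeftOuterJoinCountCheck {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("left-outer-join-count-check")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Stand-ins for reducedDailyPairRDD (left side) and baselinePairRDD (right side).
        JavaPairRDD<String, String> reducedDailyPairRDD = sc.parallelizePairs(Arrays.asList(
                new Tuple2<String, String>("k1", "daily1"),
                new Tuple2<String, String>("k2", "daily2")))
                .partitionBy(new HashPartitioner(4));

        JavaPairRDD<String, String> baselinePairRDD = sc.parallelizePairs(Arrays.asList(
                new Tuple2<String, String>("k1", "base1"),
                new Tuple2<String, String>("k3", "base3")))
                .partitionBy(new HashPartitioner(4));

        // Same shape as the real statement: left outer join, repartition, persist.
        JavaPairRDD<String, Tuple2<String, Optional<String>>> potentialDailyModificationsRDD =
                reducedDailyPairRDD
                        .leftOuterJoin(baselinePairRDD)
                        .partitionBy(new HashPartitioner(4))
                        .persist(StorageLevel.MEMORY_AND_DISK_SER());

        // This is how I'm getting the numbers quoted below.
        System.out.println("reducedDailyPairRDD:            " + reducedDailyPairRDD.count());
        System.out.println("baselinePairRDD:                " + baselinePairRDD.count());
        System.out.println("potentialDailyModificationsRDD: " + potentialDailyModificationsRDD.count());

        sc.stop();
    }
}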
Below are the record counts for the RDDs involved:

Number of records for reducedDailyPairRDD: 2565206
Number of records for baselinePairRDD: 56102812
Number of records for potentialDailyModificationsRDD: 2570115

Below are the partitioners for the RDDs involved:

Partitioner for reducedDailyPairRDD: Some(org.apache.spark.HashPartitioner@400)
Partitioner for baselinePairRDD: Some(org.apache.spark.HashPartitioner@400)
Partitioner for potentialDailyModificationsRDD: Some(org.apache.spark.HashPartitioner@400)

I realize that the .partitionBy in the statement above is probably not needed, since the underlying RDDs used in the left outer join are already hash partitioned.

My question is how the resulting RDD (potentialDailyModificationsRDD) can end up with more records than reducedDailyPairRDD. I would expect potentialDailyModificationsRDD to contain 2565206 records, not 2570115. Am I missing something, or is this possibly a bug?

I'm using Apache Spark 1.2 on a standalone cluster on EC2. To get the record counts, I'm calling .count() on each RDD.

Thanks.

Darin.
