Consider the following left outer join:

potentialDailyModificationsRDD = reducedDailyPairRDD
    .leftOuterJoin(baselinePairRDD)
    .partitionBy(new HashPartitioner(1024))
    .persist(StorageLevel.MEMORY_AND_DISK_SER());


Below are the record counts for the RDDs involved:
Number of records for reducedDailyPairRDD: 2565206
Number of records for baselinePairRDD: 56102812
Number of records for potentialDailyModificationsRDD: 2570115

Below are the partitioners for the RDDs involved:
Partitioner for reducedDailyPairRDD: Some(org.apache.spark.HashPartitioner@400)
Partitioner for baselinePairRDD: Some(org.apache.spark.HashPartitioner@400)
Partitioner for potentialDailyModificationsRDD: Some(org.apache.spark.HashPartitioner@400)


I realize that the .partitionBy in the statement above is probably not needed, since the RDDs going into the left outer join are already hash partitioned the same way; a simplified version is sketched below.
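
For what it's worth, here is a minimal sketch of the simplified statement I have in mind (the String key/value types and the JoinSketch class name are placeholders, not my actual types; in Spark 1.2 the Java leftOuterJoin wraps the right-hand values in Guava's Optional):

import com.google.common.base.Optional;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

// Placeholder key/value types, for illustration only.
public class JoinSketch {
    public static JavaPairRDD<String, Tuple2<String, Optional<String>>> joinWithoutRepartition(
            JavaPairRDD<String, String> reducedDailyPairRDD,
            JavaPairRDD<String, String> baselinePairRDD) {
        // Both inputs already share a HashPartitioner with 1024 partitions, so
        // leftOuterJoin should reuse that partitioner; no extra partitionBy is needed.
        return reducedDailyPairRDD
                .leftOuterJoin(baselinePairRDD)
                .persist(StorageLevel.MEMORY_AND_DISK_SER());
    }
}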

My question is how the resulting RDD (potentialDailyModificationsRDD) can end up with more records than reducedDailyPairRDD. I would have expected potentialDailyModificationsRDD to contain 2565206 records instead of 2570115. Am I missing something, or is this possibly a bug?
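
One thing I have not ruled out is that baselinePairRDD holds more than one record for some keys; if I understand leftOuterJoin correctly, it emits one output record for every matching (left, right) pair, so duplicate keys on the right side would push the output count above the left input's count. A hypothetical check (KeyCardinalityCheck is just an illustrative name) would be something like:

import org.apache.spark.api.java.JavaPairRDD;

// Hypothetical diagnostic: compare total records with distinct keys on each side of the join.
public class KeyCardinalityCheck {
    public static <K, V> void report(String name, JavaPairRDD<K, V> rdd) {
        long records = rdd.count();
        long distinctKeys = rdd.keys().distinct().count();
        System.out.println(name + ": " + records + " records, " + distinctKeys + " distinct keys");
    }
}

If baselinePairRDD were to report fewer distinct keys than records, that alone could account for the extra 4909 output records.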

I'm using Apache Spark 1.2 on a standalone cluster on EC2. The record counts above were obtained by calling .count() on each RDD.
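
For completeness, the counts were produced with calls along these lines, using the RDD variables from the join statement above:

System.out.println("Number of records for reducedDailyPairRDD: " + reducedDailyPairRDD.count());
System.out.println("Number of records for baselinePairRDD: " + baselinePairRDD.count());
System.out.println("Number of records for potentialDailyModificationsRDD: " + potentialDailyModificationsRDD.count());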

Thanks.

Darin.
