I am running Spark over Cassandra to process a single table.
My task reads a single day's worth of data from the table and performs 50
group-by and distinct operations, counting distinct userIds by different
grouping keys.
My code looks like this:
// Reads the data from the table
JavaRDD<Row> rdd = sc.parallelize().mapPartitions().cache();
for each groupingKey {
    JavaPairRDD<GroupingKey, UserId> groupByRdd = rdd.mapToPair();
    // Counts distinct values per grouping key
    JavaPairRDD<GroupingKey, Long> countRdd =
        groupByRdd.distinct().mapToPair().reduceByKey();
}
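To make the intended result of each loop iteration concrete, here is a minimal plain-Java sketch (no Spark, Java 9+) of what the distinct() + reduceByKey() chain above is meant to compute. The class name DistinctCountSketch, the method countDistinctPerKey, and the use of String for both the grouping key and the user ID are illustrative assumptions, not part of my actual job.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only: counts distinct user IDs per grouping key.
// String keys and String user IDs are assumptions made for this sketch.
class DistinctCountSketch {

    // Each entry is a (groupingKey, userId) pair, like the output of rdd.mapToPair().
    static Map<String, Long> countDistinctPerKey(List<Map.Entry<String, String>> pairs) {
        // Equivalent of distinct(): drop duplicate (groupingKey, userId) pairs.
        Set<Map.Entry<String, String>> distinctPairs = new HashSet<>(pairs);

        // Equivalent of mapping each pair to (groupingKey, 1) and summing per key.
        Map<String, Long> counts = new HashMap<>();
        for (Map.Entry<String, String> pair : distinctPairs) {
            counts.merge(pair.getKey(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = List.of(
            Map.entry("US", "alice"),
            Map.entry("US", "alice"),   // duplicate pair, counted once
            Map.entry("US", "bob"),
            Map.entry("DE", "carol"));
        System.out.println(countDistinctPerKey(pairs));
    }
}
```

In the real job this computation runs once per grouping key, so it is repeated 50 times over the same cached data.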
The distinct() stage takes about 2 minutes for each grouping key, and my task
takes well over an hour to complete.
My cluster has 4 nodes with 30 GB of RAM per Spark process. The table size is
4 GB.
How can I identify the bottleneck more accurately? Is it caused by shuffling
data?
How can I improve the performance?
Thanks,
Oded