I have a 10-node cluster with 600 GB of RAM. I'm loading a fairly large
dataset from JSON files. When I load the dataset it is about 200 GB, but it
only creates 60 partitions. I'm trying to repartition to 256 to increase
CPU utilization, but when I do that, memory usage balloons to well over 2x
the original size.
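For reference, the load-and-repartition step looks roughly like this (assuming the same sql_context.jsonFile load as in the snippet further down; uri is a placeholder):

rdd = sql_context.jsonFile(uri)   # ~200 GB of JSON, comes back as only ~60 partitions
rdd = rdd.repartition(256)        # this is the step where memory balloons to well over 2x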
When I run a groupByKey, it seems to create a single task after the
groupByKey that never stops executing. I'm loading a smallish JSON dataset
of about 4 million records. This is the code I'm running:
rdd = sql_context.jsonFile(uri)                              # load the JSON dataset
rdd = rdd.cache()
grouped = rdd.map(lambda row: (row.id, row)).groupByKey()    # key each row by its id, then group