Repartition Memory Leak

2015-01-04 Thread Brad Willard
I have a 10-node cluster with 600 GB of RAM. I'm loading a fairly large dataset from JSON files. When I load the dataset it is about 200 GB, but it only creates 60 partitions. I'm trying to repartition to 256 to increase CPU utilization, but when I do that it balloons in memory to way over 2x t
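
A minimal sketch of the load-then-repartition pattern described above, assuming Spark 1.x-era PySpark (SQLContext.jsonFile) and a hypothetical input path, not the author's exact code. repartition(256) performs a full shuffle, so every row is serialized and moved between executors, which is where the reported memory blow-up would show up:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="repartition-sketch")
    sql_context = SQLContext(sc)

    # jsonFile partitions the data by input split (one partition per file for
    # non-splittable files), which is how a large dataset can end up with
    # relatively few partitions.
    rdd = sql_context.jsonFile("hdfs:///data/events/*.json")  # hypothetical path

    # repartition triggers a full shuffle of every row to build 256 partitions.
    repartitioned = rdd.repartition(256)
    repartitioned.count()  # force evaluation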

PySpark Loading JSON Followed by groupByKey seems broken in Spark 1.1.1

2014-12-06 Thread Brad Willard
When I run a groupByKey it seems to create a single task after the groupByKey that never stops executing. I'm loading a smallish JSON dataset of about 4 million records. This is the code I'm running:

    rdd = sql_context.jsonFile(uri)
    rdd = rdd.cache()
    grouped = rdd.map(lambda row: (row.id, row)).groupByKey
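
A hedged reconstruction of the pattern the excerpt describes (the original snippet is cut off above); the URI and the final action are assumptions, not the author's code:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="groupbykey-sketch")
    sql_context = SQLContext(sc)

    uri = "hdfs:///data/small.json"   # hypothetical location
    rdd = sql_context.jsonFile(uri)   # SchemaRDD of Row objects
    rdd = rdd.cache()

    # Key each Row by its id field and group. groupByKey shuffles every value
    # for a given key to a single task, so skewed keys (many rows sharing one
    # id) can leave one task running long after the others finish.
    grouped = rdd.map(lambda row: (row.id, row)).groupByKey()

    # Force evaluation and show per-key counts for a few keys.
    print(grouped.mapValues(len).take(5))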