groupByKey() takes a numPartitions parameter, so you can set it to control how many partitions the grouped RDD has. If you don't set it, the resulting RDD ends up with the same number of partitions as the parent RDD. Either way, each key is hashed into one of those partitions, so you don't get a single partition per key.
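For what it's worth, here is a minimal PySpark sketch of what I mean. The HDFS path, the parser, and the partition count of 100 are all placeholders, not something from your job:

    from pyspark import SparkContext

    sc = SparkContext(appName="groupByKeyPartitions")

    # Hypothetical parser: turn each text line into a (key, value) pair.
    def parser_func(line):
        fields = line.split("\t")
        return fields[0], fields[1]

    # Passing numPartitions spreads the grouped data over 100 partitions
    # instead of inheriting the partition count of the parent RDD.
    rdd = (sc.textFile("hdfs://namenode/path/to/data")  # placeholder path
             .map(parser_func)
             .groupByKey(numPartitions=100))

    print(rdd.getNumPartitions())  # should print 100

So a key whose values don't fit in memory on one node can still be a problem with groupByKey, but the number of partitions itself is under your control.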
Joe L wrote:
> I want to apply the following transformations to 60 GB of data on 7 nodes
> with 10 GB of memory each. I am wondering if the groupByKey() function
> returns an RDD with a single partition for each key? If so, what will
> happen if the size of the partition doesn't fit on that particular node?
>
> rdd = sc.textFile("hdfs//.....").map(parserFunc).groupByKey()

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupByKey-returns-a-single-partition-in-a-RDD-tp4264p4307.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.