In case anyone wants to learn about my solution for this: groupByKey is highly inefficient, because it shuffles elements between partitions and requires each worker to have enough memory to hold all of the elements for a group. So instead of using groupByKey, I took the flatMap result and applied subtractByKey in such a way that I ended up with multiple RDDs, each containing only the key I wanted. Now I can iterate over each RDD independently and end up with multiple Parquet files.
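As a rough illustration of the idea, here is a minimal sketch in plain Python over (key, value) pairs rather than real RDDs; the function name `split_by_keys` and the sample data are hypothetical, not from Spark:

```python
# Sketch: split a collection of (key, value) pairs into one
# sub-collection per requested key. On a real RDD the same effect
# can be achieved with one rdd.filter(...) per key (or, as described
# above, with repeated subtractByKey calls).

def split_by_keys(pairs, keys):
    """Return {key: [values]} for each requested key."""
    result = {k: [] for k in keys}
    for k, v in pairs:
        if k in result:
            result[k].append(v)
    return result

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = split_by_keys(pairs, ["a", "b"])
# parts["a"] == [1, 3]; parts["b"] == [2]; "c" was not requested
```

With real RDDs the equivalent would be roughly `[rdd.filter(lambda kv, k=k: kv[0] == k) for k in keys]`, after which each resulting RDD can be written to its own Parquet file.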
I'm thinking of submitting a splitByKeys() pull request: it would take an array of keys and an RDD, and return an array of RDDs, each containing only one of the keys. Any thoughts on this?

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-ResultIterable-and-taking-a-list-and-saving-it-into-different-parquet-files-tp22152p22189.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.