In case anyone wants to learn about my solution for this:
groupByKey is highly inefficient: it shuffles elements between the
different partitions and requires enough memory on each worker to
hold all of the elements for a single group. So instead of using
groupByKey, I took the flatMap result and applied subtractByKey in
such a way that I ended up with multiple RDDs, each containing only
the key I wanted. Now I can iterate over each RDD independently and
write out multiple Parquet files.

I'm thinking of submitting a splitByKeys() pull request that would take
an array of keys and an RDD, and return an array of RDDs, each
containing only one of the keys. Any thoughts on this?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-ResultIterable-and-taking-a-list-and-saving-it-into-different-parquet-files-tp22152p22189.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
