That's how it's intended to work: groupByKey materializes all values for a key. If that's a problem, you probably need to redesign your computation to avoid groupByKey. Usually you can do so.
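For example, if your mapToPair computes an aggregate per key, you can often express it with reduceByKey or aggregateByKey instead, which merge values incrementally on the map side rather than building the full list per key. A minimal sketch, assuming a per-key sum over made-up data (the class name and values here are just for illustration):

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SumByKey {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[*]", "SumByKey");

    // Illustrative input: three (key, value) pairs.
    JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("a", 1), new Tuple2<>("a", 2), new Tuple2<>("b", 3)));

    // Unlike groupByKey().mapToPair(...), reduceByKey never collects the
    // full set of values for a key in memory; partial results are combined
    // as values arrive, both before and after the shuffle.
    JavaPairRDD<String, Integer> sums = pairs.reduceByKey((x, y) -> x + y);

    sums.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
    sc.stop();
  }
}

If your per-key result type differs from the value type, aggregateByKey or combineByKey give you the same incremental merging with separate create/merge functions.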
On Mon, Sep 7, 2015 at 9:02 AM, kaklakariada <christoph.pi...@gmail.com> wrote:
> Hi,
>
> I already posted this question on the users mailing list
> (http://apache-spark-user-list.1001560.n3.nabble.com/Using-groupByKey-with-many-values-per-key-td24538.html)
> but did not get a reply. Maybe this is the correct forum to ask.
>
> My problem is that groupByKey().mapToPair() loads all values for a
> key into memory, which is a problem when the values don't fit into memory.
> This was not a problem with Hadoop map/reduce, as the Iterable passed to
> the reducer read from disk.
>
> In Spark, the Iterable passed to mapToPair() is backed by a CompactBuffer
> containing all values.
>
> Is it possible to change this behavior without modifying Spark, or is
> there a plan to change this?
>
> Thank you very much for your help!
> Christoph.