I have been stuck on this problem for the last few days: I am attempting to run MLlib's random forest. It gets through most of the training, but breaks during a mapPartitions operation with the following stack trace:
An error occurred while calling o94.trainRandomForestModel.
: java.lang.OutOfMemoryError
        at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
        at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
        at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:2021)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
        at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
        at org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:625)
        at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235)
        at org.apache.spark.mllib.tree.RandomForest$.trainClassifier(RandomForest.scala:291)
        at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainRandomForestModel(PythonMLLibAPI.scala:742)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)

It seems to me that Spark is trying to serialize the mapPartitions closure but runs out of space while doing so. However, I don't understand how it could run out of space when I gave the driver ~190GB for a file that's 45MB.

My cluster is set up on AWS with an r3.8xlarge master and two r3.4xlarge workers. I am running with the following configuration:

Spark version: 1.5.0

spark.executor.memory          32000m
spark.driver.memory            230000m
spark.driver.cores             10
spark.executor.cores           5
spark.executor.instances       17
spark.driver.maxResultSize     0
spark.storage.safetyFraction   1
spark.storage.memoryFraction   0.9
spark.storage.shuffleFraction  0.05
spark.default.parallelism      128

The master machine has approximately 240GB of RAM and each worker has about 120GB.
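For reference, the training call looks roughly like this (a simplified sketch; variable names and the tree/depth/bin values here are illustrative, not my exact ones):

from pyspark.mllib.tree import RandomForest

# Rough sketch of the call that triggers the error above.
model = RandomForest.trainClassifier(
    labeled_points,              # RDD[LabeledPoint], described below
    numClasses=2,                # assuming binary classification
    categoricalFeaturesInfo={},  # all features treated as continuous
    numTrees=100,                # illustrative value
    featureSubsetStrategy="auto",
    impurity="gini",
    maxDepth=10,
    maxBins=32)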
I load in a relatively tiny RDD of MLlib LabeledPoint objects, each holding a sparse vector. The RDD has a total size of roughly 45MB. Each sparse vector has a length of ~15 million, of which only about 3000 entries are non-zero.
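In case it helps, each training example is built along these lines (the indices and values below are made up; my real features come from my own preprocessing):

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Hypothetical single record: vector length ~15 million with only a few
# non-zero entries (my real data has ~3000 non-zeros per row).
dim = 15000000
indices = [17, 90210, 14999999]   # made-up non-zero positions, sorted ascending
values = [1.0, 0.5, 2.0]
example = LabeledPoint(1.0, SparseVector(dim, indices, values))

# The full training set is an RDD of such points, e.g.:
# labeled_points = sc.parallelize([example, ...])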