Hello,

I am running RandomForest from MLlib on a dataset with very high-dimensional data (~50k dimensions).
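For context, the training is invoked through the standard MLlib Java API. The sketch below is representative rather than my exact code; the class name and parameter values are illustrative, and "features" stands in for my ~50k-dimensional LabeledPoint RDD:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import org.apache.spark.mllib.tree.RandomForest;
    import org.apache.spark.mllib.tree.model.RandomForestModel;

    // Representative sketch of the training call (values are illustrative).
    public class TrainSketch {
      public static RandomForestModel train(JavaRDD<LabeledPoint> features) {
        int numClasses = 10;                                    // illustrative
        Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>(); // all features continuous
        int numTrees = 100;                                     // illustrative
        String featureSubsetStrategy = "auto";                  // sqrt(numFeatures) for classification
        String impurity = "gini";
        int maxDepth = 10;                                      // illustrative
        int maxBins = 32;                                       // bins per continuous feature
        int seed = 12345;

        return RandomForest.trainClassifier(features, numClasses, categoricalFeaturesInfo,
            numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed);
      }
    }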
I get the following stack trace:

16/01/22 21:52:48 ERROR ApplicationMaster: User class threw exception: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2021)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
    at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
    at org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:624)
    at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235)
    at org.apache.spark.mllib.tree.RandomForest$.trainClassifier(RandomForest.scala:291)
    at org.apache.spark.mllib.tree.RandomForest.trainClassifier(RandomForest.scala)
    at com.miovision.cv.spark.CifarTrainer.main(CifarTrainer.java:108)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)

I have determined that the problem is in the ClosureCleaner's serializability check (ensureSerializable): it serializes the closure into an underlying Java byte buffer, which is limited to about 2 GB because array indices are signed 32-bit ints. I believe the closure has grown very large because of the high number of features (dimensions) and the statistics that must be collected for them.

Does anyone know of a way to make MLlib's RandomForest implementation limit the size here so that serialized closures stay under 2 GB, or, alternatively, is there a way to let Spark work with such a large closure? I am running this training on a very large cluster of very large machines, so RAM is not the problem here; the problem is Java's 32-bit limit on array sizes.

Thanks,
Joel
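P.S. For anyone who wants to see the limit in isolation from Spark: as far as I can tell from the stack trace, JavaSerializerInstance.serialize writes the whole closure through an ObjectOutputStream into a single ByteArrayOutputStream, so the sketch below (illustrative only, and it needs a large heap to run, e.g. -Xmx10g) reproduces the same failure mode with plain JDK classes:

    import java.io.ByteArrayOutputStream;
    import java.io.ObjectOutputStream;

    // Illustration of the ~2 GB ceiling, independent of Spark: the stream's
    // backing store is a single int-indexed byte[], so once the serialized
    // data approaches Integer.MAX_VALUE bytes, the grow()/hugeCapacity() path
    // throws java.lang.OutOfMemoryError no matter how much heap is free.
    public class TwoGbSerializationLimit {
      public static void main(String[] args) throws Exception {
        long[] big = new long[300_000_000];       // ~2.4 GB once serialized

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(buffer);
        out.writeObject(big);                     // fails inside ByteArrayOutputStream.grow()
        out.close();                              // never reached
      }
    }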