Hello,

I am running RandomForest from MLlib on a dataset with very high-dimensional
data (~50k dimensions).

I get the following stack trace:

16/01/22 21:52:48 ERROR ApplicationMaster: User class threw exception: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2021)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
    at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
    at org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:624)
    at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235)
    at org.apache.spark.mllib.tree.RandomForest$.trainClassifier(RandomForest.scala:291)
    at org.apache.spark.mllib.tree.RandomForest.trainClassifier(RandomForest.scala)
    at com.miovision.cv.spark.CifarTrainer.main(CifarTrainer.java:108)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)



I have determined that the problem is that when the ClosureCleaner checks
that a closure is serializable (ensureSerializable), it serializes the
closure into an underlying Java byte array (via ByteArrayOutputStream), which
is limited to about 2 GB because array indices are signed 32-bit ints.

I believe that the closure has grown very large due to the high number of
features (dimensions), and the statistics that must be collected for them.
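
For reference, here is a rough sketch of how the serialized size of a suspect
object could be measured without the measurement itself hitting the 2 GB
byte-array limit (the SerializedSizeProbe class below is just an illustration
I put together, not anything from Spark): serialize into a stream that only
counts bytes instead of buffering them.

    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.OutputStream;
    import java.io.Serializable;

    public class SerializedSizeProbe {

        // OutputStream that discards bytes and only counts them, so very
        // large objects can be measured without allocating a giant array.
        static class CountingOutputStream extends OutputStream {
            long count = 0;
            @Override public void write(int b) { count++; }
            @Override public void write(byte[] b, int off, int len) { count += len; }
        }

        public static long serializedSize(Serializable obj) throws IOException {
            CountingOutputStream counter = new CountingOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(counter)) {
                oos.writeObject(obj);
            }
            return counter.count;
        }
    }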


Does anyone know of a way to make MLlib's RandomForest implementation limit
the size here so that the serialized closure does not exceed 2 GB, or
alternatively, is there a way to allow Spark to work with such a large
closure?
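
For concreteness, this is roughly the shape of the trainClassifier call I am
making (the parameter values below are placeholders, not my actual settings).
The only knobs I can see in the API that might shrink the per-feature
split/bin metadata are maxBins and featureSubsetStrategy, though I don't know
whether either is enough to keep the closure under 2 GB:

    import java.util.HashMap;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import org.apache.spark.mllib.tree.RandomForest;
    import org.apache.spark.mllib.tree.model.RandomForestModel;

    public class TrainSketch {
        public static RandomForestModel train(JavaRDD<LabeledPoint> data, int numClasses) {
            int numTrees = 100;                    // placeholder
            String featureSubsetStrategy = "sqrt"; // consider sqrt(numFeatures) per node
            String impurity = "gini";
            int maxDepth = 10;                     // placeholder
            int maxBins = 8;                       // smaller than the default of 32
            int seed = 12345;
            return RandomForest.trainClassifier(
                    data, numClasses, new HashMap<Integer, Integer>(),
                    numTrees, featureSubsetStrategy, impurity,
                    maxDepth, maxBins, seed);
        }
    }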


I am running this training on a very large cluster of very large machines,
so RAM is not the problem here. The problem is Java's 32-bit limit on array
sizes.


Thanks,

Joel
