Hi, the Random Forest implementation (1.2.1) is crashing reproducibly when I
increase the depth to 20. I generate random synthetic data (36 workers,
1,000,000 examples per worker, 30 features per example) as follows:
val data = sc.parallelize(1 to 36, 36).mapPartitionsWithIndex((i, _) => {
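(The snippet is cut off here in the archive. A generator along these lines would match the description above; the binary labels and uniform features are assumptions of mine, not the original code:)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical reconstruction: 36 partitions, 1,000,000 examples each, 30 random features.
val data = sc.parallelize(1 to 36, 36).mapPartitionsWithIndex((i, _) => {
  val rng = new scala.util.Random(i)  // per-partition seed so workers generate different data
  Iterator.fill(1000000)(
    LabeledPoint(rng.nextInt(2).toDouble, Vectors.dense(Array.fill(30)(rng.nextDouble()))))
}).cache()
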
Hi, I'm using MLlib to train a random forest. It works fine up to depth 15,
but if I use depth 20 I get a *java.lang.OutOfMemoryError: Requested array
size exceeds VM limit* on the driver, from the collectAsMap operation in
DecisionTree.scala, around line 642. It doesn't happen until a good hour into
the run.
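For reference, the failing call has the following shape, assuming data is the RDD[LabeledPoint] sketched above; the tree count and the other parameters below are illustrative guesses rather than the exact values used:

import org.apache.spark.mllib.tree.RandomForest

// Illustrative parameters only (positional arguments to RandomForest.trainClassifier).
val model = RandomForest.trainClassifier(data, 2, Map[Int, Int](),
  50,      // numTrees (guess)
  "auto",  // featureSubsetStrategy
  "gini",  // impurity
  20,      // maxDepth: 15 completes, 20 dies on the driver
  32,      // maxBins
  42)      // seed

A full binary tree of depth 20 has on the order of 2^20 (about a million) nodes, versus roughly 32,000 at depth 15, so the per-node statistics that collectAsMap brings back to the driver presumably grow far beyond what depth 15 required, eventually exceeding the VM's array-size limit.
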
If all RDD elements within a partition contain pointers to a single shared
object, Spark persists as expected when the RDD is small. However, if the
RDD has more than *200 elements*, then Spark reports requiring much more
memory than it actually does. This becomes a problem for large RDDs, as
Spark ...

Some more details: Adding a println to the function reveals that it is indeed
called only once. Furthermore, running:

rdd.map(_.s.hashCode).min == rdd.map(_.s.hashCode).max // returns true

...reveals that all 1000 elements do indeed point to the same object, and so
the data structure ...

Hi all,
I am trying to persist a Spark RDD in which the elements of each partition
all share access to a single, large object. However, this object seems to get
stored in memory several times. Reducing my problem down to the toy case of
just a single partition with only 200 elements:

val nElements ...
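(The code is cut off here in the archive. A hypothetical reconstruction of the toy case just described, a single partition whose 200 elements all reference one large object, could look like the following; the element type, its field names, and the size of the shared array are guesses:)

import org.apache.spark.storage.StorageLevel

val nElements = 200

// Hypothetical element type; the s field mirrors the hashCode check quoted earlier.
case class Elem(i: Int, s: Array[Double])

val rdd = sc.parallelize(1 to nElements, 1).mapPartitions { it =>
  val shared = Array.fill(10 * 1000 * 1000)(1.0)  // one ~80 MB array, built once per partition
  it.map(i => Elem(i, shared))
}.persist(StorageLevel.MEMORY_ONLY)
rdd.count()

// Only ~80 MB is actually allocated, but (per the behaviour described in this thread)
// the in-memory size Spark reports for the cached RDD comes out many times larger.
sc.getRDDStorageInfo.foreach(info =>
  println("RDD " + info.id + ": reported memSize = " + info.memSize + " bytes"))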