Hi, the Random Forest implementation (1.2.1) is crashing reproducibly when I
increase the depth to 20. I generate random synthetic data (36 workers,
1,000,000 examples per worker, 30 features per example) as follows:
val data = sc.parallelize(1 to 36, 36).mapPartitionsWithIndex((i, _) => {
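(The snippet is cut off here in the archive. A generator along these lines would match the description above; the binary labels and uniform features are assumptions of mine, not the original code:)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical reconstruction: 36 partitions, 1,000,000 examples each, 30 random features.
val data = sc.parallelize(1 to 36, 36).mapPartitionsWithIndex((i, _) => {
  val rng = new scala.util.Random(i)  // per-partition seed so workers generate different data
  Iterator.fill(1000000)(
    LabeledPoint(rng.nextInt(2).toDouble, Vectors.dense(Array.fill(30)(rng.nextDouble()))))
}).cache()
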
Hi, I'm using MLlib to train a random forest. It works fine up to depth 15,
but if I use depth 20 I get a *java.lang.OutOfMemoryError: Requested array
size exceeds VM limit* on the driver, from the collectAsMap operation in
DecisionTree.scala, around line 642. It doesn't happen until a good hour into
the run.
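For reference, the failing call has the following shape, assuming data is the RDD[LabeledPoint] sketched above; the tree count and the other parameters below are illustrative guesses rather than the exact values used:

import org.apache.spark.mllib.tree.RandomForest

// Illustrative parameters only (positional arguments to RandomForest.trainClassifier).
val model = RandomForest.trainClassifier(data, 2, Map[Int, Int](),
  50,      // numTrees (guess)
  "auto",  // featureSubsetStrategy
  "gini",  // impurity
  20,      // maxDepth: 15 completes, 20 dies on the driver
  32,      // maxBins
  42)      // seed

A full binary tree of depth 20 has on the order of 2^20 (about a million) nodes, versus roughly 32,000 at depth 15, so the per-node statistics that collectAsMap brings back to the driver presumably grow far beyond what depth 15 required, eventually exceeding the VM's array-size limit.
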
If all RDD elements within a partition contain pointers to a single shared
object, Spark persists as expected when the RDD is small. However, if the
RDD has more than *200 elements*, then Spark reports requiring much more
memory than it actually does. This becomes a problem for large RDDs, as
Spark ...

Some more details: Adding a println to the function reveals that it is indeed
called only once. Furthermore, running:

rdd.map(_.s.hashCode).min == rdd.map(_.s.hashCode).max // returns true

...reveals that all 1000 elements do indeed point to the same object, and so
the data structure ...

Hi all,
I am trying to persist a Spark RDD in which the elements of each partition
all share access to a single, large object. However, this object seems to get
stored in memory several times. Reducing my problem down to the toy case of
just a single partition with only 200 elements:

val nElements ...
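(The code is cut off here in the archive. A hypothetical reconstruction of the toy case just described, a single partition whose 200 elements all reference one large object, could look like the following; the element type, its field names, and the size of the shared array are guesses:)

import org.apache.spark.storage.StorageLevel

val nElements = 200

// Hypothetical element type; the s field mirrors the hashCode check quoted earlier.
case class Elem(i: Int, s: Array[Double])

val rdd = sc.parallelize(1 to nElements, 1).mapPartitions { it =>
  val shared = Array.fill(10 * 1000 * 1000)(1.0)  // one ~80 MB array, built once per partition
  it.map(i => Elem(i, shared))
}.persist(StorageLevel.MEMORY_ONLY)
rdd.count()

// Only ~80 MB is actually allocated, but (per the behaviour described in this thread)
// the in-memory size Spark reports for the cached RDD comes out many times larger.
sc.getRDDStorageInfo.foreach(info =>
  println("RDD " + info.id + ": reported memSize = " + info.memSize + " bytes"))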