The feature dimension is 800k.

Yes, I believe driver memory is likely the problem, since the job doesn't 
crash until the very last part of the tree aggregation. 
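
My rough back-of-the-envelope for why the driver is the pressure point (this 
assumes the aggregator keeps a handful of dense double arrays of length equal 
to the feature dimension -- an assumption about its internals on my part):

    # rough, assumed sizing of one partial summary merged during the aggregation
    n_features = 800 * 1000
    bytes_per_array = n_features * 8   # one dense array of doubles, ~6.4 MB
    arrays_per_summary = 7             # assumed count: mean, m2n, m2, l1, nnz, max, min
    mb_per_partial = bytes_per_array * arrays_per_summary / 1e6
    print("~%.0f MB per partial summary" % mb_per_partial)   # ~45 MB each

so merging a stack of these at the final step, plus the serialization buffers 
involved, could plausibly blow through a small driver heap.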

I'm running it via pyspark through YARN -- I have to run in client mode, so I 
can't set spark.driver.memory. I've tried setting the spark.yarn.am.memory 
and overhead parameters, but they don't seem to have any effect. 
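
For reference, here's roughly how I'm launching it and what the call looks 
like (a sketch, not my exact setup -- the data path and the AM memory numbers 
below are placeholders):

    # pyspark shell launched in yarn-client mode, roughly:
    #
    #   pyspark --master yarn-client \
    #           --num-executors 10 --executor-cores 2 --executor-memory 9g \
    #           --conf spark.yarn.executor.memoryOverhead=3072 \
    #           --conf spark.yarn.am.memory=4g \
    #           --conf spark.yarn.am.memoryOverhead=1024

    from pyspark.mllib.feature import StandardScaler

    # vectors: RDD of SparseVector with ~800k features ("sc" comes from the
    # pyspark shell; the path is a placeholder)
    vectors = sc.pickleFile("hdfs:///path/to/sparse_vectors")

    # withMean=False keeps the vectors sparse; this fit() is what dies with
    # Java heap space / Direct buffer memory errors
    scaler = StandardScaler(withMean=False, withStd=True).fit(vectors)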

Thanks,

Rok

On Apr 23, 2015, at 7:47 AM, Xiangrui Meng <men...@gmail.com> wrote:

> What is the feature dimension? Did you set the driver memory? -Xiangrui
> 
> On Tue, Apr 21, 2015 at 6:59 AM, rok <rokros...@gmail.com> wrote:
>> I'm trying to use the StandardScaler in pyspark on a relatively small (a few
>> hundred MB) dataset of sparse vectors with 800k features. The fit method of
>> StandardScaler crashes with Java heap space or Direct buffer memory errors.
>> There should be plenty of memory around -- 10 executors with 2 cores each
>> and 8 GB per core. I'm giving the executors 9g of memory and have also tried
>> a large memory overhead (3g), thinking it might be the array creation in the
>> aggregators that's causing issues.
>> 
>> The bizarre thing is that this isn't always reproducible -- sometimes it
>> actually works without problems. Should I be configuring the executors
>> differently?
>> 
>> Thanks,
>> 
>> Rok
>> 

