The feature dimension is 800k. Yes, I believe driver memory is likely the problem, since it doesn't crash until the very last part of the tree aggregation.
I'm running it via pyspark through YARN -- I have to run in client mode, so I can't set spark.driver.memory. I've tried setting the spark.yarn.am.memory and overhead parameters, but they don't seem to have any effect.

Thanks,
Rok

On Apr 23, 2015, at 7:47 AM, Xiangrui Meng <men...@gmail.com> wrote:

> What is the feature dimension? Did you set the driver memory? -Xiangrui
>
> On Tue, Apr 21, 2015 at 6:59 AM, rok <rokros...@gmail.com> wrote:
>> I'm trying to use the StandardScaler in pyspark on a relatively small (a few
>> hundred MB) dataset of sparse vectors with 800k features. The fit method of
>> StandardScaler crashes with Java heap space or Direct buffer memory errors.
>> There should be plenty of memory around -- 10 executors with 2 cores each
>> and 8 GB per core. I'm giving the executors 9 GB of memory and have also
>> tried lots of overhead (3 GB), thinking it might be the array creation in
>> the aggregators that's causing issues.
>>
>> The bizarre thing is that this isn't always reproducible -- sometimes it
>> actually works without problems. Should I be setting up the executors
>> differently?
>>
>> Thanks,
>>
>> Rok
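For reference: in yarn-client mode the driver JVM is already running before any SparkConf set in user code is read, so spark.driver.memory only takes effect when supplied at launch time. A minimal sketch of how that might look (the values are illustrative, not taken from this thread):

    pyspark --master yarn-client \
        --driver-memory 8g \
        --conf spark.yarn.am.memory=2g \
        --conf spark.yarn.am.memoryOverhead=512

or set spark.driver.memory in conf/spark-defaults.conf. Note that spark.yarn.am.memory sizes only the YARN application master, which in client mode is a separate process from the driver, so raising it would not be expected to help with a driver-side OOM.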
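And for anyone trying to reproduce the failure, a minimal sketch of the setup described above (pyspark.mllib API from the Spark 1.3 era; the data is synthetic and only meant to mirror the 800k-feature sparse shape):

    from pyspark import SparkContext
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.feature import StandardScaler

    sc = SparkContext(appName="scaler-oom-repro")

    num_features = 800000  # feature dimension reported in this thread

    # One non-zero entry per row keeps the example small; the real dataset
    # is a few hundred MB of sparse vectors.
    rows = sc.parallelize(range(100000)) \
             .map(lambda i: Vectors.sparse(num_features, [i % num_features], [1.0]))

    # fit() computes per-feature statistics with a tree aggregation; each
    # partial summary holds dense arrays sized by num_features, and the
    # combined result comes back to the driver -- which matches the
    # observation that the crash happens at the very end of the aggregation.
    scaler = StandardScaler(withMean=False, withStd=True).fit(rows)
    scaled = scaler.transform(rows)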