Re: Standard Scaler taking 1.5hrs

2015-06-04 Thread Holden Karau
> p.
>
> Thanks,
>
> Piero
>
> *From:* DB Tsai [mailto:dbt...@dbtsai.com]
> *Sent:* Wednesday, June 03, 2015 10:33 PM
> *To:* Piero Cinquegrana
> *Cc:* user@spark.apache.org
> *Subject:* Re: Standard Scaler taking 1.5hrs

RE: Standard Scaler taking 1.5hrs

2015-06-04 Thread Piero Cinquegrana
each step.
Thanks,
Piero
From: DB Tsai [mailto:dbt...@dbtsai.com]
Sent: Wednesday, June 03, 2015 10:33 PM
To: Piero Cinquegrana
Cc: user@spark.apache.org
Subject: Re: Standard Scaler taking 1.5hrs
Can you do count() before fit to force materialize the RDD? I think something before fit is slow

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread DB Tsai
Can you call count() before fit to force the RDD to materialize? I think something before fit is slow.
On Wednesday, June 3, 2015, Piero Cinquegrana wrote:
> The fit part is very slow, transform not at all.
>
> The number of partitions was 210 vs number of executors 80.
>
> Spark 1.4 sounds great
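
The suggestion above can be sketched roughly as follows, assuming PySpark's MLlib API and an input RDD of MLlib vectors. The names `timed`, `time_count_then_fit`, and `vectors` are hypothetical and do not come from the thread. The idea is to cache the RDD, run count() so that everything upstream executes, and only then time fit on its own.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.time()
    result = action()
    return result, time.time() - start


def time_count_then_fit(vectors):
    """Time upstream materialization and StandardScaler.fit separately.

    `vectors` is assumed to be an RDD of MLlib vectors (hypothetical input).
    """
    # Imported lazily so the timing helper works without Spark installed.
    from pyspark.mllib.feature import StandardScaler

    # Cache first so the rows computed by count() can be reused by fit(),
    # assuming they fit in executor memory.
    vectors.cache()
    # count() is an action, so it forces all upstream transformations to run.
    n_rows, upstream_secs = timed(vectors.count)
    # Defaults: withMean=False, withStd=True.
    scaler = StandardScaler()
    model, fit_secs = timed(lambda: scaler.fit(vectors))
    return model, n_rows, upstream_secs, fit_secs
```

If `upstream_secs` turns out to dominate, the slow step is probably somewhere before fit rather than in StandardScaler itself.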

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread Piero Cinquegrana
The fit part is very slow, transform not at all. The number of partitions was 210 vs number of executors 80. Spark 1.4 sounds great, but as my company is using Qubole, we are dependent upon them to upgrade from version 1.3.1. Until that happens, can you think of any other reasons as to why it cou

Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread DB Tsai
Which part of StandardScaler is slow? Fit or transform? Fit has a shuffle, but a very small one, and transform doesn't shuffle at all. I guess you don't have enough partitions, so please repartition your input dataset to a number larger than the # of executors you have. In Spark 1.4's new ML pipeline a
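
A minimal sketch of the repartitioning advice above, assuming a PySpark RDD. The helper names and the `partitions_per_executor` multiplier are illustrative assumptions, not values given in the thread.

```python
def needs_repartition(num_partitions, num_executors):
    """True when the partition count does not exceed the executor count."""
    return num_partitions <= num_executors


def repartition_if_needed(rdd, num_executors, partitions_per_executor=3):
    """Return an RDD that has more partitions than executors.

    `partitions_per_executor` is an assumed multiplier chosen for illustration.
    """
    if needs_repartition(rdd.getNumPartitions(), num_executors):
        # repartition() triggers a full shuffle, so only do it when required.
        return rdd.repartition(num_executors * partitions_per_executor)
    return rdd
```

With the figures reported elsewhere in this thread (210 partitions, 80 executors), `needs_repartition(210, 80)` is false, so this helper would leave the RDD unchanged.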