Thanks for the pointer, Peter. That change will indeed fix this bug, and it looks like it will make it into the upcoming 1.3.0 release.

@Evan, for reference, completeness and posterity:

> Just to be clear - you're currently calling .persist() before you pass data
> to LogisticRegressionWithLBFGS?

No. I added persist in GeneralizedLinearAlgorithm right before the `data` RDD
goes into the optimizer (LBFGS in our case). See here:
https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L204

> Also - can you give some parameters about the problem/cluster size you're
> solving this on? How much memory per node? How big are n and d, what is its
> sparsity (if any) and how many iterations are you running for? Is 0:45 the
> per-iteration time or total time for some number of iterations?

The vector is very sparse (a few hundred entries) but 2.5M in size. The
dataset is about 30M examples to learn from. 16x machines, 64GB memory,
32 cores.

Josh

On 17 February 2015 at 17:31, Peter Rudenko <petro.rude...@gmail.com> wrote:
> It's fixed today: https://github.com/apache/spark/pull/4593
>
> Thanks,
> Peter Rudenko
>
> On 2015-02-17 18:25, Evan R. Sparks wrote:
>>
>> Josh - thanks for the detailed write up - this seems a little funny to me.
>> I agree that with the current code path there is more work being done
>> than needs to be (e.g. the features are re-scaled at every iteration), but
>> the relatively costly process of fitting the StandardScaler should not be
>> re-done at each iteration. Instead, at each iteration, all points are
>> re-scaled according to the pre-computed standard deviations in the
>> StandardScalerModel, and then an intercept is appended.
>>
>> Just to be clear - you're currently calling .persist() before you pass
>> data to LogisticRegressionWithLBFGS?
>>
>> Also - can you give some parameters about the problem/cluster size you're
>> solving this on? How much memory per node? How big are n and d, what is
>> its sparsity (if any) and how many iterations are you running for?
>> Is 0:45 the per-iteration time or total time for some number of iterations?
>>
>> A useful test might be to call GeneralizedLinearAlgorithm with
>> useFeatureScaling set to false (and maybe also addIntercept set to false)
>> on persisted data, and see if you see the same performance wins. If that's
>> the case we've isolated the issue and can start profiling to see where all
>> the time is going.
>>
>> It would be great if you can open a JIRA.
>>
>> Thanks!
>>
>> On Tue, Feb 17, 2015 at 6:36 AM, Josh Devins <j...@soundcloud.com> wrote:
>>
>>> Cross-posting as I got no response on the users mailing list last
>>> week. Any response would be appreciated :)
>>>
>>> Josh
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Josh Devins <j...@soundcloud.com>
>>> Date: 9 February 2015 at 15:59
>>> Subject: [MLlib] Performance problem in GeneralizedLinearAlgorithm
>>> To: "u...@spark.apache.org" <u...@spark.apache.org>
>>>
>>>
>>> I've been looking into a performance problem when using
>>> LogisticRegressionWithLBFGS (and in turn GeneralizedLinearAlgorithm).
>>> Here's an outline of what I've figured out so far and it would be
>>> great to get some confirmation of the problem, some input on how
>>> widespread this problem might be and any ideas on a nice way to fix
>>> this.
>>>
>>> Context:
>>> - I will reference `branch-1.1` as we are currently on v1.1.1; however,
>>> this appears to still be a problem on `master`
>>> - The cluster is run on YARN, on bare-metal hardware (no VMs)
>>> - I've not filed a Jira issue yet but can do so
>>> - This problem affects all algorithms based on
>>> GeneralizedLinearAlgorithm (GLA) that use feature scaling (and less so
>>> when not, but still a problem) (e.g.
>>> LogisticRegressionWithLBFGS)
>>>
>>> Problem Outline:
>>> - Starting at GLA line 177
>>> (https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L177),
>>> a feature scaler is created using the `input` RDD
>>> - Refer next to line 186 which then maps over the `input` RDD and
>>> produces a new `data` RDD
>>> (https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L186)
>>> - If you are using feature scaling or adding intercepts, the user
>>> `input` RDD has been mapped over *after* the user has persisted it
>>> (hopefully) and *before* going into the (iterative) optimizer on line
>>> 204
>>> (https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L204)
>>> - Since the RDD `data` that is iterated over in the optimizer is
>>> unpersisted, when we are running the cost function in the optimizer
>>> (e.g. LBFGS --
>>> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L198),
>>> the map phase will actually first go back and rerun the feature
>>> scaling (map tasks on `input`) and then map with the cost function
>>> (two maps pipelined into one stage)
>>> - As a result, parts of the StandardScaler will actually be run again
>>> (perhaps only because the variable is `lazy`?)
>>> and this can be costly, see line 84
>>> (https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala#L84)
>>> - For small datasets and/or few iterations, this is not really a
>>> problem; however, we found that by adding a `data.persist()` right
>>> before running the optimizer, map iterations in the optimizer went
>>> from 5:30 down to 0:45
>>>
>>> I had a very tough time coming up with a nice way to describe my
>>> debugging sessions in an email so I hope this gets the main points
>>> across. Happy to clarify anything if necessary (also by live
>>> debugging/Skype/phone if that's helpful).
>>>
>>> Thanks,
>>>
>>> Josh
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
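
For posterity, the recomputation described in the Problem Outline can be illustrated with a toy model in plain Python. This is only a sketch and not Spark code: `LazyDataset`, `count_scaling_calls` and the `scale` function are hypothetical stand-ins for an RDD, the optimizer loop and the feature scaler. It assumes the only thing that matters is that a mapped, unpersisted dataset re-runs its parent map every time it is evaluated.

```python
class LazyDataset:
    """Toy stand-in for an RDD: it stores a recipe, not the results."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function returning a list
        self._persisted = False
        self._cache = None

    def map(self, fn):
        parent = self
        # The child only remembers how to derive itself from the parent.
        return LazyDataset(lambda: [fn(x) for x in parent.collect()])

    def persist(self):
        self._persisted = True
        return self

    def collect(self):
        if self._persisted:
            if self._cache is None:
                self._cache = self._compute()   # computed once, then reused
            return self._cache
        return self._compute()                  # recomputed on every call


def count_scaling_calls(persist_data, iterations=10):
    """Count how often the (hypothetical) scaler runs across all iterations."""
    calls = {"scale": 0}

    def scale(x):
        calls["scale"] += 1
        return x / 2.0

    # The user persists `input`, as in the thread.
    input_ds = LazyDataset(lambda: [1.0, 2.0, 3.0, 4.0]).persist()
    # The algorithm then maps over it, producing a new, unpersisted `data`.
    data = input_ds.map(scale)
    if persist_data:
        data.persist()

    for _ in range(iterations):
        # Stands in for one cost-function pass inside the optimizer.
        sum(data.map(lambda x: x * x).collect())
    return calls["scale"]


without_persist = count_scaling_calls(persist_data=False)
with_persist = count_scaling_calls(persist_data=True)
print(without_persist, with_persist)
```

In this toy model the scaling function runs once per element per iteration when `data` is not persisted, and once per element in total when it is. That mirrors the shape of the speed-up reported above, though it says nothing about real Spark timings.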