Does L-BFGS work with Spark 1.0? (See the code sample below.) Eventually I would like to have L-BFGS working, but I was facing an issue where 10 passes over the data took forever. I ran Spark in standalone mode and the performance is much better!
Regards,
Krishna

------------------------------------------------------------------------

I am using http://spark.apache.org/docs/latest/mllib-optimization.html

scala> val model = new LogisticRegressionModel(
     |   Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - 1)),
     |   weightsWithIntercept(weightsWithIntercept.size - 1))
<console>:20: error: constructor LogisticRegressionModel in class LogisticRegressionModel cannot be accessed in class $iwC
       val model = new LogisticRegressionModel(

Based on the documentation, it would seem that LogisticRegressionModel doesn't have a public constructor:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel

LogisticRegressionWithSGD *does* have a constructor:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD

On Wed, Jun 4, 2014 at 11:33 PM, DB Tsai <dbt...@stanford.edu> wrote:
> Hi Krishna,
>
> Also, the default optimizer with SGD converges really slowly. If you are
> willing to write Scala code, there is a full working example of training
> logistic regression with L-BFGS (a quasi-Newton method) in Scala. It
> converges way faster than SGD.
>
> See http://spark.apache.org/docs/latest/mllib-optimization.html for details.
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Wed, Jun 4, 2014 at 7:56 PM, Xiangrui Meng <men...@gmail.com> wrote:
> > Hi Krishna,
> >
> > Specifying executor memory in local mode has no effect, because all of
> > the threads run inside the same JVM. You can either try
> > --driver-memory 60g or start a standalone server.
> >
> > Best,
> > Xiangrui
> >
> > On Wed, Jun 4, 2014 at 7:28 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >> 80M by 4 should be about 2.5 GB uncompressed. 10 iterations shouldn't
> >> take that long, even on a single executor. Besides what Matei
> >> suggested, could you also verify the executor memory at
> >> http://localhost:4040 in the Executors tab? It is very likely the
> >> executors do not have enough memory. In that case, caching may be
> >> slower than reading directly from disk. -Xiangrui
> >>
> >> On Wed, Jun 4, 2014 at 7:06 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >>> Ah, is the file gzipped by any chance? We can’t decompress gzipped files
> >>> in parallel, so they get processed by a single task.
> >>>
> >>> It may also be worth looking at the application UI (http://localhost:4040)
> >>> to see 1) whether all the data fits in memory in the Storage tab (maybe it
> >>> somehow becomes larger, though it seems unlikely that it would exceed 20 GB)
> >>> and 2) how many parallel tasks run in each iteration.
> >>>
> >>> Matei
> >>>
> >>> On Jun 4, 2014, at 6:56 PM, Srikrishna S <srikrishna...@gmail.com> wrote:
> >>>
> >>> I am using the MLlib one (LogisticRegressionWithSGD) with PySpark. I am
> >>> running only 10 iterations.
> >>>
> >>> The MLlib version of logistic regression doesn't seem to use all the
> >>> cores on my machine.
> >>>
> >>> Regards,
> >>> Krishna
> >>>
> >>>
> >>> On Wed, Jun 4, 2014 at 6:47 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >>>>
> >>>> Are you using the logistic_regression.py in examples/src/main/python or
> >>>> examples/src/main/python/mllib? The first one is an example of writing
> >>>> logistic regression by hand and won’t be as efficient as the MLlib one. I
> >>>> suggest trying the MLlib one.
> >>>>
> >>>> You may also want to check how many iterations it runs — by default I
> >>>> think it runs 100, which may be more than you need.
> >>>>
> >>>> Matei
> >>>>
> >>>> On Jun 4, 2014, at 5:47 PM, Srikrishna S <srikrishna...@gmail.com> wrote:
> >>>>
> >>>> > Hi All,
> >>>> >
> >>>> > I am new to Spark and I am trying to run LogisticRegression (with SGD)
> >>>> > using MLlib on a beefy single machine with about 128 GB RAM. The dataset
> >>>> > has about 80M rows with only 4 features, so it barely occupies 2 GB on disk.
> >>>> >
> >>>> > I am running the code using all 8 cores with 20G memory using
> >>>> > spark-submit --executor-memory 20G --master local[8] logistic_regression.py
> >>>> >
> >>>> > It seems to take about 3.5 hours without caching and over 5 hours with caching.
> >>>> >
> >>>> > What is the recommended use for Spark on a beefy single machine?
> >>>> >
> >>>> > Any suggestions will help!
> >>>> >
> >>>> > Regards,
> >>>> > Krishna
> >>>> >
> >>>> >
> >>>> > Code sample:
> >>>> > ---------------------------------------------------------------------------------------------------------------------
> >>>> > # Dataset
> >>>> > d = sys.argv[1]
> >>>> > data = sc.textFile(d)
> >>>> >
> >>>> > # Load and parse the data
> >>>> > # ----------------------------------------------------------------------------------------------------------
> >>>> > def parsePoint(line):
> >>>> >     values = [float(x) for x in line.split(',')]
> >>>> >     return LabeledPoint(values[0], values[1:])
> >>>> > _parsedData = data.map(parsePoint)
> >>>> > parsedData = _parsedData.cache()
> >>>> > results = {}
> >>>> >
> >>>> > # Spark
> >>>> > # ----------------------------------------------------------------------------------------------------------
> >>>> > start_time = time.time()
> >>>> > # Build the model
> >>>> > niters = 10
> >>>> > spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=niters)
> >>>> >
> >>>> > # Evaluate the model on training data
> >>>> > labelsAndPreds = parsedData.map(lambda p: (p.label, spark_model.predict(p.features)))
> >>>> > trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
> >>>> >
> >>>
> >>>
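
For reference, the L-BFGS recipe DB Tsai points to can be adapted to the CSV data described above. The following is a minimal Scala sketch for the spark-shell, assuming the Spark 1.0 MLlib optimization API (LBFGS.runLBFGS with LogisticGradient and SquaredL2Updater, as in the mllib-optimization guide); the input path and the hyperparameter values are placeholders only:

------------------------------------------------------------------------
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.regression.LabeledPoint

// Parse the CSV input (label first, then the features), mirroring parsePoint above.
// "data.csv" is a placeholder path; sc is the spark-shell's SparkContext.
val points = sc.textFile("data.csv").map { line =>
  val values = line.split(',').map(_.toDouble)
  LabeledPoint(values(0), Vectors.dense(values.drop(1)))
}.cache()

val numFeatures = points.first().features.size

// LBFGS.runLBFGS takes (label, features) pairs; append a constant 1.0 column
// so the last learned weight acts as the intercept.
val training = points.map(p => (p.label, Vectors.dense(p.features.toArray :+ 1.0)))

// Illustrative hyperparameters, along the lines of the mllib-optimization guide.
val numCorrections = 10
val convergenceTol = 1e-4
val maxNumIterations = 20
val regParam = 0.1
val initialWeightsWithIntercept = Vectors.dense(new Array[Double](numFeatures + 1))

val (weightsWithIntercept, loss) = LBFGS.runLBFGS(
  training,
  new LogisticGradient(),
  new SquaredL2Updater(),
  numCorrections,
  convergenceTol,
  maxNumIterations,
  regParam,
  initialWeightsWithIntercept)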
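
On the constructor error above: rather than calling new LogisticRegressionModel(...), which the Spark 1.0 shell rejects, the same predictions can be computed directly from the learned weights, since the model's default behaviour is to threshold the logistic function at 0.5. A sketch continuing from the snippet above (weightsWithIntercept and points are carried over; this mirrors, but does not use, the MLlib model class):

------------------------------------------------------------------------
// Split the learned vector back into weights and intercept.
val weights = weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - 1)
val intercept = weightsWithIntercept(weightsWithIntercept.size - 1)

// Score each point with the logistic function and threshold at 0.5,
// instead of constructing a LogisticRegressionModel.
val labelsAndPreds = points.map { p =>
  val margin = p.features.toArray.zip(weights).map { case (x, w) => x * w }.sum + intercept
  val prediction = if (1.0 / (1.0 + math.exp(-margin)) >= 0.5) 1.0 else 0.0
  (p.label, prediction)
}

val trainErr = labelsAndPreds.filter { case (label, pred) => label != pred }.count().toDouble /
  points.count()

With the same 0.5 threshold this should match what LogisticRegressionModel.predict would return, so trainErr is comparable to the SGD run in the original code sample.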
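
On Xiangrui's and Matei's points: in local mode the memory setting that matters is the driver's (e.g. spark-submit --driver-memory 60g --master local[8] ...), and a single gzipped input file arrives as one partition, so an explicit repartition is needed before all 8 cores can be used. A small sketch, assuming the spark-shell's sc and a hypothetical data.csv.gz:

------------------------------------------------------------------------
// A .gz text file is not splittable, so sc.textFile returns a single partition
// and each stage runs as a single task.
val raw = sc.textFile("data.csv.gz")      // hypothetical path
println(raw.partitions.length)            // 1 for a single gzipped file

// Spread the records across the 8 local cores before training.
val spread = raw.repartition(8)
println(spread.partitions.length)         // 8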