Hi Krishna,

It should work, and we use it in production with great success. However, the constructor of LogisticRegressionModel is private[mllib], so you have to compile your own code with its package declared under org.apache.spark.mllib instead of pasting it into the Scala console.
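For example, here is a minimal sketch of that package trick (the object and method names are just illustrative, not an official API):

package org.apache.spark.mllib

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Because this object is compiled under org.apache.spark.mllib, it is
// allowed to call the private[mllib] constructor.
object LogisticRegressionModelBuilder {
  // weightsWithIntercept: feature weights with the intercept appended as the
  // last element, as returned by the L-BFGS example in the optimization guide.
  def fromWeights(weightsWithIntercept: Vector): LogisticRegressionModel = {
    val arr = weightsWithIntercept.toArray
    new LogisticRegressionModel(
      Vectors.dense(arr.slice(0, arr.length - 1)), // feature weights
      arr(arr.length - 1))                         // intercept
  }
}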
Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, Jun 4, 2014 at 11:47 PM, Srikrishna S <srikrishna...@gmail.com> wrote:
> Does L-BFGS work with Spark 1.0? (See the code sample below.)
>
> Eventually, I would like to have L-BFGS working, but I was facing an issue
> where 10 passes over the data were taking forever. I ran Spark in standalone
> mode and the performance is much better!
>
> Regards,
> Krishna
>
> -----------------------------------------------------------------------
>
> I am using http://spark.apache.org/docs/latest/mllib-optimization.html
>
> scala> val model = new LogisticRegressionModel(
>      |   Vectors.dense(weightsWithIntercept.toArray.slice(0,
>      |     weightsWithIntercept.size - 1)),
>      |   weightsWithIntercept(weightsWithIntercept.size - 1))
>
> <console>:20: error: constructor LogisticRegressionModel in class
> LogisticRegressionModel cannot be accessed in class $iwC
>        val model = new LogisticRegressionModel(
>
> Based on the documentation, it would seem that LogisticRegressionModel
> doesn't have a public constructor:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel
>
> LogisticRegressionWithSGD *does* have a constructor:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD
>
>
>
> On Wed, Jun 4, 2014 at 11:33 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>
>> Hi Krishna,
>>
>> Also, the default optimizer with SGD converges really slowly. If you are
>> willing to write Scala code, there is a full working example for
>> training logistic regression with L-BFGS (a quasi-Newton method) in
>> Scala. It converges way faster than SGD.
>>
>> See http://spark.apache.org/docs/latest/mllib-optimization.html
>> for details.
>>
>> Sincerely,
>>
>> DB Tsai
>> -------------------------------------------------------
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Wed, Jun 4, 2014 at 7:56 PM, Xiangrui Meng <men...@gmail.com> wrote:
>> > Hi Krishna,
>> >
>> > Specifying executor memory in local mode has no effect, because all of
>> > the threads run inside the same JVM. You can either try
>> > --driver-memory 60g or start a standalone server.
>> >
>> > Best,
>> > Xiangrui
>> >
>> > On Wed, Jun 4, 2014 at 7:28 PM, Xiangrui Meng <men...@gmail.com> wrote:
>> >> 80M by 4 should be about 2.5GB uncompressed. 10 iterations shouldn't
>> >> take that long, even on a single executor. Besides what Matei
>> >> suggested, could you also verify the executor memory at
>> >> http://localhost:4040 in the Executors tab? It is very likely that
>> >> the executors do not have enough memory. In that case, caching may be
>> >> slower than reading directly from disk. -Xiangrui
>> >>
>> >> On Wed, Jun 4, 2014 at 7:06 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> >> wrote:
>> >>> Ah, is the file gzipped by any chance? We can't decompress gzipped
>> >>> files in parallel, so they get processed by a single task.
>> >>>
>> >>> It may also be worth looking at the application UI
>> >>> (http://localhost:4040) to see 1) whether all the data fits in
>> >>> memory in the Storage tab (maybe it somehow becomes larger, though
>> >>> it seems unlikely that it would exceed 20 GB) and 2) how many
>> >>> parallel tasks run in each iteration.
>> >>>
>> >>> Matei
>> >>>
>> >>> On Jun 4, 2014, at 6:56 PM, Srikrishna S <srikrishna...@gmail.com>
>> >>> wrote:
>> >>>
>> >>> I am using the MLlib one (LogisticRegressionWithSGD) with PySpark.
>> >>> I am running only 10 iterations.
>> >>>
>> >>> The MLlib version of logistic regression doesn't seem to use all
>> >>> the cores on my machine.
>> >>>
>> >>> Regards,
>> >>> Krishna
>> >>>
>> >>>
>> >>> On Wed, Jun 4, 2014 at 6:47 PM, Matei Zaharia
>> >>> <matei.zaha...@gmail.com> wrote:
>> >>>>
>> >>>> Are you using the logistic_regression.py in examples/src/main/python
>> >>>> or examples/src/main/python/mllib? The first one is an example of
>> >>>> writing logistic regression by hand and won't be as efficient as
>> >>>> the MLlib one. I suggest trying the MLlib one.
>> >>>>
>> >>>> You may also want to check how many iterations it runs -- by default
>> >>>> I think it runs 100, which may be more than you need.
>> >>>>
>> >>>> Matei
>> >>>>
>> >>>> On Jun 4, 2014, at 5:47 PM, Srikrishna S <srikrishna...@gmail.com>
>> >>>> wrote:
>> >>>>
>> >>>> > Hi All,
>> >>>> >
>> >>>> > I am new to Spark and I am trying to run LogisticRegression (with
>> >>>> > SGD) using MLlib on a beefy single machine with about 128GB of
>> >>>> > RAM. The dataset has about 80M rows with only 4 features, so it
>> >>>> > barely occupies 2GB on disk.
>> >>>> >
>> >>>> > I am running the code on all 8 cores with 20G of memory using
>> >>>> > spark-submit --executor-memory 20G --master local[8]
>> >>>> > logistic_regression.py
>> >>>> >
>> >>>> > It seems to take about 3.5 hours without caching and over 5 hours
>> >>>> > with caching.
>> >>>> >
>> >>>> > What is the recommended use for Spark on a beefy single machine?
>> >>>> >
>> >>>> > Any suggestions will help!
>> >>>> >
>> >>>> > Regards,
>> >>>> > Krishna
>> >>>> >
>> >>>> >
>> >>>> > Code sample:
>> >>>> > -----------------------------------------------------------------
>> >>>> > import sys
>> >>>> > import time
>> >>>> >
>> >>>> > from pyspark import SparkContext
>> >>>> > from pyspark.mllib.classification import LogisticRegressionWithSGD
>> >>>> > from pyspark.mllib.regression import LabeledPoint
>> >>>> >
>> >>>> > sc = SparkContext(appName="LogisticRegressionWithSGD")
>> >>>> >
>> >>>> > # Dataset
>> >>>> > d = sys.argv[1]
>> >>>> > data = sc.textFile(d)
>> >>>> >
>> >>>> > # Load and parse the data
>> >>>> > # ---------------------------------------------------------------
>> >>>> > def parsePoint(line):
>> >>>> >     values = [float(x) for x in line.split(',')]
>> >>>> >     return LabeledPoint(values[0], values[1:])
>> >>>> >
>> >>>> > parsedData = data.map(parsePoint).cache()
>> >>>> > results = {}
>> >>>> >
>> >>>> > # Spark
>> >>>> > # ---------------------------------------------------------------
>> >>>> > start_time = time.time()
>> >>>> >
>> >>>> > # Build the model with 10 iterations of SGD
>> >>>> > niters = 10
>> >>>> > spark_model = LogisticRegressionWithSGD.train(parsedData,
>> >>>> >                                               iterations=niters)
>> >>>> >
>> >>>> > # Evaluate the model on the training data
>> >>>> > labelsAndPreds = parsedData.map(lambda p: (p.label,
>> >>>> >                                 spark_model.predict(p.features)))
>> >>>> > trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() /
>> >>>> >            float(parsedData.count())
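For reference, the L-BFGS training flow described in the optimization guide
looks roughly like this in Scala (a sketch adapted from
http://spark.apache.org/docs/latest/mllib-optimization.html; it assumes an
existing SparkContext named sc and a LIBSVM-format input file, and the
parameter values are the guide's examples, not tuned ones):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val numFeatures = data.take(1)(0).features.size

// Append a bias term of 1.0 to each feature vector so the intercept is learned.
val training = data.map(x => (x.label, MLUtils.appendBias(x.features))).cache()

// L-BFGS parameters, as in the guide.
val numCorrections = 10
val convergenceTol = 1e-4
val maxNumIterations = 20
val regParam = 0.1
val initialWeightsWithIntercept = Vectors.dense(new Array[Double](numFeatures + 1))

// Returns the learned weights (with the intercept as the last element)
// and the loss at each iteration.
val (weightsWithIntercept, loss) = LBFGS.runLBFGS(
  training,
  new LogisticGradient(),
  new SquaredL2Updater(),
  numCorrections,
  convergenceTol,
  maxNumIterations,
  regParam,
  initialWeightsWithIntercept)

// weightsWithIntercept can then be turned into a LogisticRegressionModel
// using the package-placement trick sketched at the top of this thread.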