I am using the MLlib one (LogisticRegressionWithSGD) with PySpark, and I am
running only 10 iterations.

The MLlib version of logistic regression doesn't seem to use all the cores
on my machine.
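
One possible cause (just an assumption on my side, not something I have
verified on this dataset) is that the input file yields fewer partitions
than cores, so local[8] never has 8 tasks to run at once. A minimal sketch
that forces more parallelism when reading the data:

# Ask for at least as many input partitions as cores (8 matches local[8]);
# "d" and parsePoint are from the code sample below.
data = sc.textFile(d, minPartitions=8)
parsedData = data.map(parsePoint).cache()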

Regards,
Krishna



On Wed, Jun 4, 2014 at 6:47 PM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> Are you using the logistic_regression.py in examples/src/main/python or
> examples/src/main/python/mllib? The first one is an example of writing
> logistic regression by hand and won’t be as efficient as the MLlib one. I
> suggest trying the MLlib one.
>
> You may also want to check how many iterations it runs — by default I
> think it runs 100, which may be more than you need.
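>
> As a minimal illustration (assuming the Spark 1.0 MLlib API, where the
> default is iterations=100), the count can be passed explicitly:
>
> model = LogisticRegressionWithSGD.train(parsedData, iterations=10)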
>
> Matei
>
> On Jun 4, 2014, at 5:47 PM, Srikrishna S <srikrishna...@gmail.com> wrote:
>
> > Hi All,
> >
> > I am new to Spark and I am trying to run LogisticRegression (with SGD)
> > using MLlib on a beefy single machine with about 128 GB of RAM. The
> > dataset has about 80M rows with only 4 features, so it barely occupies
> > 2 GB on disk.
> >
> > I am running the code on all 8 cores with 20 GB of memory via:
> >
> > spark-submit --executor-memory 20G --master local[8] logistic_regression.py
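> >
> > One caveat, as an assumption about local mode worth verifying: with
> > --master local[N] the executor runs inside the driver JVM, so
> > --driver-memory rather than --executor-memory controls the usable heap.
> > For example:
> >
> > spark-submit --driver-memory 20G --master local[8] logistic_regression.py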
> >
> > It seems to take about 3.5 hours without caching and over 5 hours with
> > caching.
> >
> > What is the recommended way to use Spark on a beefy single machine?
> >
> > Any suggestions will help!
> >
> > Regards,
> > Krishna
> >
> >
> > Code sample:
> > ---------------------------------------------------------------------
> > import sys
> > import time
> >
> > from pyspark import SparkContext
> > from pyspark.mllib.regression import LabeledPoint
> > from pyspark.mllib.classification import LogisticRegressionWithSGD
> >
> > sc = SparkContext(appName="LogisticRegressionWithSGD")
> >
> > # Dataset
> > d = sys.argv[1]
> > data = sc.textFile(d)
> >
> > # Load and parse the data
> > # ---------------------------------------------------------------------
> > def parsePoint(line):
> >     values = [float(x) for x in line.split(',')]
> >     return LabeledPoint(values[0], values[1:])
> >
> > parsedData = data.map(parsePoint).cache()
> > results = {}
> >
> > # Spark
> > # ---------------------------------------------------------------------
> > start_time = time.time()
> >
> > # Build the model with 10 iterations of SGD
> > niters = 10
> > spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=niters)
> >
> > # Evaluate the model on the training data
> > labelsAndPreds = parsedData.map(lambda p: (p.label, spark_model.predict(p.features)))
> > trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
>
>
