I will try both and get back to you soon!
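
For reference, here is roughly what I plan to run for each option (the 60g figure and the spark://localhost:7077 master URL are just my guesses for this box, not anything I have verified yet):

# Option 1: stay in local mode, but give the memory to the driver,
# since all the threads run inside that one JVM
spark-submit --driver-memory 60g --master local[8] logistic_regression.py

# Option 2: start a standalone master and worker on this machine, then
# point spark-submit at it so that executor memory actually applies
spark-submit --master spark://localhost:7077 --executor-memory 60g logistic_regression.py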

Thanks for all your help!

Regards,
Krishna


On Wed, Jun 4, 2014 at 7:56 PM, Xiangrui Meng <men...@gmail.com> wrote:

> Hi Krishna,
>
> Specifying executor memory in local mode has no effect, because all of
> the threads run inside the same JVM. You can either try
> --driver-memory 60g or start a standalone server.
>
> Best,
> Xiangrui
>
> On Wed, Jun 4, 2014 at 7:28 PM, Xiangrui Meng <men...@gmail.com> wrote:
> > 80M by 4 should be about 2.5GB uncompressed. 10 iterations shouldn't
> > take that long, even on a single executor. Besides what Matei
> > suggested, could you also verify the executor memory in the Executors tab
> > at http://localhost:4040? It is very likely the
> > executors do not have enough memory. In that case, caching may be
> > slower than reading directly from disk. -Xiangrui
> >
> > On Wed, Jun 4, 2014 at 7:06 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >> Ah, is the file gzipped by any chance? We can’t decompress gzipped files
> >> in parallel so they get processed by a single task.
> >>
> >> It may also be worth looking at the application UI (http://localhost:4040)
> >> to see 1) whether all the data fits in memory in the Storage tab (maybe it
> >> somehow becomes larger, though it seems unlikely that it would exceed 20 GB)
> >> and 2) how many parallel tasks run in each iteration.
> >>
> >> Matei
> >>
> >> On Jun 4, 2014, at 6:56 PM, Srikrishna S <srikrishna...@gmail.com> wrote:
> >>
> >> I am using the MLlib one (LogisticRegressionWithSGD) with PySpark. I am
> >> running only 10 iterations.
> >>
> >> The MLlib version of logistic regression doesn't seem to use all the cores
> >> on my machine.
> >>
> >> Regards,
> >> Krishna
> >>
> >>
> >>
> >> On Wed, Jun 4, 2014 at 6:47 PM, Matei Zaharia <matei.zaha...@gmail.com>
> >> wrote:
> >>>
> >>> Are you using the logistic_regression.py in examples/src/main/python or
> >>> examples/src/main/python/mllib? The first one is an example of writing
> >>> logistic regression by hand and won’t be as efficient as the MLlib one. I
> >>> suggest trying the MLlib one.
> >>>
> >>> You may also want to check how many iterations it runs — by default I
> >>> think it runs 100, which may be more than you need.
> >>>
> >>> Matei
> >>>
> >>> On Jun 4, 2014, at 5:47 PM, Srikrishna S <srikrishna...@gmail.com> wrote:
> >>>
> >>> > Hi All,
> >>> >
> >>> > I am new to Spark and I am trying to run LogisticRegression (with SGD)
> >>> > using MLlib on a beefy single machine with about 128 GB RAM. The dataset
> >>> > has about 80M rows with only 4 features, so it barely occupies 2 GB on
> >>> > disk.
> >>> >
> >>> > I am running the code using all 8 cores with 20G memory using
> >>> > spark-submit --executor-memory 20G --master local[8]
> >>> > logistic_regression.py
> >>> >
> >>> > It seems to take about 3.5 hours without caching and over 5 hours with
> >>> > caching.
> >>> >
> >>> > What is the recommended use for Spark on a beefy single machine?
> >>> >
> >>> > Any suggestions will help!
> >>> >
> >>> > Regards,
> >>> > Krishna
> >>> >
> >>> >
> >>> > Code sample:
> >>> >
> >>> > ----------------------------------------------------------------------
> >>> > # Dataset
> >>> > d = sys.argv[1]
> >>> > data = sc.textFile(d)
> >>> >
> >>> > # Load and parse the data
> >>> > # --------------------------------------------------------------------
> >>> > def parsePoint(line):
> >>> >     values = [float(x) for x in line.split(',')]
> >>> >     return LabeledPoint(values[0], values[1:])
> >>> > _parsedData = data.map(parsePoint)
> >>> > parsedData = _parsedData.cache()
> >>> > results = {}
> >>> >
> >>> > # Spark
> >>> > # --------------------------------------------------------------------
> >>> > start_time = time.time()
> >>> > # Build the gl_model
> >>> > niters = 10
> >>> > spark_model = LogisticRegressionWithSGD.train(parsedData,
> >>> > iterations=niters)
> >>> >
> >>> > # Evaluate the gl_model on training data
> >>> > labelsAndPreds = parsedData.map(lambda p: (p.label,
> >>> > spark_model.predict(p.features)))
> >>> > trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() /
> >>> > float(parsedData.count())
> >>> >
> >>>
> >>
> >>
>
