I am using the MLlib one (LogisticRegressionWithSGD) with PySpark, and I am running only 10 iterations. Even so, the MLlib version of logistic regression doesn't seem to use all of the cores on my machine.
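To check whether parallelism is the bottleneck, I am experimenting with repartitioning the parsed RDD before training. This is only a sketch of that idea; the partition count of 8 is my own guess to match local[8], not something from the MLlib docs:

    # Hypothetical tweak: force the parsed RDD into 8 partitions so each of
    # the local[8] cores gets a slice of the SGD gradient computation.
    parsedData = data.map(parsePoint).repartition(8).cache()

    # Train with 10 iterations instead of the default 100.
    spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=10)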
Regards,
Krishna

On Wed, Jun 4, 2014 at 6:47 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Are you using the logistic_regression.py in examples/src/main/python or
> examples/src/main/python/mllib? The first one is an example of writing
> logistic regression by hand and won’t be as efficient as the MLlib one. I
> suggest trying the MLlib one.
>
> You may also want to check how many iterations it runs — by default I
> think it runs 100, which may be more than you need.
>
> Matei
>
> On Jun 4, 2014, at 5:47 PM, Srikrishna S <srikrishna...@gmail.com> wrote:
>
> > Hi All,
> >
> > I am new to Spark and I am trying to run LogisticRegression (with SGD)
> > using MLlib on a beefy single machine with about 128GB RAM. The dataset
> > has about 80M rows with only 4 features, so it barely occupies 2 GB on disk.
> >
> > I am running the code using all 8 cores with 20G memory using
> >
> >     spark-submit --executor-memory 20G --master local[8] logistic_regression.py
> >
> > It seems to take about 3.5 hours without caching and over 5 hours with
> > caching.
> >
> > What is the recommended use for Spark on a beefy single machine?
> >
> > Any suggestions will help!
> >
> > Regards,
> > Krishna
> >
> > Code sample:
> > ----------------------------------------------------------------------
> > # Dataset
> > d = sys.argv[1]
> > data = sc.textFile(d)
> >
> > # Load and parse the data
> > # ----------------------------------------------------------------------
> > def parsePoint(line):
> >     values = [float(x) for x in line.split(',')]
> >     return LabeledPoint(values[0], values[1:])
> >
> > _parsedData = data.map(parsePoint)
> > parsedData = _parsedData.cache()
> > results = {}
> >
> > # Spark
> > # ----------------------------------------------------------------------
> > start_time = time.time()
> >
> > # Build the model
> > niters = 10
> > spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=niters)
> >
> > # Evaluate the model on training data
> > labelsAndPreds = parsedData.map(lambda p: (p.label,
> >                                            spark_model.predict(p.features)))
> > trainErr = (labelsAndPreds.filter(lambda (v, p): v != p).count()
> >             / float(parsedData.count()))
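P.S. For anyone who wants to reproduce this, below is the complete script I am submitting: the snippet quoted above plus the imports and SparkContext setup it elides. The appName string and the final print are my additions, and I rewrote the error-counting lambda without tuple unpacking so it also runs under Python 3.

    import sys
    import time

    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="LogisticRegressionWithSGD")

    # Load and parse the data: label in column 0, the four features after it.
    def parsePoint(line):
        values = [float(x) for x in line.split(',')]
        return LabeledPoint(values[0], values[1:])

    data = sc.textFile(sys.argv[1])
    parsedData = data.map(parsePoint).cache()

    # Build the model with 10 iterations instead of the default 100.
    start_time = time.time()
    spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=10)

    # Evaluate the model on the training data.
    labelsAndPreds = parsedData.map(lambda p: (p.label,
                                               spark_model.predict(p.features)))
    trainErr = (labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count()
                / float(parsedData.count()))
    print("Training error: %f (took %.1f s)" % (trainErr, time.time() - start_time))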