Hi all,

I am new to Spark and I am trying to run LogisticRegression (with SGD)
using MLlib on a beefy single machine with about 128 GB of RAM. The dataset
has about 80M rows with only 4 features, so it barely occupies 2 GB on disk.

I am running the code on all 8 cores with 20 GB of memory using:

spark-submit --executor-memory 20G --master local[8] logistic_regression.py
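
(One thing I was not sure about: since this is local mode, the executor runs
inside the driver JVM, so I suspect --driver-memory is the flag that actually
applies here. A variant I am considering:

spark-submit --driver-memory 20G --master local[8] logistic_regression.py)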

It seems to take about 3.5 hours without caching and over 5 hours with
caching.
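
If it matters, the cache() call below uses the default MEMORY_ONLY storage
level. I wonder whether an explicit level that can spill to disk would behave
better; a minimal sketch of what I might try (the persist call is my
assumption, not something I have benchmarked yet):

from pyspark import StorageLevel

# Assumption: let partitions that don't fit in memory spill to disk
parsedData = _parsedData.persist(StorageLevel.MEMORY_AND_DISK)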

What is the recommended use for Spark on a beefy single machine?

Any suggestions will help!

Regards,
Krishna


Code sample:
---------------------------------------------------------------------------------------------------------------------
import sys
import time

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="logistic_regression")

# Dataset: path to the input CSV file, passed on the command line
d = sys.argv[1]
data = sc.textFile(d)

# Load and parse the data
# ----------------------------------------------------------------------------------------------------------
def parsePoint(line):
    # Each CSV line is: label, then the 4 features
    values = [float(x) for x in line.split(',')]
    return LabeledPoint(values[0], values[1:])

_parsedData = data.map(parsePoint)
parsedData = _parsedData.cache()
results = {}

# Spark
# ----------------------------------------------------------------------------------------------------------
start_time = time.time()
# Build the model
niters = 10
spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=niters)

# Evaluate the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, spark_model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
print("Training error = %s (%.1f s)" % (trainErr, time.time() - start_time))
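
For what it is worth, I also plan to check how many partitions textFile gives
me, in case the default under-utilizes the 8 cores. A sketch of what I intend
to try (the partition count of 8 is just my guess at matching local[8]):

# How many partitions did textFile create?
print(data.getNumPartitions())

# Repartition to match the 8 local cores before caching (8 is an assumption)
parsedData = data.map(parsePoint).repartition(8).cache()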
