Hi all,
I am new to Spark and am trying to run LogisticRegression (with SGD)
using MLlib on a beefy single machine with about 128 GB of RAM. The dataset has
about 80M rows with only 4 features, so it barely occupies 2 GB on disk.
I am running the code on all 8 cores with 20 GB of memory using
spark-submit --executor-memory 20G --master local[8] logistic_regression.py
The job takes about 3.5 hours without caching and over 5 hours with
caching.
What is the recommended way to run Spark on a beefy single machine?
Any suggestions will help!
Regards,
Krishna
Code sample:
---------------------------------------------------------------------------------------------------------------------
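import sys
import time

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

# Assumption: the context is created here; the original sample does not show this step
sc = SparkContext(appName="logistic_regression")
# Dataset
d = sys.argv[1]
data = sc.textFile(d)
# Load and parse the data
# ----------------------------------------------------------------------------------------------------------
def parsePoint(line):
    # The first column is the label, the remaining columns are the features
    values = [float(x) for x in line.split(',')]
    return LabeledPoint(values[0], values[1:])

_parsedData = data.map(parsePoint)
parsedData = _parsedData.cache()
results = {}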
results = {}
# Spark
# ----------------------------------------------------------------------------------------------------------
start_time = time.time()
# Build the model
niters = 10
spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=niters)
# Evaluate the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, spark_model.predict(p.features)))
# Index into the (label, prediction) pair; tuple-unpacking lambdas only work in Python 2
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
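
For reference, the parsing and training-error logic above can be checked on a handful of rows without Spark. This is only a minimal sketch under assumptions: the sample rows are made up, parse_line and training_error are hypothetical helpers that mirror parsePoint and the trainErr expression, and the predictions are hard-coded rather than produced by a model.

```python
# Hypothetical stand-ins for parsePoint and the trainErr expression,
# using plain Python lists instead of RDDs and LabeledPoint.

def parse_line(line):
    """Split a CSV line into (label, features); the label is the first column."""
    values = [float(x) for x in line.split(',')]
    return values[0], values[1:]


def training_error(labels_and_preds):
    """Fraction of (label, prediction) pairs that disagree."""
    pairs = list(labels_and_preds)
    wrong = sum(1 for label, pred in pairs if label != pred)
    return wrong / float(len(pairs))


# Made-up sample rows: a label followed by 4 features.
sample_lines = [
    "1,0.5,1.2,3.4,0.1",
    "0,0.3,0.8,2.2,0.9",
    "1,1.5,0.2,1.1,0.4",
    "0,0.7,0.6,0.5,0.3",
]
parsed = [parse_line(line) for line in sample_lines]

# Hard-coded predictions standing in for spark_model.predict(features).
fake_preds = [1.0, 1.0, 1.0, 0.0]
labels_and_preds = [(label, pred) for (label, _), pred in zip(parsed, fake_preds)]
sample_err = training_error(labels_and_preds)
```

Running something like this on a few rows first is a cheap way to confirm the column layout and the error formula before spending hours on the full 80M-row job.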