Hi all, I am new to Spark and am trying to run LogisticRegression (with SGD) using MLlib on a beefy single machine with about 128 GB of RAM. The dataset has about 80M rows and only 4 features, so it barely occupies 2 GB on disk.
I am running the code using all 8 cores with 20G of memory via:

spark-submit --executor-memory 20G --master local[8] logistic_regression.py

It takes about 3.5 hours without caching and over 5 hours with caching. What is the recommended way to use Spark on a beefy single machine? Any suggestions would help!

Regards,
Krishna

Code sample:
---------------------------------------------------------------------------------------------------------------------
import sys
import time

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="logistic_regression")

# Dataset
d = sys.argv[1]
data = sc.textFile(d)

# Load and parse the data
# ----------------------------------------------------------------------------------------------------------
def parsePoint(line):
    # First column is the label; the remaining 4 columns are the features.
    values = [float(x) for x in line.split(',')]
    return LabeledPoint(values[0], values[1:])

_parsedData = data.map(parsePoint)
parsedData = _parsedData.cache()
results = {}

# Spark
# ----------------------------------------------------------------------------------------------------------
start_time = time.time()

# Build the model
niters = 10
spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=niters)

# Evaluate the model on the training data
labelsAndPreds = parsedData.map(lambda p: (p.label, spark_model.predict(p.features)))
trainErr = (labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count()
            / float(parsedData.count()))
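One note on the cached timing above: cache() is lazy, so the first action on parsedData also pays the one-time cost of materializing the cache. A minimal sketch of separating that cost from the training time, assuming the same parsedData RDD and niters as in the code above:

# cache() is lazy: run an action first so the RDD is materialized in memory
# before the timer starts (count() is just a convenient action to trigger it).
parsedData.count()

start_time = time.time()
spark_model = LogisticRegressionWithSGD.train(parsedData, iterations=niters)
print("training took %.1f seconds" % (time.time() - start_time))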