Hello, I am trying to run a Python script that uses KMeans from MLlib,
and I'm not getting anywhere. I'm using a c3.xlarge instance as the master
and 10 c3.large instances as slaves. In the code I map over a 600 MB
CSV file in S3, where each row has 128 integer columns. The problem is that
around TID 7 my slave stops responding, and I cannot finish the
processing. Could you help me with this problem? I am sending my script
attached for review.

Thank you,
Gilberto
#!/usr/bin/env python
# coding: utf-8
from pyspark import SparkConf, SparkContext
from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt

conf = (SparkConf()
        .setMaster("spark://ec2-54-207-84-167.sa-east-1.compute.amazonaws.com:7077")
        .setAppName("Kmeans App")
        .set("spark.akka.frameSize", "20")
        .set("spark.executor.memory", "2048m"))
sc = SparkContext(conf=conf)

# Load and parse the data
data = sc.textFile("s3n://boomage-npc-production/general_files/features/geral.csv")
#data = sc.textFile("s3n://boo-kmeans-test/clustering/240x.csv")

parsedData = data.map(lambda line: array([int(x) for x in line.split(',')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 1000, maxIterations=10,
                        runs=10, initializationMode="random")

# Print the shape of the array of cluster centers
print("{0} = {1}".format("Boooooooooooo", array(clusters.clusterCenters).shape))
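
For scale, a rough back-of-the-envelope estimate of how large the cluster centers get with the settings in the script. This is only a sketch under stated assumptions: it assumes each coordinate is stored as an 8-byte float and that the centers of all 10 runs are held at the same time, neither of which is verified here. The variable names are made up for illustration; k and runs come from the KMeans.train call, and the 128 columns come from the description of the CSV.

```python
# Rough size of the cluster centers implied by the KMeans.train call.
k = 1000             # number of clusters passed to KMeans.train
dims = 128           # integer columns per CSV row
runs = 10            # runs=10 in the script
bytes_per_value = 8  # ASSUMPTION: coordinates stored as 8-byte floats

centers_bytes_one_run = k * dims * bytes_per_value
# ASSUMPTION: centers for every run are kept at the same time
centers_bytes_all_runs = centers_bytes_one_run * runs

print("one run:  %.2f MB" % (centers_bytes_one_run / 1e6))
print("all runs: %.2f MB" % (centers_bytes_all_runs / 1e6))
```

Under these assumptions the result can be compared against the spark.akka.frameSize value of 20 (MB) set in the configuration. It says nothing definite about why the slave stops responding.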