Hello, I am trying to run a Python script that uses KMeans from MLlib, and I'm not getting anywhere. I'm using a c3.xlarge instance as the master and 10 c3.large instances as slaves. In the code I map over a 600 MB CSV file in S3, where each row has 128 integer columns. The problem is that around TID 7 my slave stops responding, and I cannot finish my processing. Could you help me with this problem? I am sending my script below for review.
Thank you, Gilberto
#!/usr/bin/env python
# coding: utf-8

from pyspark import SparkConf, SparkContext
from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt

conf = (SparkConf()
        .setMaster("spark://ec2-54-207-84-167.sa-east-1.compute.amazonaws.com:7077")
        .setAppName("Kmeans App")
        .set("spark.akka.frameSize", "20")
        .set("spark.executor.memory", "2048m"))
sc = SparkContext(conf = conf)

# Load and parse the data
data = sc.textFile("s3n://boomage-npc-production/general_files/features/geral.csv")
#data = sc.textFile("s3n://boo-kmeans-test/clustering/240x.csv")
parsedData = data.map(lambda line: array([int(x) for x in line.split(',')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 1000, maxIterations=10, runs=10, initializationMode="random")

print "{0} = {1}".format("Boooooooooooo", array(clusters.clusterCenters).shape)
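For a rough sense of scale, the sketch below estimates how much raw data the cluster centers for this job add up to. The values k=1000, runs=10 and 128 columns come from the script above; the 8 bytes per coordinate and the idea that every run keeps its own full set of k centers are assumptions, and the helper name `centers_size_bytes` is made up for illustration. It is a back-of-the-envelope calculation, not a measurement of what Spark actually sends.

```python
# Rough size estimate for the k-means centers.
# k, runs and dims are taken from the script; everything else is an assumption.
BYTES_PER_VALUE = 8  # assumption: each coordinate is stored as a 64-bit float


def centers_size_bytes(k, dims, runs=1, bytes_per_value=BYTES_PER_VALUE):
    """Return the raw size in bytes of runs * k centers with dims coordinates each."""
    return k * dims * runs * bytes_per_value


one_run = centers_size_bytes(k=1000, dims=128)
all_runs = centers_size_bytes(k=1000, dims=128, runs=10)

print("one run:  %.2f MB" % (one_run / 1e6))
print("ten runs: %.2f MB" % (all_runs / 1e6))
```

Under these assumptions the figure for ten runs can be compared against the "spark.akka.frameSize" value of 20 (in MB) set in the script. Whether this has anything to do with the slave that stops responding is not something this estimate can confirm.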