Pre-processing is a major part of the workload before training a model. MLlib provides TF-IDF computation, StandardScaler, and Normalizer, which cover the essential pre-processing steps and should be a great help for model training.
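For example, here is a rough, untested sketch of that pipeline in PySpark (assuming a Spark release where HashingTF, IDF, and Normalizer are exposed in pyspark.mllib.feature; the file name, tokenizer, and parameters below are only placeholders):

    from pyspark import SparkContext
    from pyspark.mllib.feature import HashingTF, IDF, Normalizer
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="text-clustering")

    # One document per line; tokenization kept deliberately simple here.
    docs = sc.textFile("docs.txt").map(lambda line: line.lower().split())

    # Term frequencies as fixed-size hashed vectors.
    tf = HashingTF(numFeatures=1 << 18).transform(docs)
    tf.cache()

    # Fit IDF on the corpus, then weight the term frequencies.
    tfidf = IDF().fit(tf).transform(tf)

    # L2-normalize so k-means distances are not dominated by document length.
    normalized = Normalizer().transform(tfidf)
    normalized.cache()

    model = KMeans.train(normalized, k=6, maxIterations=20)

StandardScaler follows the same fit/transform pattern if you want unit-variance features instead of L2-normalized ones.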
Take a look at this: http://spark.apache.org/docs/latest/mllib-feature-extraction.html

2014-11-21 7:18 GMT+08:00 Jun Yang <yangjun...@gmail.com>:

> Guys,
>
> As to the questions on pre-processing, you can simply migrate your logic to
> Spark before running K-means.
>
> I have only used Scala on Spark and haven't used the Python binding, but I
> think the basic steps must be the same.
>
> BTW, if your data set is big, with huge sparse-dimension feature vectors,
> K-means may not work as well as you expect. I think this is still an area
> where Spark MLlib can be optimized.
>
> On Wed, Nov 19, 2014 at 2:21 PM, amin mohebbi <aminn_...@yahoo.com.invalid>
> wrote:
>
>> Hi there,
>>
>> I would like to do text clustering using k-means and Spark on a massive
>> dataset. As you know, before running k-means I have to apply
>> pre-processing methods such as TF-IDF and NLTK to my big dataset. The
>> following is my code in Python:
>>
>> if __name__ == '__main__':
>>     # Cluster a bunch of text documents.
>>     import csv
>>     import re
>>     import string
>>     import sys
>>     import nltk
>>     from collections import defaultdict
>>     from nltk.stem.porter import PorterStemmer
>>
>>     k = 6
>>     vocab = {}
>>     xs = []
>>     ns = []
>>     cat = []
>>     filename = '2013-01.csv'
>>     with open(filename, newline='') as f:
>>         try:
>>             newsreader = csv.reader(f)
>>             for row in newsreader:
>>                 ns.append(row[3])
>>                 cat.append(row[4])
>>         except csv.Error as e:
>>             sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num, e))
>>
>>     # Regexes to strip punctuation and digits.
>>     remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))
>>     remove_num = re.compile(r'[\d]+')
>>     # nltk.download()
>>     stop_words = nltk.corpus.stopwords.words('english')
>>
>>     for a in ns:
>>         x = defaultdict(float)
>>         a1 = a.strip().lower()
>>         a2 = remove_spl_char_regex.sub(" ", a1)  # Remove special characters
>>         a3 = remove_num.sub("", a2)              # Remove numbers
>>         # Remove stop words, then stem
>>         words = a3.split()
>>         filter_stop_words = [w for w in words if w not in stop_words]
>>         stemed = [PorterStemmer().stem_word(w) for w in filter_stop_words]
>>         ws = sorted(stemed)
>>         # ws = re.findall(r"\w+", a1)
>>         for w in ws:
>>             vocab.setdefault(w, len(vocab))
>>             x[vocab[w]] += 1
>>         xs.append(x.items())
>>
>> Can anyone explain to me how I can do the pre-processing step before
>> running k-means using Spark?
>>
>> Best Regards
>>
>> .......................................................
>>
>> Amin Mohebbi
>>
>> PhD candidate in Software Engineering
>> at University of Malaysia
>>
>> Tel: +60 18 2040 017
>>
>> E-Mail: tp025...@ex.apiit.edu.my
>>         amin_...@me.com
>
>
> --
> yangjun...@gmail.com
> http://hi.baidu.com/yjpro
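Coming back to the quoted question: one way to move the NLTK-style cleanup to Spark is to run it inside a map over a document RDD and only then hand the token lists to the TF-IDF and k-means steps sketched earlier in this mail. A rough, untested sketch (assuming NLTK and its stopwords corpus are installed on every worker node, and that the document text sits in the fourth csv column as in the quoted script):

    import re
    import string

    import nltk
    from nltk.stem.porter import PorterStemmer

    from pyspark import SparkContext

    sc = SparkContext(appName="text-preprocessing")

    # Same cleanup as the quoted script: strip punctuation and digits,
    # drop English stop words, stem with the Porter stemmer.
    punct = re.compile('[%s]' % re.escape(string.punctuation))
    digits = re.compile(r'\d+')
    stop_words = sc.broadcast(set(nltk.corpus.stopwords.words('english')))
    stemmer = PorterStemmer()

    def tokenize(text):
        cleaned = digits.sub("", punct.sub(" ", text.strip().lower()))
        return [stemmer.stem(w) for w in cleaned.split()
                if w not in stop_words.value]

    # Column 3 holds the document text, as in the quoted csv-reading loop.
    # Splitting on commas is only a placeholder; a real job should parse csv properly.
    docs = sc.textFile("2013-01.csv").map(lambda line: line.split(",")[3])
    tokens = docs.map(tokenize)

    # `tokens` can now go straight into HashingTF / IDF / KMeans as above.

The point is that nothing NLTK-specific has to change; the cleanup simply runs on the workers, partition by partition, instead of in a single Python process.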