Pre-processing is a major part of the workload before training a model.
MLlib provides TF-IDF calculation, StandardScaler and Normalizer, which are
essential for preprocessing and can be a great help for model training.

Take a look at this
http://spark.apache.org/docs/latest/mllib-feature-extraction.html
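
For example, with the Python API the TF-IDF and normalization steps could look
roughly like this (a minimal sketch: it assumes a Spark release whose
pyspark.mllib.feature module exposes HashingTF, IDF and Normalizer, and the
input path and tokenization are placeholders):

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF, Normalizer

sc = SparkContext(appName="tfidf-preprocessing")

# one document per line; tokenization here is deliberately naive
docs = sc.textFile("docs.txt").map(lambda line: line.lower().split())

tf = HashingTF().transform(docs)   # hashed term-frequency vectors (sparse)
tf.cache()                         # reused by IDF().fit() and transform()
idf = IDF().fit(tf)                # inverse document frequencies over the corpus
tfidf = idf.transform(tf)          # TF-IDF weighted vectors

normalized = Normalizer().transform(tfidf)  # L2-normalize each vector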

2014-11-21 7:18 GMT+08:00 Jun Yang <yangjun...@gmail.com>:

> Guys,
>
> As to the question of pre-processing, you could just migrate your logic
> to Spark before using K-means.
>
> I have only used Scala on Spark and haven't used the Python bindings, but
> I think the basic steps should be the same.
>
> BTW, if your data set is large with very high-dimensional sparse feature
> vectors, K-Means may not work as well as you expect. I think this is still
> an area of Spark MLlib that needs optimization.
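
As a rough sketch of that migration, the per-document cleaning can become an
RDD transformation and the clustering a call to MLlib's KMeans (illustrative
only: the stop-word list, feature count, input path and KMeans parameters
below are placeholders, and stemming is omitted):

import re
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="text-kmeans")

stop_words = set(['the', 'a', 'and'])   # placeholder; use NLTK's English list

def clean(line):
    # lowercase, strip punctuation and digits, drop stop words
    tokens = re.sub(r'[^a-z\s]', ' ', line.lower()).split()
    return [t for t in tokens if t not in stop_words]

docs = sc.textFile("2013-01.csv").map(clean)        # one document per line

vectors = HashingTF(numFeatures=10000).transform(docs)
vectors.cache()

# feed the vectors (optionally TF-IDF weighted, as above) to K-means
model = KMeans.train(vectors, k=6, maxIterations=20)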
>
> On Wed, Nov 19, 2014 at 2:21 PM, amin mohebbi <aminn_...@yahoo.com.invalid
> > wrote:
>
>> Hi there,
>>
>> I would like to do text clustering using k-means and Spark on a massive
>> dataset. As you know, before running k-means I have to apply pre-processing
>> steps such as TF-IDF weighting and NLTK-based cleaning to my big dataset.
>> The following is my Python code:
>>
>> if __name__ == '__main__':
>>     # Cluster a bunch of text documents.
>>     import re
>>     import sys
>>     import csv
>>     import string
>>     import nltk
>>     from collections import defaultdict
>>     from nltk.stem.porter import PorterStemmer
>>
>>     k = 6
>>     vocab = {}
>>     xs = []
>>     ns = []
>>     cat = []
>>     filename = '2013-01.csv'
>>     with open(filename, newline='') as f:
>>         try:
>>             newsreader = csv.reader(f)
>>             for row in newsreader:
>>                 ns.append(row[3])
>>                 cat.append(row[4])
>>         except csv.Error as e:
>>             sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num, e))
>>
>>     # regexes to strip special characters and numbers
>>     remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))
>>     remove_num = re.compile(r'\d+')
>>     # nltk.download()
>>     stop_words = nltk.corpus.stopwords.words('english')
>>     stemmer = PorterStemmer()
>>
>>     for a in ns:
>>         x = defaultdict(float)
>>         a1 = a.strip().lower()
>>         a2 = remove_spl_char_regex.sub(" ", a1)  # remove special characters
>>         a3 = remove_num.sub("", a2)              # remove numbers
>>         # remove stop words, then stem
>>         words = a3.split()
>>         filter_stop_words = [w for w in words if w not in stop_words]
>>         stemed = [stemmer.stem(w) for w in filter_stop_words]
>>         ws = sorted(stemed)  # ws = re.findall(r"\w+", a1)
>>         for w in ws:
>>             vocab.setdefault(w, len(vocab))
>>             x[vocab[w]] += 1
>>         xs.append(x.items())
>>
>> Can anyone explain to me how I can do this pre-processing step before
>> running k-means in Spark?
>>
>>
>> Best Regards
>>
>> .......................................................
>>
>> Amin Mohebbi
>>
>> PhD candidate in Software Engineering
>>  at the University of Malaysia
>>
>> Tel : +60 18 2040 017
>>
>>
>>
>> E-Mail : tp025...@ex.apiit.edu.my
>>
>>               amin_...@me.com
>>
>
>
>
> --
> yangjun...@gmail.com
> http://hi.baidu.com/yjpro
>
