Hi there,
I would like to do text clustering with k-means on Spark, over a massive
dataset. Before running k-means I have to pre-process the data: clean the text
with NLTK (stop-word removal and stemming) and turn each document into a term
vector (eventually TF-IDF). The following is my current code in Python:

import csv
import re
import string
import sys
from collections import defaultdict

import nltk
from nltk.stem import PorterStemmer

if __name__ == '__main__':
    # Cluster a bunch of text documents.
    k = 6        # number of clusters for k-means
    vocab = {}   # word -> column index
    xs = []      # one sparse term-count vector per document
    ns = []      # document texts, taken from row[3]
    cat = []     # document categories, taken from row[4]
    filename = '2013-01.csv'
    with open(filename, newline='') as f:
        try:
            newsreader = csv.reader(f)
            for row in newsreader:
                ns.append(row[3])
                cat.append(row[4])
        except csv.Error as e:
            sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num, e))

    # regex to remove special characters
    remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))
    remove_num = re.compile(r'\d+')
    # nltk.download('stopwords')
    stop_words = set(nltk.corpus.stopwords.words('english'))
    stemmer = PorterStemmer()

    for a in ns:
        x = defaultdict(float)
        a1 = a.strip().lower()
        a2 = remove_spl_char_regex.sub(" ", a1)  # Remove special characters
        a3 = remove_num.sub("", a2)  # Remove numbers
        # Remove stop words
        words = a3.split()
        filter_stop_words = [w for w in words if w not in stop_words]
        # stem_word() was removed in NLTK 3; stem() is the current method
        stemmed = [stemmer.stem(w) for w in filter_stop_words]
        ws = sorted(stemmed)

        # ws = re.findall(r"\w+", a1)
        for w in ws:
            vocab.setdefault(w, len(vocab))
            x[vocab[w]] += 1
        xs.append(list(x.items()))

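To make the question more concrete: apart from the shared vocab dictionary,
the per-document steps above do not depend on any other document, so I think
they can be isolated as a pure function, which is the shape a distributed map
would need. The following is only a minimal sketch under assumptions, not a
working Spark job. The names tokenize, term_counts and STOP_WORDS are
hypothetical, the tiny stop-word list stands in for the NLTK one, and stemming
defaults to the identity so that the sketch runs without NLTK.

```python
import re
import string
from collections import Counter

# Assumption: a tiny illustrative stop-word list; the real code would use
# nltk.corpus.stopwords.words('english') instead.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})

_PUNCT = re.compile("[%s]" % re.escape(string.punctuation))
_DIGITS = re.compile(r"\d+")


def tokenize(text, stem=lambda w: w):
    """Lowercase, strip punctuation and digits, drop stop words, then stem.

    `stem` defaults to the identity so the sketch has no NLTK dependency;
    PorterStemmer().stem could be passed in its place.
    """
    cleaned = _DIGITS.sub("", _PUNCT.sub(" ", text.strip().lower()))
    return [stem(w) for w in cleaned.split() if w not in STOP_WORDS]


def term_counts(text):
    """Map one document to a {term: count} dictionary."""
    return dict(Counter(tokenize(text)))


if __name__ == "__main__":
    doc = "The price of oil rose 3% in 2013, and the price of gas fell."
    print(term_counts(doc))
```

My guess is that in PySpark such a function would be applied with something
like rdd.map(tokenize), and that HashingTF and IDF from pyspark.mllib.feature
would replace the driver-side vocab dictionary, but I have not verified this
and I do not know how the CSV parsing should be handled.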
Can anyone explain how I can do this pre-processing step in Spark before
running k-means?
Best Regards
.......................................................
Amin Mohebbi
PhD candidate in Software Engineering
at university of Malaysia
Tel : +60 18 2040 017
E-Mail : [email protected]
[email protected]