Hello, I have the following code snippet, in which I'm trying to list the term frequencies; first_text and second_text are .tex documents:

from sklearn.feature_extraction.text import CountVectorizer

training_documents = (first_text, second_text)
vectorizer = CountVectorizer()
vectorizer.fit_transform(training_documents)
print "Vocabulary:", vectorizer.vocabulary

When I run the script, I get the following:

  File "test.py", line 19, in <module>
    vectorizer.fit_transform(training_documents)
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 115, in decode
    doc = doc.decode(self.encoding, self.decode_error)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 200086: invalid start byte

How can I fix this issue?

Thanks.
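
A possible workaround, sketched under the assumption that the .tex files are not actually UTF-8 encoded: the traceback shows CountVectorizer decoding each document with self.encoding and self.decode_error, so one option is to decode the bytes yourself before handing them to the vectorizer. The helper name to_unicode and the Latin-1 fallback below are illustrative assumptions, not something taken from the original script; byte 0xa2 happens to be a valid character in Latin-1, but the real encoding of the files may differ.

```python
def to_unicode(raw, encodings=("utf-8", "latin-1")):
    """Decode a byte string, trying each candidate encoding in order.

    Hypothetical helper: the candidate list is a guess and should be
    replaced with the encoding the files were really saved in.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # No candidate worked: decode anyway, marking bad bytes with U+FFFD.
    return raw.decode(encodings[0], "replace")


# 0xa2 is not a valid UTF-8 start byte, so the first attempt fails
# and the Latin-1 fallback is used instead.
sample = b"price: 50\xa2"
print(repr(to_unicode(sample)))

# With the documents from the original post, this would look like:
#   training_documents = (to_unicode(first_text), to_unicode(second_text))
# Alternatively, the vectorizer can be told how to decode, e.g.
#   CountVectorizer(encoding="latin-1")       # assumed encoding
#   CountVectorizer(decode_error="replace")   # keep UTF-8, replace bad bytes
```

Decoding up front makes the encoding choice explicit and easy to inspect, whereas decode_error="replace" or "ignore" keeps the script short but quietly alters or drops the characters that could not be decoded.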