On Fri, 03 Aug 2018 07:49:40 +0000, mausg wrote:

> I like to analyse text. my method consisted of something like
> words=text.split(), which would split the text into space-seperated
> units.
In natural language, words are more complicated than just space-separated
units. Some languages don't use spaces as a word delimiter. Some don't use
word delimiters at all.

Even in English, we have *compound words* which exist in three forms:

- open: "ice cream"
- closed: "notebook"
- hyphenated: "long-term"

Recognising open compound words is difficult. "Real estate" is an open
compound word, but "real cheese" and "my estate" are both two words.

Another problem for English speakers is deciding whether to treat
contractions as a single word or to split them:

    "don't"   --> "do" "n't"
    "they'll" --> "they" "'ll"

Punctuation marks should either be stripped out of sentences before
splitting into words, or treated as distinct tokens. We don't want
"tokens" and "tokens." to be treated as distinct words just because one
happened to fall at the end of a sentence and one didn't.

> then I tried to use the Python NLTK library, which had alot of
> features I wanted, but using `word-tokenize' gives a different
> answer.-
>
> What gives?.

I'm pretty sure the function isn't called "word-tokenize". That would mean
"word subtract tokenize" in Python code. Do you mean word_tokenize?

Have you compared the output of the two and looked at how they differ?
(There's a rough sketch of such a comparison at the end of this post.)
If there is too much output to compare by eye, you could convert both to
sets and check the set difference, also sketched below.

Or try reading the documentation for word_tokenize:

http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.treebank.TreebankWordTokenizer


-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing it
everywhere." -- Jon Ronson
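
P.S. Here is a rough, untested sketch of the comparison I have in mind,
assuming the function you meant is nltk.tokenize.word_tokenize and that
the "punkt" tokenizer data has been downloaded. The sample sentence is
made up; substitute whatever text you are analysing.

    import nltk
    from nltk.tokenize import word_tokenize

    # word_tokenize relies on the "punkt" tokenizer data.
    nltk.download("punkt", quiet=True)

    text = "They'll love this ice cream, won't they? Don't count on it."

    # str.split() keeps contractions whole and leaves punctuation stuck
    # to the neighbouring word ("cream,", "they?", "it.").
    print(text.split())

    # word_tokenize splits contractions ("They" + "'ll", "Do" + "n't")
    # and emits punctuation marks as separate tokens (",", "?", ".").
    print(word_tokenize(text))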
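
And a minimal sketch of the set-difference check mentioned above, under
the same assumptions (the example sentence is again just for illustration):

    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)

    text = "We don't want tokens and tokens. to be treated as distinct words."

    split_tokens = set(text.split())
    nltk_tokens = set(word_tokenize(text))

    # Tokens produced only by str.split(), e.g. "don't" and "tokens."
    print(sorted(split_tokens - nltk_tokens))

    # Tokens produced only by word_tokenize, e.g. "do", "n't" and "."
    print(sorted(nltk_tokens - split_tokens))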