On 2018-08-07, Stefan Ram <r...@zedat.fu-berlin.de> wrote:
> Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> writes:
>>In natural language, words are more complicated than just space-separated
>>units. Some languages don't use spaces as a word delimiter.
>
> Even above, the word »units« is neither directly preceded
> nor directly followed by a space.
>
> In the end, one can make an arbitrary choice about where one
> wants to place the border between syntax and morphology.
>
> For the case of English, I can define a word to be a
> sequence of letters (including the apostrophe) that is
> surrounded by non-letter characters.
>
>>Recognising open compound words is difficult. "Real estate" is an open
>>compound word, but "real cheese" and "my estate" are both two words.
>
> This is just part of the more general problem of parsing and
> interpreting a sentence. It is no more difficult than the
> interpretation of other pairs of words in a sentence.
>
>>Another problem for English speakers is deciding whether to treat
>>contractions as a single word, or split them:
>>"don't" --> "do" "n't"
>>"they'll" --> "they" "'ll"
>
> They are a single word by my definition. But this is just
> the surface of the input. The input could be translated into
> a "deep-structure" intermediate language that then splits
> some source words into several units or joins some source
> words into a single unit.
>
>>Punctuation marks should either be stripped out of sentences before
>>splitting into words, or treated as distinct tokens. We don't want
>>"tokens" and "tokens." to be treated as distinct words, just because one
>>happened to fall at the end of a sentence and one didn't.
>
> Yes, but this is quite trivial compared to the problem
> of parsing and interpreting a natural-language sentence.
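The letters-plus-apostrophe word definition above can be sketched in a few lines of plain stdlib Python. This is only an illustration of that definition (ASCII letters only, no NLTK); the `words` helper name is my own:

```python
import re

# A word, per the definition above: a maximal run of letters, optionally
# joined by internal apostrophes. Contractions like "don't" stay whole,
# and surrounding punctuation such as quotes or a full stop is excluded.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def words(text):
    """Return the words of `text` under the letters-plus-apostrophe definition."""
    return WORD.findall(text)

print(words("We don't want \"tokens\" and \"tokens.\" treated as distinct words."))
```

Note that `"tokens"` and `"tokens."` both come out as the bare word `tokens`, which addresses the punctuation point quoted above without a separate stripping pass.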
Thanks all for the replies. It seems that I do not really need NLTK;
split() will do for me.

Thanks again,

-- 
m...@ireland.com
Will Rant For Food
-- 
https://mail.python.org/mailman/listinfo/python-list