On Friday, 27 January 2012 at 09:50 -0500, Michael Friendly wrote:
> I tried making a wordcloud of Obama's State of the Union address using
> the tm package to process the text
>
> sotu <- scan(file="c:/R/data/sotu2012.txt", what="character")
> sotu <- tolower(sotu)
> corp <-Corpus(VectorSource(paste(sotu, collapse=" ")))
>
> corp <- tm_map(corp, removePunctuation)
> corp <- tm_map(corp, stemDocument)
> corp <- tm_map(corp, function(x)removeWords(x,stopwords()))
> tdm <- TermDocumentMatrix(corp)
> m <- as.matrix(tdm)
> v <- sort(rowSums(m),decreasing=TRUE)
> d <- data.frame(word = names(v),freq=v)
>
> wordcloud(d$word,d$freq)
>
> I ended up with a large number of contractions that were split at the
> "’" character, e.g., "don’t" --> "don'"
> e.g.,
>
> > sotu[grep("’", sotu)]
>   [1] "qaeda’s"   "taliban’s" "america’s" "they’re"   "don’t"
>   [6] "we’re"     "aren’t"    "we’ve"     "patton’s"  "what’s"
>  [11] "let’s"     "weren’t,"  "couldn’t"  "people’s"  "didn’t"
>  [16] "we’ve"     "we’ve"     "we’ve"     "i’m"       "that’s"
>  [21] "world’s"   "what’s"    "can’t"     "that’s"    "it’s"
>  [26] "lock’s"    "let’s"     "you’re"    "shouldn’t" "you’re"
>  [31] "you’re"    "it’s"      "i’ll"      "we’re"     "don’t"
>  [36] "we’ve"     "it’s"      "it’s"      "it’s"      "they’re"
> ...
> [201] "didn’t"    "bush’s"    "didn’t"    "can’t"     "there’s"
> [206] "i’m"       "other’s"   "we’re"
> >
>
> NB: What appears as the ' character above is actually the character hex 92,
> not hex 27 on my Windows system.
>
> This should be a common problem in text processing, but I don't see a
> transformation in the tm package that
> handles this nicely. Is there something I've missed?

What result would you expect?

As I see it, ideally, removePunctuation() would remove these
apostrophes. It looks like it doesn't; the code is:
removePunctuation <- function(x) UseMethod("removePunctuation", x)
removePunctuation.PlainTextDocument <- function(x) gsub("[[:punct:]]+", " ", x)

And ?regexp says:
     ‘[:punct:]’ Punctuation characters:
          ‘! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~’.

Maybe the ’ apostrophe should be added to the list? (FWIW, it's the
"real" character for apostrophe in Unicode.)

I discussed a related issue about apostrophes with Ingo Feinerer and
Kurt Hornik: in French, we'd need apostrophes (of any type, ' or ’) to
mark a separation between words, instead of concatenating the two parts
surrounding them. The conclusion was that a language-specific processor
is required (languages with non-Latin alphabets have many more diacritic
characters that we don't even know about).

In English, I suspect it might be interesting to detect forms like "'re"
or "n't" and replace them with their full equivalents, i.e. "are" and
"not"; OTOH, genitive forms would probably be better removed (at least by
default).

In the short term, Tyler's solution will work, but beware that "we're"
will become "were" if you remove punctuation ;-). An alternative is to
replace apostrophes with spaces so that suffixes are treated as separate
words (that's what I do in French ATM).

Hope this helps

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
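
To make the two suggestions above concrete, here is a rough base-R sketch (no tm functions involved). The helper names normalize_apostrophes(), expand_contractions() and split_on_apostrophes() are made up for this illustration, the list of contractions is deliberately short, and the code assumes the text has already been lowercased (as in the original script) and that ’ reaches R as U+2019.

```r
# Hypothetical helpers, not part of tm: base R only.

# Map typographic single quotes (U+2018, U+2019) to the ASCII apostrophe
# (hex 27). Unicode escapes keep the script independent of the file encoding.
normalize_apostrophes <- function(x) {
  for (ch in c("\u2018", "\u2019")) x <- gsub(ch, "'", x, fixed = TRUE)
  x
}

# Expand a few English contractions and drop the genitive/contracted 's.
# Assumes lowercased input; the rule list is illustrative, not exhaustive.
expand_contractions <- function(x) {
  x <- normalize_apostrophes(x)
  rules <- c(
    "\\bcan't\\b" = "can not",   # irregular forms first
    "\\bwon't\\b" = "will not",
    "n't\\b"      = " not",      # don't -> do not, weren't -> were not
    "'re\\b"      = " are",
    "'ve\\b"      = " have",
    "'ll\\b"      = " will",
    "'m\\b"       = " am",
    "'s\\b"       = ""           # america's -> america (also it's -> it)
  )
  for (i in seq_along(rules)) {
    x <- gsub(names(rules)[i], rules[[i]], x, perl = TRUE)
  }
  x
}

# Alternative (what I do for French): turn every apostrophe into a space,
# so that the two parts become separate words.
split_on_apostrophes <- function(x) {
  gsub("'", " ", normalize_apostrophes(x), fixed = TRUE)
}

words <- c("don\u2019t", "we\u2019re", "america\u2019s", "can\u2019t", "i\u2019ll")
print(expand_contractions(words))
print(split_on_apostrophes("l\u2019homme"))
```

With the tm version used in this thread, something like corp <- tm_map(corp, expand_contractions) placed before the removePunctuation step should do; more recent tm versions expect such a function to be wrapped in content_transformer(). Note that "we’re" then comes out as "we are" rather than "were", and most of the expanded words ("are", "not", "will") are likely to be dropped by the stop word step anyway.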