Re: [R] tm package: handling contractions

Tyler Rinker Fri, 27 Jan 2012 09:17:53 -0800


This may not be the answer to your problem, but you could gsub out the
"pretty apostrophe" for the one tm recognizes.  Also note that this may be due
to your use of Word, which automatically inserts the "pretty apostrophe".  The
default setting in MS Word can be altered to alleviate this.

#===============================
#using gsub
x <- "I didn’t know!"
x <- gsub("’", "'", x)
removePunctuation(x)
#===============================
#You could make that into a function and apply it to the corpus with tm_map
exchanger <- function(x) gsub("’", "'", x)
corp <- tm_map(corp, exchanger)
#===============================
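
If pasting the literal "pretty apostrophe" into a script causes trouble with
file encodings, the same substitution can be written with Unicode escapes.
This is only a rough sketch in base R: the name normalize_apostrophes is made
up for illustration and is not part of tm, and it assumes the text has been
read in as UTF-8, so the curly quotes are U+2018 and U+2019 rather than the
raw hex 92 byte.

```r
# Hypothetical helper (not a tm function): replace the curly single quotes
# U+2018 and U+2019 with the plain ASCII apostrophe (hex 27).
normalize_apostrophes <- function(x) {
  gsub("[\u2018\u2019]", "'", x)
}

x <- "I didn\u2019t know!"
normalize_apostrophes(x)
# [1] "I didn't know!"

# Assumption: in newer versions of tm, custom functions generally need to be
# wrapped in content_transformer() before being passed to tm_map, e.g.
# corp <- tm_map(corp, content_transformer(normalize_apostrophes))
```

Either way, the substitution probably needs to run before removePunctuation
in the tm_map chain so the contractions are still intact when it is applied.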


Cheers,
Tyler

----------------------------------------
> Date: Fri, 27 Jan 2012 09:50:51 -0500
> From: frien...@yorku.ca
> To: r-help@r-project.org
> Subject: [R] tm package: handling contractions
>
> I tried making a wordcloud of Obama's State of the Union address using
> the tm package to process the text
>
> sotu <- scan(file="c:/R/data/sotu2012.txt", what="character")
> sotu <- tolower(sotu)
> corp <-Corpus(VectorSource(paste(sotu, collapse=" ")))
>
> corp <- tm_map(corp, removePunctuation)
> corp <- tm_map(corp, stemDocument)
> corp <- tm_map(corp, function(x)removeWords(x,stopwords()))
> tdm <- TermDocumentMatrix(corp)
> m <- as.matrix(tdm)
> v <- sort(rowSums(m),decreasing=TRUE)
> d <- data.frame(word = names(v),freq=v)
>
> wordcloud(d$word,d$freq)
>
> I ended up with a large number of contractions that were split at the
> "’" character, e.g., "don’t" --> "don'"
> e.g.,
>
> > sotu[grep("’", sotu)]
> [1] "qaeda’s" "taliban’s" "america’s" "they’re" "don’t"
> [6] "we’re" "aren’t" "we’ve" "patton’s" "what’s"
> [11] "let’s" "weren’t," "couldn’t" "people’s" "didn’t"
> [16] "we’ve" "we’ve" "we’ve" "i’m" "that’s"
> [21] "world’s" "what’s" "can’t" "that’s" "it’s"
> [26] "lock’s" "let’s" "you’re" "shouldn’t" "you’re"
> [31] "you’re" "it’s" "i’ll" "we’re" "don’t"
> [36] "we’ve" "it’s" "it’s" "it’s" "they’re"
> ...
> [201] "didn’t" "bush’s" "didn’t" "can’t" "there’s"
> [206] "i’m" "other’s" "we’re"
> >
>
> NB: What appears as the ' character above is actually the character hex 92,
> not hex 27, on my Windows system.
>
> This should be a common problem in text processing, but I don't see a
> transformation in the tm package that
> handles this nicely. Is there something I've missed?
>
> -Michael
>
> --
> Michael Friendly Email: friendly AT yorku DOT ca
> Professor, Psychology Dept.
> York University Voice: 416 736-5115 x66249 Fax: 416 736-5814
> 4700 Keele Street Web: http://www.datavis.ca
> Toronto, ONT M3J 1P3 CANADA
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
                                          