Mark,
It looks like removeWords removed "the" in all instances except when "the" was the first word in your text. Maybe there is a parameter that needs to be set? I couldn't find anything on the help page.

Here's an example of what I am seeing using the "crude" dataset:

# function removeWords does not appear to remove the first word
require(tm)
data("crude")
crude[[1]]
removeWords(crude[[1]], "Shamrock")  # Second word removed
removeWords(crude[[1]], "Diamond")   # First word not removed

Sam Thomas

From: Mark Kimpel [mailto:mwkim...@gmail.com]
Sent: Friday, November 13, 2009 11:47 AM
To: Sam Thomas
Cc: r-help@r-project.org; feine...@logic.at
Subject: Re: package "tm" fails to remove "the" with remove stopwords

Sam,

Thanks for the example. Removing stop words after the DocumentTermMatrix has been created works fine if one is working with single words, but what if one is creating a dtm of possible combinations of words? Wouldn't one want to remove them from the corpus?

Mark

Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work, & Mobile & VoiceMail
(317) 399-1219 Skype No Voicemail please

On Thu, Nov 12, 2009 at 12:04 PM, Sam Thomas <sam.tho...@revelanttech.com> wrote:

I'm not sure what's wrong with your approach, but this seems to strip "the":

require(tm)
params <- list(minDocFreq = 1,
               removeNumbers = TRUE,
               stemming = TRUE,
               stopwords = TRUE,
               weighting = weightTf)
myDocument <- c("the rain in Spain",
                "falls mainly on the plain",
                "jack and jill ran up the hill",
                "to fetch a pail of water")
text.corp <- Corpus(VectorSource(myDocument))
dtm <- DocumentTermMatrix(text.corp, control = params)
dtm
dtm.mat <- as.matrix(dtm)
dtm.mat

From: Mark Kimpel [mailto:mwkim...@gmail.com]
Sent: Thursday, November 12, 2009 11:30 AM
To: r-help@r-project.org; feine...@logic.at; Sam Thomas
Subject: package "tm" fails to remove "the" with remove stopwords

I am using code that previously worked to remove stopwords using package "tm". Even manually adding "the" to the list does not work to remove "the". This package has undergone extensive redevelopment with changes to the function syntax, so perhaps I am just missing something. Please see my simple example, output, and sessionInfo() below.

Thanks!
Mark

require(tm)
myDocument <- c("the rain in Spain",
                "falls mainly on the plain",
                "jack and jill ran up the hill",
                "to fetch a pail of water")
text.corp <- Corpus(VectorSource(myDocument))
#########################
text.corp <- tm_map(text.corp, stripWhitespace)
text.corp <- tm_map(text.corp, removeNumbers)
text.corp <- tm_map(text.corp, removePunctuation)
## text.corp <- tm_map(text.corp, stemDocument)
text.corp <- tm_map(text.corp, removeWords, c("the", stopwords("english")))
dtm <- DocumentTermMatrix(text.corp)
dtm
dtm.mat <- as.matrix(dtm)
dtm.mat

> dtm.mat
    Terms
Docs falls fetch hill jack jill mainly pail plain rain ran spain the water
   1     0     0    0    0    0      0    0     0    1   0     1   1     0
   2     1     0    0    0    0      1    0     1    0   0     0   0     0
   3     0     0    1    1    1      0    0     0    0   1     0   0     0
   4     0     1    0    0    0      0    1     0    0   0     0   0     1

R version 2.10.0 Patched (2009-10-27 r50222)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] chron_2.3-33 RWeka_0.3-23 tm_0.5-1

loaded via a namespace (and not attached):
[1] grid_2.10.0  rJava_0.8-1  slam_0.1-6   tools_2.10.0
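
The symptom reported in this thread (a word removed everywhere except at the very start of the text) can be reproduced in base R with a pattern that requires a space before the word. The sketch below is only an illustration of that kind of failure and of a word-boundary alternative; it does not claim to show what tm 0.5-1 actually does internally. The helper names remove_words_naive and remove_words_boundary are hypothetical and are not part of tm.

```r
# Hypothetical helpers for illustration only; neither is part of tm.

# Naive version: the pattern demands a space before the word, so a word
# that starts the string is never matched.
remove_words_naive <- function(x, words) {
  pattern <- paste0(" (", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

# Boundary version: \b matches at the start of the string as well as
# between a non-word and a word character, so the first word is caught.
remove_words_boundary <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

docs <- c("the rain in Spain", "falls mainly on the plain")

naive    <- remove_words_naive(docs, "the")     # leading "the" survives
boundary <- remove_words_boundary(docs, "the")  # every "the" is removed

print(naive)
print(boundary)
```

The boundary version leaves the surrounding spaces in place, so a whitespace-collapsing step (such as the stripWhitespace transformation already used in the thread) would normally run after it rather than before.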