Re: [R] package "tm" fails to remove "the" with remove stopwords

2009-11-16 Thread Mark Kimpel
Thanks, Ingo.

On Sun, Nov 15, 2009 at 11:05 AM, Ingo Feinerer wrote:

Re: [R] package "tm" fails to remove "the" with remove stopwords

2009-11-15 Thread Ingo Feinerer
On Thu, Nov 12, 2009 at 11:29:50AM -0500, Mark Kimpel wrote:
> I am using code that previously worked to remove stopwords using package "tm".

Thanks for reporting. This is a bug in the removeWords() function in tm version 0.5-1 available from CRAN:

> require(tm)
> myDocument <- c("the rain in Spa

Re: [R] package "tm" fails to remove "the" with remove stopwords

2009-11-13 Thread Sam Thomas
Mark,

It looks like removeWords removed "the" in all instances except when "the" was the first word in your text. Maybe there is a parameter that needs to be set? I couldn't find anything on the help page. Here's an example of what I am seeing using the "crude" dataset:

#function re
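
For comparison, here is a minimal base R sketch of whole-word removal that also catches a word at the very start of the text. It is not tm's implementation and does not diagnose the bug; `remove_words` is a hypothetical helper, the sample sentence is made up, and the words are assumed to contain no regex metacharacters.

```r
# Hypothetical helper (not part of tm): drop whole-word matches anywhere in
# the string, including position one, by anchoring on word boundaries (\\b).
remove_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, ignore.case = TRUE, perl = TRUE)
  # Collapse the whitespace left behind, then trim both ends.
  x <- gsub("\\s+", " ", x, perl = TRUE)
  trimws(x)
}

txt <- "The rain in the valley stays mainly in the plain"
cleaned <- remove_words(txt, c("the", "in"))
print(cleaned)
```

Because `\\b` matches at the start of the string as well as after a space, the leading "The" is removed along with the later occurrences.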

Re: [R] package "tm" fails to remove "the" with remove stopwords

2009-11-13 Thread Mark Kimpel
Sam,

Thanks for the example. Removing stop words after the DocumentTermMatrix has been created works fine if one is working with single words, but what if one is creating a dtm of possible combinations of words? Wouldn't one want to remove them from the corpus?

Mark
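
The difference between the two orderings can be sketched in base R, without tm. The `bigrams` helper, the stopword list, and the sample text are all hypothetical and only illustrate the point about word combinations.

```r
# Hypothetical helper: build adjacent word pairs (bigrams) from one string.
bigrams <- function(x) {
  tokens <- strsplit(tolower(x), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < 2) return(character(0))
  paste(tokens[-length(tokens)], tokens[-1])
}

stop_list <- c("the", "in")  # illustrative stopword list
txt <- "rain in the valley"

# Late filtering: pairs are built first, so one can only drop every pair
# that contains a stopword.
all_pairs <- bigrams(txt)
stop_pattern <- paste0("\\b(", paste(stop_list, collapse = "|"), ")\\b")
pairs_filtered_late <- all_pairs[!grepl(stop_pattern, all_pairs, perl = TRUE)]

# Early filtering: stopwords leave the text first, so the remaining words
# can pair up with each other.
tokens <- strsplit(tolower(txt), "\\s+")[[1]]
kept <- tokens[!tokens %in% stop_list]
pairs_filtered_early <- bigrams(paste(kept, collapse = " "))

print(pairs_filtered_late)
print(pairs_filtered_early)
```

In this toy case late filtering leaves no pairs at all, while early filtering yields the pair "rain valley". Whether that bridged pair is desirable depends on the analysis.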

Re: [R] package "tm" fails to remove "the" with remove stopwords

2009-11-12 Thread Sam Thomas
I'm not sure what's wrong with your approach, but this seems to strip "the":

require(tm)
params <- list(minDocFreq = 1, removeNumbers = TRUE, stemming = TRUE, stopwords = TRUE,

[R] package "tm" fails to remove "the" with remove stopwords

2009-11-12 Thread Mark Kimpel
I am using code that previously worked to remove stopwords using package "tm". Even manually adding "the" to the list does not work to remove "the". This package has undergone extensive redevelopment with changes to the function syntax, so perhaps I am just missing something. Please see my simple