Re: [R] package "tm" fails to remove "the" with remove stopwords

Sam Thomas Fri, 13 Nov 2009 10:19:56 -0800

Mark,


It looks like removeWords removed "the" in every instance except where
"the" was the first word of the text. Maybe there is a parameter that
needs to be set? I couldn't find anything on the help page.

Here's an example of what I am seeing, using the "crude" dataset:

# removeWords does not appear to remove the first word of a document
require(tm)
data("crude")
crude[[1]]
removeWords(crude[[1]], "Shamrock")  # second word: removed
removeWords(crude[[1]], "Diamond")   # first word: not removed
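
For comparison, here is a minimal base-R sketch of whole-word removal
with a word-boundary regular expression. It is not tm's removeWords
implementation: remove_words_sketch is a hypothetical helper written for
this example, and the use of perl = TRUE and ignore.case = TRUE are
assumptions about what one would want, not documented tm behavior. Words
containing regex metacharacters would need escaping before being pasted
into the pattern.

```r
# Hypothetical helper, not part of tm: drop whole words by matching them
# between word boundaries (\b), including at the start of the string.
remove_words_sketch <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, ignore.case = TRUE, perl = TRUE)
}

docs <- c("the rain in Spain", "falls mainly on the plain")
cleaned <- remove_words_sketch(docs, "the")

# Removal leaves extra spaces behind; collapse them and trim the ends.
cleaned <- trimws(gsub("\\s+", " ", cleaned))
print(cleaned)
```

If a helper like this strips a leading word that removeWords leaves in
place, that points at the pattern used inside the package rather than at
a missing parameter.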

 

Sam Thomas

 

 

From: Mark Kimpel [mailto:mwkim...@gmail.com] 
Sent: Friday, November 13, 2009 11:47 AM
To: Sam Thomas
Cc: r-help@r-project.org; feine...@logic.at
Subject: Re: package "tm" fails to remove "the" with remove stopwords

 

Sam,

 

Thanks for the example. Removing stop words after the DocumentTermMatrix
has been created works fine if one is working with single words, but
what if one is creating a dtm of possible combinations of words?
Wouldn't one want to remove them from the corpus?
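
A small base-R sketch of why the order matters for word combinations.
The helpers drop_stopwords and bigrams, and the three-word stopword
list, are made up for this illustration; they are not tm or RWeka
functions.

```r
# Hypothetical helpers for illustration only.
drop_stopwords <- function(tokens, stopwords) {
  tokens[!(tolower(tokens) %in% stopwords)]
}

bigrams <- function(tokens) {
  if (length(tokens) < 2) return(character(0))
  # pair each token with the one that follows it
  paste(head(tokens, -1), tail(tokens, -1))
}

doc <- "jack and jill ran up the hill"
tokens <- strsplit(doc, " ", fixed = TRUE)[[1]]
stops <- c("and", "up", "the")  # toy stopword list, assumed for this example

# Pairing first: the stopwords end up inside the bigrams themselves.
raw_pairs <- bigrams(tokens)
print(raw_pairs)

# Filtering first: bigrams are built from the remaining words only.
clean_pairs <- bigrams(drop_stopwords(tokens, stops))
print(clean_pairs)
```

Once the pairs exist, filtering the term list on single stopwords no
longer matches terms such as "up the", which is why removal at the
corpus level is the natural place for it.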

 

Mark


Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN  46074

(317) 490-5129 Work, & Mobile & VoiceMail
(317) 399-1219 Skype No Voicemail please



On Thu, Nov 12, 2009 at 12:04 PM, Sam Thomas
<sam.tho...@revelanttech.com> wrote:

I'm not sure what's wrong with your approach, but this seems to
strip "the":

 

require(tm)

params <- list(minDocFreq = 1,
               removeNumbers = TRUE,
               stemming = TRUE,
               stopwords = TRUE,
               weighting = weightTf)

myDocument <- c("the rain in Spain", "falls mainly on the plain",
                "jack and jill ran up the hill", "to fetch a pail of water")
text.corp <- Corpus(VectorSource(myDocument))
dtm <- DocumentTermMatrix(text.corp, control = params)
dtm
dtm.mat <- as.matrix(dtm)
dtm.mat

 

 

From: Mark Kimpel [mailto:mwkim...@gmail.com] 
Sent: Thursday, November 12, 2009 11:30 AM
To: r-help@r-project.org; feine...@logic.at; Sam Thomas
Subject: package "tm" fails to remove "the" with remove stopwords

 

I am using code that previously removed stopwords correctly with package
"tm". Even manually adding "the" to the stopword list does not remove
it. The package has undergone extensive redevelopment, with changes to
the function syntax, so perhaps I am just missing something.

 

Please see my simple example, output, and sessionInfo() below.

 

Thanks!

Mark

 

require(tm)
myDocument <- c("the rain in Spain", "falls mainly on the plain",
                "jack and jill ran up the hill", "to fetch a pail of water")
text.corp <- Corpus(VectorSource(myDocument))
#########################
text.corp <- tm_map(text.corp, stripWhitespace)
text.corp <- tm_map(text.corp, removeNumbers)
text.corp <- tm_map(text.corp, removePunctuation)
## text.corp <- tm_map(text.corp, stemDocument)
text.corp <- tm_map(text.corp, removeWords, c("the", stopwords("english")))
dtm <- DocumentTermMatrix(text.corp)
dtm
dtm.mat <- as.matrix(dtm)
dtm.mat

 

> dtm.mat
    Terms
Docs falls fetch hill jack jill mainly pail plain rain ran spain the water
   1     0     0    0    0    0      0    0     0    1   0     1   1     0
   2     1     0    0    0    0      1    0     1    0   0     0   0     0
   3     0     0    1    1    1      0    0     0    0   1     0   0     0
   4     0     1    0    0    0      0    1     0    0   0     0   0     1
 

R version 2.10.0 Patched (2009-10-27 r50222)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] chron_2.3-33 RWeka_0.3-23 tm_0.5-1

loaded via a namespace (and not attached):
[1] grid_2.10.0  rJava_0.8-1  slam_0.1-6   tools_2.10.0

 

 

Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN  46074

(317) 490-5129 Work, & Mobile & VoiceMail
(317) 399-1219 Skype No Voicemail please

 



______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
