I'm not sure what's wrong with your approach, but this seems to strip "the":

 

require(tm)

# Preprocessing options, passed to DocumentTermMatrix() via its control argument
params <- list(minDocFreq = 1,
               removeNumbers = TRUE,
               stemming = TRUE,
               stopwords = TRUE,
               weighting = weightTf)

myDocument <- c("the rain in Spain", "falls mainly on the plain",
                "jack and jill ran up the hill", "to fetch a pail of water")
text.corp <- Corpus(VectorSource(myDocument))

dtm <- DocumentTermMatrix(text.corp, control = params)
dtm

dtm.mat <- as.matrix(dtm)
dtm.mat
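
If it helps to cross-check the result outside of tm, here is a minimal base-R sketch of stopword removal with a word-boundary regular expression. This is not how tm implements removeWords; the names docs, stop_words, pattern and cleaned, and the three-word stopword list, are made up for illustration.

```r
# Base R only; no packages needed. Illustrative sketch, not tm's implementation.
docs <- c("the rain in Spain", "falls mainly on the plain")

# Hypothetical short stopword list (tm's stopwords("english") is much longer)
stop_words <- c("the", "in", "on")

# \\b matches a word boundary, so "the" is matched at the start of a string
# too, while the "in" inside "rain" or "plain" is left alone
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")

# Lowercase first, drop the stopwords, then tidy up the leftover whitespace
cleaned <- gsub(pattern, "", tolower(docs), perl = TRUE)
cleaned <- gsub("^ +| +$", "", gsub(" +", " ", cleaned))

print(cleaned)
```

With these two sample strings, cleaned should hold "rain spain" and "falls mainly plain", which gives you something to compare the tm term list against.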

 

 

From: Mark Kimpel [mailto:mwkim...@gmail.com] 
Sent: Thursday, November 12, 2009 11:30 AM
To: r-help@r-project.org; feine...@logic.at; Sam Thomas
Subject: package "tm" fails to remove "the" with remove stopwords

 

I am using code that previously worked to remove stopwords with package
"tm", but it no longer removes "the", even when I add "the" to the
stopword list manually. The package has undergone extensive redevelopment,
including changes to the function syntax, so perhaps I am just missing
something.

 

Please see my simple example, output, and sessionInfo() below.

 

Thanks!

Mark

 

require(tm)

myDocument <- c("the rain in Spain", "falls mainly on the plain",
                "jack and jill ran up the hill", "to fetch a pail of water")
text.corp <- Corpus(VectorSource(myDocument))

#########################
text.corp <- tm_map(text.corp, stripWhitespace)
text.corp <- tm_map(text.corp, removeNumbers)
text.corp <- tm_map(text.corp, removePunctuation)
## text.corp <- tm_map(text.corp, stemDocument)
text.corp <- tm_map(text.corp, removeWords, c("the", stopwords("english")))

dtm <- DocumentTermMatrix(text.corp)
dtm

dtm.mat <- as.matrix(dtm)
dtm.mat

 

> dtm.mat
    Terms
Docs falls fetch hill jack jill mainly pail plain rain ran spain the water
   1     0     0    0    0    0      0    0     0    1   0     1   1     0
   2     1     0    0    0    0      1    0     1    0   0     0   0     0
   3     0     0    1    1    1      0    0     0    0   1     0   0     0
   4     0     1    0    0    0      0    1     0    0   0     0   0     1

 

R version 2.10.0 Patched (2009-10-27 r50222)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] chron_2.3-33 RWeka_0.3-23 tm_0.5-1

loaded via a namespace (and not attached):
[1] grid_2.10.0  rJava_0.8-1  slam_0.1-6   tools_2.10.0

 

 

Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN  46074

(317) 490-5129 Work, & Mobile & VoiceMail
(317) 399-1219 Skype No Voicemail please

