On Thursday, 10 May 2012 at 17:12 -0700, Triss.Ashton wrote:
> Alekseiy, I tried your recommendation with several variations. It still does
> not run. I think the problem has to do with R2.15 and the refreshed TM
> package.

It works here with R 2.15.0 and tm 0.5-7.2 (development version), with all
other relevant packages at the same versions as yours (but on 64-bit Linux).
So that might not be the problem.
I'm using the docs example as a test:

data("crude")
crude[[1]]
stemDocument(crude[[1]])

> Everything runs under R2.10 with the following code:
>
> a <- Corpus(VectorSource(df$text)) # create corpus object
> a <- tm_map(a, removePunctuation)
> a <- tm_map(a, removeNumbers)
> a <- tm_map(a, removeWords, stopwords("english"))
> a <- tm_map(a, stripWhitespace)
> a <- tm_map(a, stemDocument, language = "english")

Let's focus on the example from the docs, since it's simple. In any case, your
example is not reproducible, since you do not provide the original data.

>
> This same code ran on R2.15 results in:
> 1. the removeWords working sometimes, and sometimes not.
> 2. and stemDocuments absolutely not working.
>
> Both error out. removeWords always stops reading in the stopword list on
> the same line number (I have added and subtracted words - no difference) -
> session info is below:
>
> > a <- tm_map(a, removeWords, stopwords("english"))
>
> Error in gsub(sprintf("\\b(%s)\\b", paste(words, collapse = "|")), "", :
> invalid regular expression
>
> '\b(a|about|above|across|after|again|against|all|almost|alone|along|already|also|although|always|am|among|an|and|another|any|anybody|anyone|anything|anywhere|are|area|areas|aren't|around|as|ask|asked|asking|asks|at|away|b|back|backed|backing|backs|be|became|because|become|becomes|been|before|began|behind|being|beings|below|best|better|between|big|both|but|by|c|came|can|cannot|can't|case|cases|certain|certainly|clear|clearly|come|could|couldn't|d|did|didn't|differ|different|differently|do|does|doesn't|doing|done|don't|down|downed|downing|downs|during|e|each|early|either|end|ended|ending|ends|enough|even|evenly|ever|every|everybody|everyone|everything|everywhere|f|face|faces|fact|facts|far|felt|few|find|finds|first|for|four|from|full|fully|further|furthered|furthering|furthers|g|gave|general|generally|get|gets|give|given|gives|go|going|good|goods|got|great|greater|greatest|group|grouped|grouping|groups|h|had|hadn't|has|hasn't|have|haven't|having|he|he
>
>
> > a <- tm_map(a, stemDocument, language = "english")
> Error in .jnew(name) : java.lang.ClassNotFoundException

This error suggests you should reconfigure Java. Have you tried reinstalling
rJava, Snowball, RWekajars and RWeka?
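As for the removeWords error quoted above: I do not know whether the length of
the pattern is what triggers "invalid regular expression" on your machine, but
one way to test that hypothesis is to build several short patterns instead of
one long alternation. Below is a minimal base-R sketch. The helper
remove_words_chunked() and its chunk_size argument are hypothetical names of
mine, not part of tm, and using perl = TRUE is my own choice rather than
something taken from the tm source.

```r
# Hypothetical helper (not part of tm): remove whole words from a character
# vector using several short alternation patterns instead of one long one.
remove_words_chunked <- function(x, words, chunk_size = 50L) {
  # Split the word list into groups of at most chunk_size entries.
  groups <- split(words, ceiling(seq_along(words) / chunk_size))
  for (g in groups) {
    # Same pattern shape as in the error message: \b(w1|w2|...)\b
    pattern <- sprintf("\\b(%s)\\b", paste(g, collapse = "|"))
    # perl = TRUE is an assumption: it switches to the PCRE engine.
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x
}

# Small illustrative word list and text (not your data).
words <- c("a", "about", "the", "it", "its", "had", "for", "by")
txt <- paste("Diamond Shamrock Corp said that effective today it had cut",
             "its contract prices for crude oil by 1.50 dlrs a barrel.")
out <- remove_words_chunked(txt, words, chunk_size = 3L)
print(out)
```

If this runs on your full stopword list while the single long pattern fails,
that would point at the regular expression rather than at tm itself. Note that
the removed words leave their surrounding spaces behind, so stripWhitespace is
still needed afterwards.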
> SessionInfo:
>
> > sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats4 grid stats graphics grDevices utils datasets
> [8] methods base
>
> other attached packages:
> [1] topicmodels_0.1-5 slam_0.1-23 modeltools_0.2-19 lasso2_1.2-12
> [5] pvclust_1.2-2 stringr_0.6 plyr_1.7.1 Snowball_0.0-8
> [9] rJava_0.9-3 ggplot2_0.9.0 tm_0.5-7.1 twitteR_0.99.19
> [13] rjson_0.2.8 RCurl_1.91-1.1 bitops_1.0-4.1
>
> loaded via a namespace (and not attached):
> [1] colorspace_1.1-1 dichromat_1.2-4 digest_0.5.2 MASS_7.3-17
> [5] memoise_0.1 munsell_0.3 proto_0.3-9.2 RColorBrewer_1.0-5
> [9] reshape2_1.2.1 RWeka_0.4-11 RWekajars_3.7.5-1 scales_0.2.0
>
>
> Hi Triss,
>
> If you need to stem just one text in the Corpus use a[[n]]<-stemDocument
>
> Best,
> -Alex
> ________________________________________
> From: r-help-bounces@ [r-help-bounces@] on behalf of Triss.Ashton
> [triss.ashton@]
> Sent: 02 May 2012 21:09
> To: r-help@
> Subject: Re: [R] Help with stemDocument
>
> I am having a problem with stemDocuments also. I can make it work by moving
> the data into a Corpus by using:
>
> > a <- Corpus(VectorSource(df$text)) # create corpus object
> > a <- tm_map(a, stemDocument, language = "english")
>
> but it is horribly slow. I want to stem outside the Corpus object like:
>
> >df$text <- stemDocument(df$text, language = "english")
>
> but it returns the original text.
>
> In fact, using the example in the tm package documentation does not work
> either:
>
> > data("crude")
> > crude[[1]]
> Diamond Shamrock Corp said that
> effective today it had cut its contract prices for crude oil by
> 1.50 dlrs a barrel.
> The reduction brings its posted price for West Texas
> Intermediate to 16.00 dlrs a barrel, the copany said.
> "The price reduction today was made in the light of falling
> oil product prices and a weak crude oil market," a company
> spokeswoman said.
> Diamond is the latest in a line of U.S. oil companies that
> have cut its contract, or posted, prices over the last two days
> citing weak oil markets.
> Reuter
> > stemDocument(crude[[1]], language = "english") # specify language
> Diamond Shamrock Corp said that
> effective today it had cut its contract prices for crude oil by
> 1.50 dlrs a barrel.
> The reduction brings its posted price for West Texas
> Intermediate to 16.00 dlrs a barrel, the copany said.
> "The price reduction today was made in the light of falling
> oil product prices and a weak crude oil market," a company
> spokeswoman said.
> Diamond is the latest in a line of U.S. oil companies that
> have cut its contract, or posted, prices over the last two days
> citing weak oil markets.
> Reuter
> > stemDocument(crude[[1]]) # language not specified
> Diamond Shamrock Corp said that
> effective today it had cut its contract prices for crude oil by
> 1.50 dlrs a barrel.
> The reduction brings its posted price for West Texas
> Intermediate to 16.00 dlrs a barrel, the copany said.
> "The price reduction today was made in the light of falling
> oil product prices and a weak crude oil market," a company
> spokeswoman said.
> Diamond is the latest in a line of U.S. oil companies that
> have cut its contract, or posted, prices over the last two days
> citing weak oil markets.
> Reuter
>
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Help-with-stemDocument-tp4554523p4604022.html
> Sent from the R help mailing list archive at Nabble.com.
>
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Help-with-stemDocument-tp4554523p4625085.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.