Thanks Milan, it is running now. As you suggested, part of the problem was the packages. Although I had just installed RWeka, Snowball and the like, they were out of date, so updating them fixed stemDocument. As for removeWords, that began working once I cut my data in half. Apparently there are some memory management issues I have yet to figure out. Thanks again for the help.
Triss

Milan Bouchet-Valat wrote
>
> On Thursday 10 May 2012 at 17:12 -0700, Triss.Ashton wrote:
>> Alekseiy, I tried your recommendation with several variations. It still
>> does not run. I think the problem has to do with R2.15 and the refreshed
>> TM package.
> It works here with R 2.15.0 and tm 0.5-7.2 (development version), all
> other relevant packages of the same version as you (but on Linux 64
> bits). So it might not be the problem.
>
> I'm using the docs example as a test:
> data("crude")
> crude[[1]]
> stemDocument(crude[[1]])
>
>> Everything runs under R2.10 with the following code:
>>
>> a <- Corpus(VectorSource(df$text))   # create corpus object
>> a <- tm_map(a, removePunctuation)
>> a <- tm_map(a, removeNumbers)
>> a <- tm_map(a, removeWords, stopwords("english"))
>> a <- tm_map(a, stripWhitespace)
>> a <- tm_map(a, stemDocument, language = "english")
> Let's focus on the example from the docs, since it's simple. Anyway, your
> example is not reproducible since you do not provide the original data.
>
>>
>> This same code run on R2.15 results in:
>> 1. the removeWords working sometimes, and sometimes not.
>> 2. and stemDocuments absolutely not working.
>>
>> Both error out.
>> removeWords always stops reading in the stopword list on
>> the same line number (I have added and subtracted words - no difference)
>> - session info is below:
>>
>> > a <- tm_map(a, removeWords, stopwords("english"))
>>
>> Error in gsub(sprintf("\\b(%s)\\b", paste(words, collapse = "|")), "", :
>>   invalid regular expression
>> '\b(a|about|above|across|after|again|against|all|almost|alone|along|already|also|although|always|am|among|an|and|another|any|anybody|anyone|anything|anywhere|are|area|areas|aren't|around|as|ask|asked|asking|asks|at|away|b|back|backed|backing|backs|be|became|because|become|becomes|been|before|began|behind|being|beings|below|best|better|between|big|both|but|by|c|came|can|cannot|can't|case|cases|certain|certainly|clear|clearly|come|could|couldn't|d|did|didn't|differ|different|differently|do|does|doesn't|doing|done|don't|down|downed|downing|downs|during|e|each|early|either|end|ended|ending|ends|enough|even|evenly|ever|every|everybody|everyone|everything|everywhere|f|face|faces|fact|facts|far|felt|few|find|finds|first|for|four|from|full|fully|further|furthered|furthering|furthers|g|gave|general|generally|get|gets|give|given|gives|go|going|good|goods|got|great|greater|greatest|group|grouped|grouping|groups|h|had|hadn't|has|hasn't|have|haven't|having|he|he
>>
>>
>> > a <- tm_map(a, stemDocument, language = "english")
>> Error in .jnew(name) : java.lang.ClassNotFoundException
> This error suggests you should reconfigure Java. Have you tried
> reinstalling rJava, Snowball, RWekajars and RWeka?
>
>> SessionInfo:
>>
>> > sessionInfo()
>> R version 2.15.0 (2012-03-30)
>> Platform: i386-pc-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_United States.1252
>> [2] LC_CTYPE=English_United States.1252
>> [3] LC_MONETARY=English_United States.1252
>> [4] LC_NUMERIC=C
>> [5] LC_TIME=English_United States.1252
>>
>> attached base packages:
>> [1] stats4    grid      stats     graphics  grDevices utils     datasets
>> [8] methods   base
>>
>> other attached packages:
>>  [1] topicmodels_0.1-5 slam_0.1-23       modeltools_0.2-19 lasso2_1.2-12
>>  [5] pvclust_1.2-2     stringr_0.6       plyr_1.7.1        Snowball_0.0-8
>>  [9] rJava_0.9-3       ggplot2_0.9.0     tm_0.5-7.1        twitteR_0.99.19
>> [13] rjson_0.2.8       RCurl_1.91-1.1    bitops_1.0-4.1
>>
>> loaded via a namespace (and not attached):
>>  [1] colorspace_1.1-1   dichromat_1.2-4    digest_0.5.2       MASS_7.3-17
>>  [5] memoise_0.1        munsell_0.3        proto_0.3-9.2      RColorBrewer_1.0-5
>>  [9] reshape2_1.2.1     RWeka_0.4-11       RWekajars_3.7.5-1  scales_0.2.0
>>
>
>> Hi Triss,
>>
>> If you need to stem just one text in the Corpus use a[[n]]<-stemDocument
>>
>> Best,
>> -Alex
>> ________________________________________
>> From: r-help-bounces@ [r-help-bounces@] on behalf of Triss.Ashton
>> [triss.ashton@]
>> Sent: 02 May 2012 21:09
>> To: r-help@
>> Subject: Re: [R] Help with stemDocument
>>
>> I am having a problem with stemDocuments also. I can make it work by
>> moving the data into a Corpus by using:
>>
>> > a <- Corpus(VectorSource(df$text))   # create corpus object
>> > a <- tm_map(a, stemDocument, language = "english")
>>
>> but it is horribly slow. I want to stem outside the Corpus object like:
>>
>> > df$text <- stemDocument(df$text, language = "english")
>>
>> but it returns the original text.
>>
>> In fact, using the example in the tm package documentation does not work
>> either:
>>
>> > data("crude")
>> > crude[[1]]
>> Diamond Shamrock Corp said that
>> effective today it had cut its contract prices for crude oil by
>> 1.50 dlrs a barrel.
>> The reduction brings its posted price for West Texas
>> Intermediate to 16.00 dlrs a barrel, the copany said.
>> "The price reduction today was made in the light of falling
>> oil product prices and a weak crude oil market," a company
>> spokeswoman said.
>> Diamond is the latest in a line of U.S. oil companies that
>> have cut its contract, or posted, prices over the last two days
>> citing weak oil markets.
>> Reuter
>> > stemDocument(crude[[1]], language = "english")   # specify language
>> Diamond Shamrock Corp said that
>> effective today it had cut its contract prices for crude oil by
>> 1.50 dlrs a barrel.
>> The reduction brings its posted price for West Texas
>> Intermediate to 16.00 dlrs a barrel, the copany said.
>> "The price reduction today was made in the light of falling
>> oil product prices and a weak crude oil market," a company
>> spokeswoman said.
>> Diamond is the latest in a line of U.S. oil companies that
>> have cut its contract, or posted, prices over the last two days
>> citing weak oil markets.
>> Reuter
>> > stemDocument(crude[[1]])   # language not specified
>> Diamond Shamrock Corp said that
>> effective today it had cut its contract prices for crude oil by
>> 1.50 dlrs a barrel.
>> The reduction brings its posted price for West Texas
>> Intermediate to 16.00 dlrs a barrel, the copany said.
>> "The price reduction today was made in the light of falling
>> oil product prices and a weak crude oil market," a company
>> spokeswoman said.
>> Diamond is the latest in a line of U.S. oil companies that
>> have cut its contract, or posted, prices over the last two days
>> citing weak oil markets.
>> Reuter
>> >
>>
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/Help-with-stemDocument-tp4554523p4604022.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/Help-with-stemDocument-tp4554523p4625085.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help@ mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

--
View this message in context: http://r.789695.n4.nabble.com/Help-with-stemDocument-tp4554523p4630523.html
Sent from the R help mailing list archive at Nabble.com.
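
A note on the removeWords error quoted in this thread: the message shows the whole stopword list being pasted into a single regular expression of the form \b(word1|word2|...)\b. If the failure comes from that one pattern growing too large to compile (an assumption; the thread does not confirm the cause), one possible workaround is to apply the stopwords in smaller batches. The sketch below uses only base R. The names remove_words_chunked and chunk_size are hypothetical and not part of tm, and the use of perl = TRUE is my own choice rather than something taken from the thread.

```r
# Hypothetical helper (not part of tm): remove words in batches so that
# no single regular expression contains the whole word list.
remove_words_chunked <- function(x, words, chunk_size = 100L) {
  # Split the word list into groups of at most chunk_size words.
  groups <- split(words, ceiling(seq_along(words) / chunk_size))
  for (g in groups) {
    # Same pattern shape as in the quoted error, but for one batch only.
    pattern <- sprintf("\\b(%s)\\b", paste(g, collapse = "|"))
    # perl = TRUE is an assumption of this sketch, not taken from the thread.
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x
}

# Small illustrative input (made-up text, not the poster's data).
txt <- c("the price of crude oil is falling", "a weak market")
cleaned <- remove_words_chunked(txt, c("the", "of", "is", "a"), chunk_size = 2L)
print(cleaned)
```

Removed words leave their surrounding spaces behind, so a whitespace clean-up such as the stripWhitespace step in the pipeline quoted above would still be needed afterwards.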