[R] Use generic functions, e.g. print, without UseMethod?
Hello,

I have defined a function 'equations(...)' which returns an object with class 'equations'. I also defined a function 'print.equations' which prints the object. But I did not use 'equations <- function(x, ...) UseMethod("equations")'.

Two questions:

1.) Is this a sensible approach?
2.) If yes, are there any pitfalls I could run into later?

Thanks
Sigbert

--
https://hu.berlin/sk
https://www.stat.de/faqs
https://hu.berlin/mmstat
https://hu.berlin/mmstat-ar

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Use generic functions, e.g. print, without UseMethod?
On Fri, 11 Aug 2023 09:20:03 +0200 Sigbert Klinke wrote:

> I have defined a function 'equations(...)' which returns an object
> with class 'equations'.

> But I did not use 'equations <- function(x, ...)
> UseMethod("equations")'. Two questions:
>
> 1.) Is this a sensible approach?

Quite. If there is little reason for your constructor to be generic (i.e. there is only one way to construct "equations" objects), it can stay an ordinary R function. lm() works the same way, for example, and so do many statistical tests and contributed model functions.

> 2.) If yes, are there any pitfalls I could run into later?

If it later turns out that you need S3 dispatch on the constructor too, you will need to take care to design its formals to avoid breaking compatibility with the old code. Ideally, the generic should take (x, ...), with the first argument determining the method that will be called. If that would conflict with the already-existing code, the generic can have a different signature and give a different object= argument to UseMethod(), but the methods will have to follow the signature of the generic.

--
Best regards,
Ivan
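A minimal sketch of the approach discussed above: an ordinary (non-generic) constructor plus a method for the existing print() generic. The thread does not show what an 'equations' object contains, so the list layout with an 'eqs' element is purely an assumption for illustration.

```r
# Ordinary constructor, not a generic. The internal layout (a list with
# an 'eqs' character vector) is hypothetical.
equations <- function(...) {
  eqs <- c(...)
  stopifnot(is.character(eqs))
  structure(list(eqs = eqs), class = "equations")
}

# Method for the existing generic print(). No UseMethod() call is needed
# here because print() already dispatches on class(x).
print.equations <- function(x, ...) {
  cat("<equations:", length(x$eqs), ">\n")
  writeLines(paste0("  ", seq_along(x$eqs), ": ", x$eqs))
  invisible(x)
}

e <- equations("y = a + b*x", "z = y^2")
print(e)
```

Typing `e` at the prompt calls the same method through autoprinting. Returning the object invisibly follows the usual convention for print methods.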
[R] Puzzled by results from base::rank()
I understand that the default ties.method is "average". Here is what I get, expanding a bit on the help page example. Running R 4.3.1 on Ubuntu 22.04.2.

> x2 <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5)
> rank(x2)
 [1]  4.5  1.5  6.0  1.5  8.0 11.0  3.0 10.0  8.0  4.5  8.0

OK so the ties, each with two members, are ranked to their mean.

So now I turn one tie from a twin to a triplet:

> x3 <- c(x2, 3)
> rank(x3)
 [1]  5.0  1.5  7.0  1.5  9.0 12.0  3.0 11.0  9.0  5.0  9.0  5.0
> sprintf("%4.3f", rank(x3))
 [1] "5.000"  "1.500"  "7.000"  "1.500"  "9.000"  "12.000" "3.000"  "11.000"
 [9] "9.000"  "5.000"  "9.000"  "5.000"

The doublet is still given the mean of the values but the triplet is rounded up. What am I missing here?!

TIA,

Chris

--
Chris Evans (he/him)
Visiting Professor, UDLA, Quito, Ecuador & Honorary Professor, University of Roehampton, London, UK.
Work web site: https://www.psyctc.org/psyctc/
CORE site: http://www.coresystemtrust.org.uk/
Personal site: https://www.psyctc.org/pelerinage2016/
Re: [R] Use generic functions, e.g. print, without UseMethod?
At 08:20 on 11/08/2023, Sigbert Klinke wrote:
> Hello,
>
> I have defined a function 'equations(...)' which returns an object with
> class 'equations'. I also defined a function 'print.equations' which
> prints the object. But I did not use 'equations <- function(x, ...)
> UseMethod("equations")'. Two questions:
>
> 1.) Is this a sensible approach?
> 2.) If yes, are there any pitfalls I could run into later?
>
> Thanks
> Sigbert

Hello,

You have to ask yourself what kind of objects you are passing to 'equations(...)'. Do you need to have

'equations.double(...)'
'equations.character(...)'
'equations.formula(...)'
'equations.matrix(...)'
[...]

specifically written for objects of class

numeric
character
formula
matrix
[...]

respectively? These methods would act on the respective class, process those objects somewhat differently because they are of different classes, and output an object of class "equations". (If so, it is recommended to write an 'equations.default(...)' too.)

Methods such as print.equations or summary.equations are written when you want your new class to have functionality that its users are already familiar with. If, for instance, autoprint is on, as it frequently is, users can see their "equations" object by typing its name at a prompt. print.equations would display the object in a way relevant to that new class.

But this does not mean that the function that *creates* the object needs to be generic. You only need a new generic if you want methods that process inputs of different classes in ways specific to those classes.

Hope this helps,

Rui Barradas
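If the constructor does need to accept inputs of several classes, the generic version described above might look like the following sketch. The method bodies and the list layout with an 'eqs' element are assumptions, since the thread does not show the real implementation.

```r
# Generic constructor: dispatches on the class of the first argument
equations <- function(x, ...) UseMethod("equations")

# Method for character input: store the strings as they are
# (the 'eqs' element is a hypothetical layout)
equations.character <- function(x, ...) {
  structure(list(eqs = x), class = "equations")
}

# Method for formula input: convert to text, then reuse the character method
equations.formula <- function(x, ...) {
  equations(paste(deparse(x), collapse = " "), ...)
}

# Fallback giving an informative error for unsupported classes
equations.default <- function(x, ...) {
  stop("no 'equations' method for class ", class(x)[1L])
}

a <- equations("y = a + b*x")
b <- equations(y ~ a + b * x)
class(b)
```

Every method keeps the (x, ...) signature of the generic, which is the compatibility constraint mentioned earlier in the thread.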
Re: [R] Puzzled by results from base::rank()
Dear Chris,

the members of the triplet would be ranked 4, 5 and 6 (in your example), so the *mean of their ranks* is correctly 5.

For any set of k tied values the ranks of its elements are averaged (and assigned to each of its k members).

Hth -- Gerrit

---------------------------------------------------------------------
Dr. Gerrit Eichner                   Mathematical Institute, Room 215
gerrit.eich...@math.uni-giessen.de   Justus-Liebig-University Giessen
Tel: +49-(0)641-99-32104             Arndtstr. 2, 35392 Giessen, Germany
http://www.uni-giessen.de/eichner
---------------------------------------------------------------------

On 11.08.2023 at 09:54, Chris Evans via R-help wrote:
> [...]
> The doublet is still given the mean of the values but the triplet is
> rounded up. What am I missing here?!
>
> TIA,
>
> Chris
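The averaging rule described above can be checked by hand. This sketch first breaks ties by order of appearance to get the positions each element would occupy, then averages those positions within each group of equal values. The variable names `pos` and `by_hand` are only illustrative.

```r
x3 <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 3)

# Positions in sorted order, with ties broken by order of appearance
pos <- as.numeric(rank(x3, ties.method = "first"))

# Average the positions within each group of equal values
# (ave() applies mean() within groups by default)
by_hand <- ave(pos, x3)

pos[x3 == 3]      # positions occupied by the three 3s
by_hand[x3 == 3]  # their common average rank

# The hand computation agrees with the default ties.method = "average"
all.equal(by_hand, rank(x3))
```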
Re: [R] Puzzled by results from base::rank()
I have entered values into Excel, and sorted them. I am assuming you are asking why the value 3 in x2 is ranked 4.5 whereas in x3 it has a rank of 5.

x2 looks like this:

Value  Rank  Order
1      1.5   1
1      1.5   2
2      3     3
3      4.5   4
3      4.5   5
4      6     6
5      8     7
5      8     8
5      8     9
6      10    10
9      11    11

The average of 4 and 5 is 4.5.

For x3 we have:

Value  Rank  Order
1      1.5   1
1      1.5   2
2      3     3
3      5     4
3      5     5
3      5     6
4      7     7
5      9     8
5      9     9
5      9     10
6      11    11
9      12    12

The ranks of the threes are 4, 5, and 6 and the average is 5.

For any set of values, adding one value that is the same as an existing value will always increase the rank of that value. It has not been rounded up, though it may look that way in the example. If you add another 3 to the data the rank will increase to 5.5, and adding another three will give a rank of 6. Each additional 3 will boost the rank by 0.5.

You can get a different result if you change a value. If there is a mistake in the data and I discover that the second 1 in x2 should be a 3, then the rank for 3 is 4 and it looks like I have rounded down. If the mistake happened for a value greater than 3 then it would again look like I had rounded up. However, the appearance of "rounding" is an illusion easily seen through if you expand your example to generalize the outcome.

Tim

-----Original Message-----
From: R-help On Behalf Of Gerrit Eichner
Sent: Friday, August 11, 2023 4:32 AM
To: r-help@r-project.org
Subject: Re: [R] Puzzled by results from base::rank()

[External Email]

Dear Chris,

the members of the triplet would be ranked 4, 5 and 6 (in your example), so the *mean of their ranks* is correctly 5.

For any set of k tied values the ranks of its elements are averaged (and assigned to each of its k members).

Hth -- Gerrit

[...]
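The effect of adding further copies of a tied value, and of correcting a value, can be reproduced directly. This is a small sketch using the x2 vector from the thread; the helper name `rank_of_3` is made up for illustration.

```r
x2 <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5)

# Rank of the value 3 after appending k extra copies of 3 to x2
rank_of_3 <- function(k) rank(c(x2, rep(3, k)))[1]
sapply(0:3, rank_of_3)

# Changing the second 1 into a 3 moves the tie group down instead
x2_fixed <- x2
x2_fixed[4] <- 3
rank(x2_fixed)[1]
```

A tie group of k values whose lowest position is s gets the common rank s + (k - 1) / 2, so each extra copy adds 0.5 without any rounding being involved.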
Re: [R] group consecutive dates in a row
Thank you for your hints. All of them have been useful, and you solved my problem.
I understood the role of rle, but I think that for my task its use is not essential.
I will pay more attention to looking for the existing documentation.

Thank you again
Stefano

         (oo)
--oOO--( )--OOo--
Stefano Sofia PhD
Civil Protection - Marche Region - Italy
Meteo Section
Snow Section
Via del Colle Ameno 5
60126 Torrette di Ancona, Ancona (AN)
Uff: +39 071 806 7743
E-mail: stefano.so...@regione.marche.it
---Oo-oO

From: Gabor Grothendieck
Sent: Monday, 7 August 2023 20:30
To: Stefano Sofia
Cc: r-help@R-project.org
Subject: Re: [R] group consecutive dates in a row

It is best to use Date, rather than POSIXct, class if there are no times. Use the cumsum expression shown to group the dates and then summarize each group. We assume that the dates are already sorted in ascending order.

  library(dplyr)

  mydf <- data.frame(date = as.Date(c("2012-02-05", "2012-02-06",
    "2012-02-07", "2012-02-13", "2012-02-21")))

  mydf %>%
    group_by(grp = cumsum(c(0, diff(date)) > 1)) %>%
    summarize(start = first(date), end = last(date)) %>%
    ungroup %>%
    select(-grp)
  ## # A tibble: 3 × 2
  ##   start      end
  ##   <date>     <date>
  ## 1 2012-02-05 2012-02-07
  ## 2 2012-02-13 2012-02-13
  ## 3 2012-02-21 2012-02-21

or with only base R:

  smrz <- function(x) with(x, data.frame(start = min(date), end = max(date)))
  do.call("rbind", by(mydf, cumsum(c(0, diff(mydf$date)) > 1), smrz))
  ##        start        end
  ## 0 2012-02-05 2012-02-07
  ## 1 2012-02-13 2012-02-13
  ## 2 2012-02-21 2012-02-21

On Mon, Aug 7, 2023 at 12:42 PM Stefano Sofia wrote:
>
> Dear R users,
>
> I have a data frame with a single column of POSIXct elements, like
>
> mydf <- data.frame(data_POSIX=as.POSIXct(c("2012-02-05", "2012-02-06",
> "2012-02-07", "2012-02-13", "2012-02-21"), format = "%Y-%m-%d",
> tz="Etc/GMT-1"))
>
> I need to transform it into a two-column data frame where I can get rid of
> consecutive dates. It should appear like
>
> data_POSIX_init  data_POSIX_fin
> 2012-02-05       2012-02-07
> 2012-02-13       NA
> 2012-02-21       NA
>
> I started with two "while cycles" and so on, but this is not an efficient way
> to do it.
>
> Could you please give me a hint on how to proceed?
>
> Thank you for your precious attention and help
>
> Stefano

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
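The original request asked for NA in the second column when a run contains a single date. A base R sketch of that variant follows, starting from the POSIXct column in the question. The output column names and the use of duplicated() to find run boundaries are choices made here, not something proposed in the thread. The input is assumed to be sorted in ascending order.

```r
px <- as.POSIXct(c("2012-02-05", "2012-02-06", "2012-02-07",
                   "2012-02-13", "2012-02-21"),
                 format = "%Y-%m-%d", tz = "Etc/GMT-1")

# Convert in the same time zone, otherwise local midnight may fall on
# the previous day in UTC
dates <- as.Date(px, tz = "Etc/GMT-1")

# A new run starts wherever the gap to the previous date exceeds one day
grp <- cumsum(c(TRUE, diff(dates) > 1))

# First and last element of each run
start <- dates[!duplicated(grp)]
end   <- dates[!duplicated(grp, fromLast = TRUE)]

# Runs of length one get NA as their end date, as in the requested layout
end[end == start] <- NA

res <- data.frame(data_init = start, data_fin = end)
res
```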
[R] Different TFIDF settings in test set prevent testing model
Hello, I'd be very grateful for your help.

I randomly separated a .csv file with 1287 documents 75%/25% into 2 csv files, one for training an algorithm and the other for testing the algorithm. I applied similar preprocessing, including TFIDF transformation, to both sets, but R won't let me make predictions on the test set due to a different TFIDF matrix.
I get the error message:

Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type "nmatrix.27118" was supplied

I'd greatly appreciate a suggestion to overcome this problem.
Thanks!

Here's my R code:

> library(tidyverse)
> library(tidytext)
> library(caret)
> library(kernlab)
> library(tokenizers)
> library(tm)
> library(e1071)

***LOAD TRAINING SET/959 rows with text in column1 and yes/no in column2 (labelled M2)
> url <- "D:/test/M2_75.csv"
> d <- read_csv(url)
***CREATE TEXT CORPUS FROM TEXT COLUMN
> train_text_corpus <- Corpus(VectorSource(d$Text))
***DEFINE TOKENS FOR EACH DOCUMENT IN CORPUS AND COMBINE THEM
> tokenize_document <- function(doc) {
+   doc_tokens <- unlist(tokenize_words(doc))
+   doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
+   doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
+   all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
+   return(all_tokens)
+ }
***APPLY TOKENS TO DOCUMENTS
> all_train_tokens <- lapply(train_text_corpus, tokenize_document)
***CREATE A DTM FROM THE TOKENS
> train_text_dtm <- DocumentTermMatrix(Corpus(VectorSource(all_train_tokens)))
***TRANSFORM THE DTM INTO A TF-IDF MATRIX
> train_text_tfidf <- weightTfIdf(train_text_dtm)
***CREATE A NEW DATA FRAME WITH M2 COLUMN FROM ORIGINAL DATA
> trainData <- data.frame(M2 = d$M2)
***ADD NEW TFIDF transformed TEXT COLUMN NEXT TO DATA FRAME
> trainData$text_tfidf <- I(as.matrix(train_text_tfidf))
***DEFINE THE ML MODEL
> ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2,
+   classProbs = TRUE)
***TRAIN SVM
> model_svmRadial <- train(M2 ~ ., data = trainData, method = "svmRadial",
+   trControl = ctrl)
***SAVE SVM
> saveRDS(model_svmRadial, file = "D:/SML/model_M23_svmRadial_UP.RDS")

R code on my test set, which didn't work at the last step:

***LOAD TEST SET/ 309 rows with text in column1 and yes/no in column2 (labelled M2)
> url <- "D:/test/M2_25.csv"
> d <- read_csv(url)
***CREATE TEXT CORPUS FROM TEXT COLUMN
> test_text_corpus <- Corpus(VectorSource(d$Text))
***DEFINE TOKENS FOR EACH DOCUMENT IN CORPUS AND COMBINE THEM
> tokenize_document <- function(doc) {
+   doc_tokens <- unlist(tokenize_words(doc))
+   doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
+   doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
+   all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
+   return(all_tokens)
+ }
***APPLY TOKEN TO DOCUMENTS
> all_test_tokens <- lapply(test_text_corpus, tokenize_document)
***CREATE A DTM FROM THE TOKENS
> test_text_dtm <- DocumentTermMatrix(Corpus(VectorSource(all_test_tokens)))
***TRANSFORM THE DTM INTO A TF-IDF MATRIX
> test_text_tfidf <- weightTfIdf(test_text_dtm)
***CREATE A NEW DATA WITH M2 COLUMN FROM ORIGINAL TEST DATA
> testData <- data.frame(M2 = d$M2)
***ADD NEW TFIDF transformed TEXT COLUMN NEXT TO TEST DATA
> testData$text_tfidf <- I(as.matrix(test_text_tfidf))
***LOAD OLD MODEL
model_svmRadial <- readRDS("D:/SML/model_M2_75_svmRadial.RDS")
***MAKE PREDICTIONS
predictions <- predict(model_svmRadial, newdata = testData)

This last line produces the error message:

Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type "nmatrix.27118" was supplied

Please help. Thanks!
Re: [R] Different TFIDF settings in test set prevent testing model
I know nothing about tf, etc., but can you not simply read in the whole file into R and then randomly split using R? The training and test sets would simply be defined by a single random sample of subscripts which is either chosen or not.

e.g. (simplified example -- you would be subsetting the rows of your full dataset):

> x <- 1:10
> samp <- sort(sample(x, 5))
> x[samp] ## training
[1] 3 4 6 7 8
> x[-samp] ## test
[1]  1  2  5  9 10

Apologies if my ignorance means this can't work.

Cheers,
Bert

On Fri, Aug 11, 2023 at 7:17 AM James C Schopf wrote:
> Hello, I'd be very grateful for your help.
>
> I randomly separated a .csv file with 1287 documents 75%/25% into 2 csv
> files, one for training an algorithm and the other for testing the
> algorithm. I applied similar preprocessing, including TFIDF
> transformation, to both sets, but R won't let me make predictions on the
> test set due to a different TFIDF matrix.
> I get the error message:
>
> Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type
> "nmatrix.27118" was supplied
>
> I'd greatly appreciate a suggestion to overcome this problem.
> Thanks!
>
> [...]
Re: [R] Different TFIDF settings in test set prevent testing model
On Fri, 11 Aug 2023 10:20:27 + James C Schopf wrote:

> > train_text_dtm <-
> > DocumentTermMatrix(Corpus(VectorSource(all_train_tokens)))

> > test_text_dtm <-
> > DocumentTermMatrix(Corpus(VectorSource(all_test_tokens)))

I understand the need to prepare the test dataset separately (e.g. in order to be able to work with text that doesn't exist at the time when the model is trained), but since the model has no representation for tokens it (well, the tokeniser) hasn't seen during the training process, you have to ensure that test_text_dtm references exactly the same tokens as train_text_dtm, in the same order of the columns.

Also, it probably makes sense to reuse the term frequency learned on the training document set; otherwise you may be importance-weighting different tokens than the ones your SVM has learned as important if your test set has a significantly different distribution from that of the training set.

Bert is probably right: with the API given by the tm package, it seems easiest to tokenise and weight document-term matrices first, then split them into the train and test subsets. It may be worth asking the maintainer about applying previously "learned" transformations to new corpora.

--
Best regards,
Ivan
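The two requirements above (same columns in the same order, and weights learned on the training set only) can be illustrated in base R on toy count matrices. This is a sketch under stated assumptions, not the tm package's own mechanism: the data, the helper `tfidf()`, and the weighting formula (term frequency normalised per document, times log2 of documents over document frequency) are chosen here for illustration.

```r
# Toy term-count matrices (rows = documents, columns = terms); made-up data
train_counts <- matrix(c(2, 0, 1,
                         0, 1, 1,
                         1, 1, 0), nrow = 3, byrow = TRUE,
                       dimnames = list(NULL, c("apple", "pear", "plum")))
test_counts <- matrix(c(1, 3, 0,
                        0, 1, 2), nrow = 2, byrow = TRUE,
                      dimnames = list(NULL, c("plum", "kiwi", "apple")))

vocab <- colnames(train_counts)

# Project the test matrix onto the training vocabulary: unseen terms
# ("kiwi") are dropped, terms absent from the test set ("pear") stay zero
aligned <- matrix(0, nrow(test_counts), length(vocab),
                  dimnames = list(NULL, vocab))
shared <- intersect(vocab, colnames(test_counts))
aligned[, shared] <- test_counts[, shared]

# Inverse document frequency learned on the training documents only
idf <- log2(nrow(train_counts) / colSums(train_counts > 0))

# Apply the same weights to any count matrix with the training columns
tfidf <- function(m, idf) {
  tf <- m / pmax(rowSums(m), 1)   # per-document normalisation
  sweep(tf, 2, idf, `*`)
}

train_tfidf <- tfidf(train_counts, idf)
test_tfidf  <- tfidf(aligned, idf)
identical(colnames(train_tfidf), colnames(test_tfidf))
```

Within tm itself, the `dictionary` entry of the `control` list passed to DocumentTermMatrix() is, as far as I know, intended for restricting a new matrix to a given set of terms. It is worth checking ?termFreq before relying on it, and it does not by itself reuse the training IDF weights.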
Re: [R] Different TFIDF settings in test set prevent testing model
Thank you Bert and Ivan,

I was building the SVM model in hopes of applying it to future cases and hoped that the model would be able to deal with new words it hadn't encountered during training. But I tried Bert's suggestion by converting all of the data to tokens, creating a DTM, transforming the whole thing with TF-IDF, and then separating it 75%/25%. But when I began to train the SVM on the training data, R said it needed 26GB for a vector and crashed. I tried again, it crashed again. I don't know why this would happen. I'd just trained 4 SVM models using my previous method without any memory trouble on my 8GB CPU. I also tried, unsuccessfully, to remove the new words from the new test data. Should I try that? Is there a way to stop my system from crashing with the new method?

Thank you for any ideas.

Here is the code I used when I separated the data after converting to tokens and applying TF-IDF:

url <- "D:/test/M2.csv"
data <- read_csv(url)
text_corpus <- Corpus(VectorSource(data$Text))
tokenize_document <- function(doc) {
  doc_tokens <- unlist(tokenize_words(doc))
  doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
  doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
  all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
  return(all_tokens)
}
all_tokens <- lapply(text_corpus, tokenize_document)
text_dtm <- DocumentTermMatrix(Corpus(VectorSource(all_tokens)))
text_tfidf <- weightTfIdf(text_dtm)
processed_data <- data.frame(M2 = data$M2, text_tfidf = as.matrix(text_tfidf))
indexes <- createDataPartition(processed_data$M2, p = 0.75, list = FALSE)
trainData <- processed_data[indexes,]
testData <- processed_data[-indexes,]
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2, classProbs = TRUE)
model_svmRadial <- train(M2 ~ ., data = trainData, method = "svmRadial", trControl = ctrl)

From: Ivan Krylov
Sent: Saturday, August 12, 2023 12:49 AM
To: James C Schopf
Cc: r-help@r-project.org
Subject: Re: [R] Different TFIDF settings in test set prevent testing model

> [...]
> Bert is probably right: with the API given by the tm package, it seems
> easiest to tokenise and weight document-term matrices first, then split
> them into the train and test subsets. It may be worth asking the
> maintainer about applying previously "learned" transformations to new
> corpora.
>
> --
> Best regards,
> Ivan
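The thread does not establish what caused the 26GB allocation, so the following is only a sketch of one common way to reduce the size of a document-term matrix before it is converted to a dense matrix or data frame: dropping terms that occur in very few documents. The toy data, the threshold `min_docs`, and the helper `dense_gib()` are assumptions made for illustration.

```r
# Toy document-term counts: 5 documents x 6 terms (made-up data)
counts <- matrix(c(1, 0, 2, 0, 0, 1,
                   0, 0, 1, 0, 0, 1,
                   1, 1, 0, 0, 0, 2,
                   0, 0, 1, 0, 1, 0,
                   2, 0, 1, 0, 0, 1),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(NULL, paste0("term", 1:6)))

# Number of documents in which each term occurs
doc_freq <- colSums(counts > 0)

# Keep only terms seen in at least min_docs documents (arbitrary threshold)
min_docs <- 2
pruned <- counts[, doc_freq >= min_docs, drop = FALSE]

dim(counts)
dim(pruned)

# Rough size of one dense numeric matrix: 8 bytes per cell, in GiB.
# The figures are the training-set dimensions quoted in the thread.
dense_gib <- function(n_docs, n_terms) n_docs * n_terms * 8 / 2^30
dense_gib(959, 67503)
```

With word, bigram and trigram tokens, most columns are likely to occur in a single document, so a threshold like this may remove a large share of them. tm provides removeSparseTerms() for a similar purpose on DocumentTermMatrix objects. Passing a matrix through caret's x/y interface instead of a formula with one data frame column per term may also reduce overhead, but neither suggestion was tested in the thread.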
[R] geom_smooth
Colleagues,

Here is my reproducible code for a graph using geom_smooth

set.seed(55)
scatter_data <- tibble(x_var = runif(100, min = 0, max = 25)
                       ,y_var = log2(x_var) + rnorm(100))

library(ggplot2)
library(cowplot)

ggplot(scatter_data,aes(x=x_var,y=y_var))+
  geom_point()+
  geom_smooth(se=TRUE,fill="blue",color="black",linetype="dashed")+
  theme_cowplot()

I'd like to add a black boundary around the shaded area. I suspect this can be done with geom_ribbon but I cannot figure this out. Some advice would be welcome.

Thanks!

Thomas Subia
Re: [R] geom_smooth
At 05:17 on 12/08/2023, Thomas Subia via R-help wrote:
> Colleagues,
>
> Here is my reproducible code for a graph using geom_smooth
> set.seed(55)
> scatter_data <- tibble(x_var = runif(100, min = 0, max = 25)
>                        ,y_var = log2(x_var) + rnorm(100))
>
> library(ggplot2)
> library(cowplot)
>
> ggplot(scatter_data,aes(x=x_var,y=y_var))+
>   geom_point()+
>   geom_smooth(se=TRUE,fill="blue",color="black",linetype="dashed")+
>   theme_cowplot()
>
> I'd like to add a black boundary around the shaded area. I suspect this
> can be done with geom_ribbon but I cannot figure this out. Some advice
> would be welcome.
>
> Thanks!
>
> Thomas Subia

Hello,

Here is a solution. You must access the computed variables, which you can do with ggplot_build (see ?ggplot_build). Then pass them in the data argument.

p <- ggplot(scatter_data,aes(x=x_var,y=y_var)) +
  geom_point()+
  geom_smooth(se=TRUE,fill="blue",color="black",linetype="dashed")+
  theme_cowplot()

# this is a data.frame, relevant columns are x, ymin and ymax
fit <- ggplot_build(p)$data[[2]]

p +
  geom_line(data = fit, aes(x, ymin), linetype = "dashed", linewidth = 1) +
  geom_line(data = fit, aes(x, ymax), linetype = "dashed", linewidth = 1)

Hope this helps,

Rui Barradas
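For readers who want to see where the ymin and ymax columns come from, the band can be approximated without ggplot2. This sketch assumes that geom_smooth() falls back to a loess fit with a 95% t-based interval for a data set of this size. That is my understanding of its defaults, so the numbers should be close to, but are not guaranteed to match, what ggplot_build() returns. The names `grid` and `band` are illustrative.

```r
set.seed(55)
x_var <- runif(100, min = 0, max = 25)
y_var <- log2(x_var) + rnorm(100)

# Local regression with the default span and degree
fit <- loess(y_var ~ x_var)

# Evaluate the fit and its standard error on a grid inside the data range
grid <- data.frame(x_var = seq(min(x_var), max(x_var), length.out = 80))
pr <- predict(fit, newdata = grid, se = TRUE)

# Pointwise 95% band from the t distribution
crit <- qt(0.975, pr$df)
band <- data.frame(x    = grid$x_var,
                   y    = pr$fit,
                   ymin = pr$fit - crit * pr$se.fit,
                   ymax = pr$fit + crit * pr$se.fit)

# Base graphics version: points, dashed fit, solid band boundaries
plot(x_var, y_var, pch = 16)
lines(band$x, band$y, lty = 2)
lines(band$x, band$ymin)
lines(band$x, band$ymax)
```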
Re: [R] geom_smooth
G'day Thomas,

On Sat, 12 Aug 2023 04:17:42 + (UTC) Thomas Subia via R-help wrote:

> Here is my reproducible code for a graph using geom_smooth

The call "library(tidyverse)" was missing. :)

> I'd like to add a black boundary around the shaded area. I suspect
> this can be done with geom_ribbon but I cannot figure this out. Some
> advice would be welcome.

This works for me:

ggplot(scatter_data,aes(x=x_var,y=y_var,))+
  geom_point()+
  geom_smooth(se=TRUE,fill="blue",color="black",linetype="dashed") +
  geom_ribbon(stat="smooth", aes(ymin=after_stat(ymin), ymax=after_stat(ymax)),
              fill=NA, color="black")+
  theme_cowplot()

Cheers,

Berwin
Re: [R] geom_smooth
+ geom_ribbon(stat = "smooth",
              se = TRUE,
              alpha = 0, # or, use fill = NA
              colour = "black",
              linetype = "dotted")

Does that work?

On Sat, 12 Aug 2023, 06:12 Rui Barradas wrote:
> [...]
> Hello,
>
> Here is a solution. You must access the computed variables, which you
> can do with ggplot_build (see ?ggplot_build).
> Then pass them in the data argument.
>
> p <- ggplot(scatter_data,aes(x=x_var,y=y_var)) +
>   geom_point()+
>   geom_smooth(se=TRUE,fill="blue",color="black",linetype="dashed")+
>   theme_cowplot()
>
> # this is a data.frame, relevant columns are x, ymin and ymax
> fit <- ggplot_build(p)$data[[2]]
>
> p +
>   geom_line(data = fit, aes(x, ymin), linetype = "dashed", linewidth = 1) +
>   geom_line(data = fit, aes(x, ymax), linetype = "dashed", linewidth = 1)
>
> Hope this helps,
>
> Rui Barradas