[R] Different TFIDF settings in test set prevent testing model

James C Schopf Fri, 11 Aug 2023 07:17:35 -0700

Hello, I'd be very grateful for your help.

I randomly split a .csv file of 1287 documents 75%/25% into two .csv files,
one for training a model and the other for testing it.  I applied similar
preprocessing, including a TF-IDF transformation, to both sets, but R won't
let me make predictions on the test set because its TF-IDF matrix differs
from the one used in training.
I get the error message:


Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type 
"nmatrix.27118" was supplied

I'd greatly appreciate a suggestion to overcome this problem.
Thanks!


Here's my R code:

> library(tidyverse)
> library(tidytext)
> library(caret)
> library(kernlab)
> library(tokenizers)
> library(tm)
> library(e1071)

***LOAD TRAINING SET: 959 rows with text in column 1 and yes/no in column 2 (labelled M2)
> url <- "D:/test/M2_75.csv"
> d <- read_csv(url)
***CREATE TEXT CORPUS FROM TEXT COLUMN
> train_text_corpus <- Corpus(VectorSource(d$Text))
***DEFINE TOKENS FOR EACH DOCUMENT IN CORPUS AND COMBINE THEM
> tokenize_document <- function(doc) {
+     doc_tokens <- unlist(tokenize_words(doc))
+     doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
+     doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
+     all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
+     return(all_tokens)
+ }
***APPLY TOKENS TO DOCUMENTS
> all_train_tokens <- lapply(train_text_corpus, tokenize_document)
***CREATE A DTM FROM THE TOKENS
> train_text_dtm <- DocumentTermMatrix(Corpus(VectorSource(all_train_tokens)))
***TRANSFORM THE DTM INTO A TF-IDF MATRIX
> train_text_tfidf <- weightTfIdf(train_text_dtm)
***CREATE A NEW DATA FRAME WITH M2 COLUMN FROM ORIGINAL DATA
> trainData <- data.frame(M2 = d$M2)
***ADD NEW TFIDF transformed TEXT COLUMN NEXT TO DATA FRAME
> trainData$text_tfidf <- I(as.matrix(train_text_tfidf))
***DEFINE THE ML MODEL
> ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2,
+                      classProbs = TRUE)
***TRAIN SVM
> model_svmRadial <- train(M2 ~ ., data = trainData, method = "svmRadial",
+                          trControl = ctrl)
***SAVE SVM
> saveRDS(model_svmRadial, file = "D:/SML/model_M23_svmRadial_UP.RDS")

R code for my test set, which fails at the last step:

***LOAD TEST SET: 309 rows with text in column 1 and yes/no in column 2 (labelled M2)
> url <- "D:/test/M2_25.csv"
> d <- read_csv(url)
***CREATE TEXT CORPUS FROM TEXT COLUMN
> test_text_corpus <- Corpus(VectorSource(d$Text))
***DEFINE TOKENS FOR EACH DOCUMENT IN CORPUS AND COMBINE THEM
> tokenize_document <- function(doc) {
+     doc_tokens <- unlist(tokenize_words(doc))
+     doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
+     doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
+     all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
+     return(all_tokens)
+ }
***APPLY TOKEN TO DOCUMENTS
> all_test_tokens <- lapply(test_text_corpus, tokenize_document)
***CREATE A DTM FROM THE TOKENS
> test_text_dtm <- DocumentTermMatrix(Corpus(VectorSource(all_test_tokens)))
***TRANSFORM THE DTM INTO A TF-IDF MATRIX
> test_text_tfidf <- weightTfIdf(test_text_dtm)
***CREATE A NEW DATA FRAME WITH M2 COLUMN FROM ORIGINAL TEST DATA
> testData <- data.frame(M2 = d$M2)
***ADD NEW TFIDF transformed TEXT COLUMN NEXT TO TEST DATA
> testData$text_tfidf <- I(as.matrix(test_text_tfidf))
***LOAD OLD MODEL
> model_svmRadial <- readRDS("D:/SML/model_M2_75_svmRadial.RDS")
***MAKE PREDICTIONS
> predictions <- predict(model_svmRadial, newdata = testData)

This last line produces the error message:

Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type 
"nmatrix.27118" was supplied

Please help.  Thanks!
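
One possible reading of the error, offered as a sketch rather than a tested fix: the two numbers in the message (67503 and 27118) look like column counts, which would happen if each DocumentTermMatrix builds its own vocabulary from its own documents. The base-R toy below illustrates the general idea of reusing the training vocabulary and the training IDF weights for the test set. The documents, the helpers count_terms() and to_tfidf(), and the log2 IDF formula are all assumptions made for illustration; none of it is taken from the code above.

```r
# Toy stand-ins for the training and test texts (hypothetical data).
train_docs <- c("the cat sat", "the dog ran")
test_docs  <- c("the cat ran fast")

# Hypothetical helper: count terms per document over a FIXED vocabulary.
# Tokens outside the vocabulary become NA in the factor and are dropped by table().
count_terms <- function(docs, vocab) {
  rows <- lapply(strsplit(docs, " ", fixed = TRUE), function(tokens) {
    as.numeric(table(factor(tokens, levels = vocab)))
  })
  m <- do.call(rbind, rows)
  colnames(m) <- vocab
  m
}

# The vocabulary and the IDF weights come from the training set only.
vocab        <- sort(unique(unlist(strsplit(train_docs, " ", fixed = TRUE))))
train_counts <- count_terms(train_docs, vocab)
idf          <- log2(nrow(train_counts) / colSums(train_counts > 0))

# Hypothetical helper: length-normalised term frequency times the training IDF.
to_tfidf <- function(counts, idf) {
  tf <- counts / pmax(rowSums(counts), 1)  # avoid dividing by zero for empty rows
  sweep(tf, 2, idf, `*`)
}

train_tfidf <- to_tfidf(train_counts, idf)
test_tfidf  <- to_tfidf(count_terms(test_docs, vocab), idf)

# Both matrices now share the same columns, in the same order.
stopifnot(identical(colnames(train_tfidf), colnames(test_tfidf)))
```

In the toy data the unseen test word "fast" is simply dropped, so the test matrix has exactly the training columns. With tm, the `dictionary` entry of the `control` list passed to DocumentTermMatrix(), for example `control = list(dictionary = Terms(train_text_dtm))`, may serve the same purpose. Note that calling weightTfIdf() on the test matrix would still compute IDF from the test documents rather than the training ones.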