Hello, I'd be very grateful for your help. I randomly split a .csv file with 1287 documents 75%/25% into two csv files, one for training an algorithm and the other for testing it. I applied similar preprocessing, including a TF-IDF transformation, to both sets, but R won't let me make predictions on the test set because the test set yields a different TF-IDF matrix. I get the error message:
Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type "nmatrix.27118" was supplied

I'd greatly appreciate a suggestion to overcome this problem. Thanks!

Here is my R code:

library(tidyverse)
library(tidytext)
library(caret)
library(kernlab)
library(tokenizers)
library(tm)
library(e1071)

# LOAD TRAINING SET: 959 rows with text in column 1 and yes/no in column 2 (labelled M2)
url <- "D:/test/M2_75.csv"
d <- read_csv(url)

# CREATE TEXT CORPUS FROM TEXT COLUMN
train_text_corpus <- Corpus(VectorSource(d$Text))

# DEFINE TOKENS FOR EACH DOCUMENT IN CORPUS AND COMBINE THEM
tokenize_document <- function(doc) {
  doc_tokens <- unlist(tokenize_words(doc))
  doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
  doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
  all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
  return(all_tokens)
}

# APPLY TOKENS TO DOCUMENTS
all_train_tokens <- lapply(train_text_corpus, tokenize_document)

# CREATE A DTM FROM THE TOKENS
train_text_dtm <- DocumentTermMatrix(Corpus(VectorSource(all_train_tokens)))

# TRANSFORM THE DTM INTO A TF-IDF MATRIX
train_text_tfidf <- weightTfIdf(train_text_dtm)

# CREATE A NEW DATA FRAME WITH M2 COLUMN FROM ORIGINAL DATA
trainData <- data.frame(M2 = d$M2)

# ADD NEW TF-IDF TRANSFORMED TEXT COLUMN TO DATA FRAME
trainData$text_tfidf <- I(as.matrix(train_text_tfidf))

# DEFINE THE ML MODEL
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2,
                     classProbs = TRUE)

# TRAIN SVM
model_svmRadial <- train(M2 ~ ., data = trainData, method = "svmRadial",
                         trControl = ctrl)

# SAVE SVM
saveRDS(model_svmRadial, file = "D:/SML/model_M23_svmRadial_UP.RDS")

R code for my test set, which fails at the last step:

# LOAD TEST SET: 309 rows with text in column 1 and yes/no in column 2 (labelled M2)
url <- "D:/test/M2_25.csv"
d <- read_csv(url)

# CREATE TEXT CORPUS FROM TEXT COLUMN
test_text_corpus <- Corpus(VectorSource(d$Text))

# DEFINE TOKENS FOR EACH DOCUMENT IN CORPUS AND COMBINE THEM
tokenize_document <- function(doc) {
  doc_tokens <- unlist(tokenize_words(doc))
  doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
  doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
  all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
  return(all_tokens)
}

# APPLY TOKENS TO DOCUMENTS
all_test_tokens <- lapply(test_text_corpus, tokenize_document)

# CREATE A DTM FROM THE TOKENS
test_text_dtm <- DocumentTermMatrix(Corpus(VectorSource(all_test_tokens)))

# TRANSFORM THE DTM INTO A TF-IDF MATRIX
test_text_tfidf <- weightTfIdf(test_text_dtm)

# CREATE A NEW DATA FRAME WITH M2 COLUMN FROM ORIGINAL TEST DATA
testData <- data.frame(M2 = d$M2)

# ADD NEW TF-IDF TRANSFORMED TEXT COLUMN TO TEST DATA
testData$text_tfidf <- I(as.matrix(test_text_tfidf))

# LOAD OLD MODEL
model_svmRadial <- readRDS("D:/SML/model_M2_75_svmRadial.RDS")

# MAKE PREDICTIONS
predictions <- predict(model_svmRadial, newdata = testData)

This last line produces the error message:

Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type "nmatrix.27118" was supplied

Please help. Thanks!

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
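
The two numbers in the error message appear to be the column counts of the training and test matrices (67503 vs. 27118 terms), which suggests the test matrix was built over its own vocabulary rather than the training one. A minimal sketch of one possible direction, in base R only, is to reshape the test matrix onto the training terms before predicting. The helper name align_to_terms and the toy matrices below are hypothetical and not part of the code above; this only illustrates the idea and has not been run against the real data.

```r
# Hypothetical helper: reshape a document-term matrix so that its columns
# are exactly `terms`, in that order. Terms absent from `m` become zero
# columns; terms in `m` that are not in `terms` are dropped.
align_to_terms <- function(m, terms) {
  out <- matrix(0, nrow = nrow(m), ncol = length(terms),
                dimnames = list(rownames(m), terms))
  shared <- intersect(colnames(m), terms)
  out[, shared] <- m[, shared, drop = FALSE]
  out
}

# Toy count matrices standing in for as.matrix(train_text_dtm) and
# as.matrix(test_text_dtm) (assumption: the real ones have named columns).
train_counts <- matrix(c(1, 0, 2,
                         0, 3, 1),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(NULL, c("alpha", "beta", "gamma")))
test_counts <- matrix(c(2, 1,
                        0, 4),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(NULL, c("gamma", "delta")))

# The aligned test matrix has the same columns as the training matrix.
test_aligned <- align_to_terms(test_counts, colnames(train_counts))
print(dim(test_aligned))
print(colnames(test_aligned))
```

With the real data, the same helper would presumably be called with as.matrix(test_text_tfidf) and the column names of as.matrix(train_text_tfidf), which requires the training terms to be available when the test script runs. The tm documentation also describes a `dictionary` entry in the `control` list of DocumentTermMatrix that restricts the matrix to a given set of terms, which may be a way to do this at construction time. Note that weightTfIdf applied to the test set alone computes its weights from the test documents, so the values may still differ in scale from those seen during training.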