Thank you, Bert and Ivan. I was building the SVM model in the hope of applying it to future cases, and I hoped the model would be able to deal with new words it hadn't encountered during training. I tried Bert's suggestion: I converted all of the data to tokens, created a DTM, transformed the whole thing with TF-IDF, and then split it 75%/25%. But when I began to train the SVM on the training data, R said it needed 26 GB for a vector and crashed. I tried again, and it crashed again. I don't know why this would happen. I had just trained 4 SVM models using my previous method without any memory trouble on my 8 GB machine. I tried unsuccessfully to remove the new words from the new test data. Should I try that? Is there a way to stop my system from crashing with the new method?
Thank you for any ideas. Here is the code I used when I separated the data after converting to tokens and applying TF-IDF:

library(readr)       # read_csv
library(tm)          # Corpus, VectorSource, DocumentTermMatrix, weightTfIdf
library(tokenizers)  # tokenize_words, tokenize_ngrams
library(caret)       # createDataPartition, trainControl, train

url <- "D:/test/M2.csv"
data <- read_csv(url)
text_corpus <- Corpus(VectorSource(data$Text))

# Return the unigrams, bigrams and trigrams of one document
tokenize_document <- function(doc) {
  doc_tokens <- unlist(tokenize_words(doc))
  doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
  doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
  all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
  return(all_tokens)
}

all_tokens <- lapply(text_corpus, tokenize_document)
text_dtm <- DocumentTermMatrix(Corpus(VectorSource(all_tokens)))
text_tfidf <- weightTfIdf(text_dtm)

# as.matrix() converts the sparse DTM into a dense matrix
processed_data <- data.frame(M2 = data$M2, text_tfidf = as.matrix(text_tfidf))

# 75%/25% train/test split
indexes <- createDataPartition(processed_data$M2, p = 0.75, list = FALSE)
trainData <- processed_data[indexes,]
testData <- processed_data[-indexes,]

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2, classProbs = TRUE)
model_svmRadial <- train(M2 ~ ., data = trainData, method = "svmRadial", trControl = ctrl)

________________________________
From: Ivan Krylov <krylov.r...@gmail.com>
Sent: Saturday, August 12, 2023 12:49 AM
To: James C Schopf <jcsch...@hotmail.com>
Cc: r-help@r-project.org <r-help@r-project.org>
Subject: Re: [R] Different TFIDF settings in test set prevent testing model

On Fri, 11 Aug 2023 10:20:27 +0000 James C Schopf <jcsch...@hotmail.com> wrote:

> > train_text_dtm <-
> >   DocumentTermMatrix(Corpus(VectorSource(all_train_tokens)))
> > test_text_dtm <-
> >   DocumentTermMatrix(Corpus(VectorSource(all_test_tokens)))

I understand the need to prepare the test dataset separately (e.g. in order to be able to work with text that doesn't exist at the time when the model is trained), but since the model has no representation for tokens it (well, the tokeniser) hasn't seen during the training process, you have to ensure that test_text_dtm references exactly the same tokens as train_text_dtm, in the same order of the columns.
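To make the column-matching requirement concrete, here is a minimal base R sketch. It does not use tm, and the toy documents and the helper name count_terms are hypothetical, made up for illustration only. The vocabulary is taken from the training documents alone, and the test documents are then counted against that fixed vocabulary, so unseen words are dropped and the columns come out in the same order. tm's DocumentTermMatrix also accepts a dictionary entry in its control list (see ?termFreq), which may do the same job; that is not tested here.

```r
# Hypothetical toy documents; not the poster's data.
train_docs <- c("the cat sat", "the dog barked")
test_docs  <- c("the cat meowed")  # "meowed" never occurs in training

# Split lower-cased text on anything that is not a letter.
split_words <- function(docs) strsplit(tolower(docs), "[^a-z]+")

# Count terms per document against a fixed vocabulary.
# Tokens outside the vocabulary become NA in factor() and are not counted.
count_terms <- function(docs, vocab) {
  rows <- lapply(split_words(docs), function(tok) {
    as.numeric(table(factor(tok, levels = vocab)))
  })
  counts <- do.call(rbind, rows)
  colnames(counts) <- vocab
  counts
}

# The vocabulary comes from the training documents only.
vocab <- sort(unique(unlist(split_words(train_docs))))

train_counts <- count_terms(train_docs, vocab)
test_counts  <- count_terms(test_docs, vocab)

# Both matrices now have identical columns in identical order.
print(colnames(train_counts))
print(test_counts)
```

The design choice is that the test set never contributes columns of its own: anything the model has not seen is silently ignored instead of changing the shape of the input.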
Also, it probably makes sense to reuse the document frequencies (the IDF part of the weighting) learned on the training document set; otherwise you may be importance-weighting different tokens than the ones your SVM has learned as important if your test set has a significantly different distribution from that of the training set.

Bert is probably right: with the API given by the tm package, it seems easiest to tokenise and weight document-term matrices first, then split them into the train and test subsets. It may be worth asking the maintainer about applying previously "learned" transformations to new corpora.

-- 
Best regards,
Ivan
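As an illustration of reusing weights learned on the training set, here is a small base R sketch. It is independent of tm; the count matrices and the helper name apply_tfidf are hypothetical, and the weighting (term frequency normalised by document length, times log2 of the number of training documents over the document frequency) is an assumption chosen for the example, not necessarily what weightTfIdf does with your settings. The IDF vector is computed once from the training counts and then applied unchanged to the test counts.

```r
# Hypothetical term counts (rows = documents, columns = terms),
# already aligned to the same vocabulary in the same column order.
terms <- c("the", "cat", "dog")
train_counts <- matrix(c(1, 1, 0,
                         1, 0, 2,
                         1, 0, 0),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(NULL, terms))
test_counts <- matrix(c(2, 1, 1), nrow = 1,
                      dimnames = list(NULL, terms))

# Learn the IDF weights from the training documents only.
# Every term in the training vocabulary occurs at least once there,
# so the document frequency is never zero.
n_train  <- nrow(train_counts)
doc_freq <- colSums(train_counts > 0)
idf      <- log2(n_train / doc_freq)

# Apply a fixed IDF vector to any count matrix with the same columns.
apply_tfidf <- function(counts, idf) {
  tf <- counts / pmax(rowSums(counts), 1)  # guard against empty documents
  sweep(tf, 2, idf, `*`)
}

train_tfidf <- apply_tfidf(train_counts, idf)
test_tfidf  <- apply_tfidf(test_counts, idf)  # reuses the training IDF

print(idf)
print(test_tfidf)
```

Here "the" appears in every training document, so its weight is zero in the test document as well, however often it occurs there.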