[R] Use generic functions, e.g. print, without UseMethod?

2023-08-11 Thread Sigbert Klinke

Hello,

I have defined a function 'equations(...)' which returns an object with
class 'equations'. I also defined a function 'print.equations' which
prints the object. But I did not use 'equations <- function(x, ...)
UseMethod("equations")'. Two questions:


1.) Is this a sensible approach?
2.) If yes, are there any pitfalls I could run into later?

Thanks

Sigbert

--
https://hu.berlin/sk
https://www.stat.de/faqs
https://hu.berlin/mmstat
https://hu.berlin/mmstat-ar

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Use generic functions, e.g. print, without UseMethod?

2023-08-11 Thread Ivan Krylov
On Fri, 11 Aug 2023 09:20:03 +0200
Sigbert Klinke  wrote:

> I have defined a function 'equations(...)' which returns an object
> with class 'equations'.

> But I did not use 'equations <- function(x, ...)
> UseMethod("equations")'. Two questions:
> 
> 1.) Is this a sensible approach?

Quite. If there is little reason for your constructor to be generic
(i.e. there is only one way to construct "equations" objects), it can
stay an ordinary R function. lm() works the same way, for example, and
so do many statistical tests and contributed model functions.
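For illustration, here is a minimal sketch of that pattern: an ordinary constructor plus a method for the existing print() generic. The internal structure of the object (a list with an 'eqs' element) and the printed format are assumptions made for this example, not the original poster's code.

```r
# Ordinary (non-generic) constructor: it simply returns a classed list.
equations <- function(...) {
  structure(list(eqs = list(...)), class = "equations")
}

# Method for the existing generic print(). No UseMethod() call is needed
# here, because print() itself already dispatches on the class attribute.
print.equations <- function(x, ...) {
  cat("<equations> with", length(x$eqs), "element(s)\n")
  invisible(x)
}

e <- equations(y ~ x, z ~ y)
print(e)
```

Typing 'e' at the prompt would call print.equations() in the same way through autoprinting.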

> 2.) If yes, are there any pitfalls I could run into later?

If it later turns out that you need S3 dispatch on the constructor too,
you will need to take care to design its formals to avoid breaking
compatibility with the old code. Ideally, the generic should take (x,
...), with the first argument determining the method that will be
called. If that would conflict with the already-existing code, the
generic can have a different signature and give a different object=
argument to UseMethod(), but the methods will have to follow the
signature of the generic.
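A minimal sketch of that last option, assuming a hypothetical legacy signature equations(name, data, ...) in which dispatch has to happen on the second argument rather than the first. The argument names and the object layout are invented for the example.

```r
# Hypothetical legacy signature: the first argument is a label, so the
# generic passes 'data' explicitly as the object to dispatch on.
equations <- function(name, data, ...) UseMethod("equations", data)

# Methods have to follow the signature of the generic.
equations.default <- function(name, data, ...) {
  structure(list(name = name, data = data), class = "equations")
}

equations.formula <- function(name, data, ...) {
  structure(list(name = name, data = all.vars(data)), class = "equations")
}

a <- equations("a", 1:3)    # dispatches to equations.default
b <- equations("b", y ~ x)  # dispatches to equations.formula
```

Old calls of the form equations("a", 1:3) keep working, because the formals did not change when the function became generic.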

-- 
Best regards,
Ivan



[R] Puzzled by results from base::rank()

2023-08-11 Thread Chris Evans via R-help
I understand that the default ties.method is "average".  Here is what I 
get, expanding a bit on the help page example. Running R 4.3.1 on Ubuntu 
22.04.2.


> x2 <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5)
> rank(x2)
 [1]  4.5  1.5  6.0  1.5  8.0 11.0  3.0 10.0  8.0  4.5  8.0

OK so the ties, each with two members, are ranked at their mean.

So now I turn one tie from a twin to a triplet:

> x3 <- c(x2, 3)
> rank(x3)
 [1]  5.0  1.5  7.0  1.5  9.0 12.0  3.0 11.0  9.0  5.0  9.0  5.0
> sprintf("%4.3f", rank(x3))
 [1] "5.000"  "1.500"  "7.000"  "1.500"  "9.000"  "12.000" "3.000"  "11.000"
 [9] "9.000"  "5.000"  "9.000"  "5.000"

The doublet is still given the mean of the values but the triplet is 
rounded up.  What am I missing here?!


TIA,

Chris

--
Chris Evans (he/him)
Visiting Professor, UDLA, Quito, Ecuador & Honorary Professor, 
University of Roehampton, London, UK.

Work web site: https://www.psyctc.org/psyctc/
CORE site: http://www.coresystemtrust.org.uk/
Personal site: https://www.psyctc.org/pelerinage2016/



Re: [R] Use generic functions, e.g. print, without UseMethod?

2023-08-11 Thread Rui Barradas

Hello,

You have to ask yourself what kind of objects are you passing to 
'equations(...)'?

Do you need to have

'equations.double(...)'
'equations.character(...)'
'equations.formula(...)'
'equations.matrix(...)'
[...]

specifically written for objects of class

numeric
character
formula
matrix
[...]

respectively?
These methods would act on the respective class, process those objects
somewhat differently because they are of different classes, and output an
object of class "equations".

(If so, it is recommended to write an 'equations.default(...)' too.)

Methods such as print.equations or summary.equations are written when you
want your new class to have functionality its users are already
familiar with.


If, for instance, autoprint is on, as it usually is, users can see
their "equations" object by typing its name at a prompt. print.equations
would display the object in a way relevant to the new class.


But this does not mean that the function that *creates* the object needs
to be generic. You only need a new generic if you want methods that process
inputs of different classes in ways specific to those classes.
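As a rough sketch of what such a generic constructor could look like (the method bodies and the object layout are assumptions chosen for illustration only):

```r
# Generic constructor: dispatches on the class of its first argument.
equations <- function(x, ...) UseMethod("equations")

# Character input: each string is parsed into a formula.
equations.character <- function(x, ...) {
  structure(list(eqs = lapply(x, stats::as.formula)), class = "equations")
}

# Formula input: stored as a single equation.
equations.formula <- function(x, ...) {
  structure(list(eqs = list(x)), class = "equations")
}

# Fallback for every other class: fail with an informative message.
equations.default <- function(x, ...) {
  stop("don't know how to build 'equations' from class ", class(x)[1L])
}

e1 <- equations(c("y ~ x", "z ~ y"))
e2 <- equations(y ~ x)
```

If there is only one kind of input, none of this machinery is needed and a plain function is enough.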


Hope this helps,

Rui Barradas



Re: [R] Puzzled by results from base::rank()

2023-08-11 Thread Gerrit Eichner

Dear Chris,

the members of the triplet would be ranked 4, 5 and 6 (in your example), 
so the *mean of their ranks* is correctly 5.


For any set of k tied values the ranks of its elements are averaged (and 
assigned to each of its k members).
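A short sketch that reproduces this by hand: first give every element a distinct position with ties.method = "first", then average the positions within each group of tied values. The helper names 'pos' and 'avg' are introduced only for this example.

```r
x3 <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 3)

# Distinct positions 1..n; ties are broken by order of appearance.
pos <- rank(x3, ties.method = "first")

# The three 3s occupy positions 4, 5 and 6.
sort(pos[x3 == 3])

# Averaging the positions within each tie group gives the default result.
avg <- ave(as.numeric(pos), x3)
all.equal(avg, rank(x3))
```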


 Hth  --  Gerrit

-
Dr. Gerrit Eichner   Mathematical Institute, Room 215
gerrit.eich...@math.uni-giessen.de   Justus-Liebig-University Giessen
Tel: +49-(0)641-99-32104  Arndtstr. 2, 35392 Giessen, Germany
http://www.uni-giessen.de/eichner
-



Re: [R] Puzzled by results from base::rank()

2023-08-11 Thread Ebert,Timothy Aaron
I have entered the values into Excel and sorted them. I am assuming you are
asking why the value 3 is ranked 4.5 in x2 but 5 in x3.
x2 looks like this:

Value   Rank   Order
1       1.5    1
1       1.5    2
2       3      3
3       4.5    4
3       4.5    5
4       6      6
5       8      7
5       8      8
5       8      9
6       10     10
9       11     11

The average of 4 and 5 is 4.5.

For x3 we have:

Value   Rank   Order
1       1.5    1
1       1.5    2
2       3      3
3       5      4
3       5      5
3       5      6
4       7      7
5       9      8
5       9      9
5       9      10
6       11     11
9       12     12

The ranks of the threes are 4, 5, and 6 and the average is 5.
For any set of values adding one value that is the same as an existing value 
will always increase the rank of that value. It has not been rounded up, though 
it may look that way in the example. If you add another 3 to the data the rank 
will increase to 5.5, and adding another three will give a rank of 6. Each 
additional 3 will boost the rank by 0.5.

You can get a different result if you change a value. If there is a mistake in 
the data and I discover that the second 1 in x2 should be a 3, then the rank 
for 3 is 4 and it looks like I have rounded down. If the mistake happened for a 
value greater than 3 then it would again look like I had rounded up. However, 
the appearance of "rounding" is an illusion easily seen through if you expand 
your example to generalize the outcome.
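The two cases described above can be checked directly. This is only a small sketch; 'rank_of_3' and 'x2_fixed' are names made up for the example.

```r
x2 <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5)

# Rank of the value 3 after appending 0, 1, 2 and 3 extra threes to x2.
rank_of_3 <- sapply(0:3, function(k) rank(c(x2, rep(3, k)))[1])
rank_of_3  # 4.5 5.0 5.5 6.0

# Changing the second 1 into a 3 moves the tie group down instead.
x2_fixed <- x2
x2_fixed[4] <- 3
rank(x2_fixed)[1]  # 4
```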



Tim



Re: [R] group consecutive dates in a row

2023-08-11 Thread Stefano Sofia
Thank you for your hints.

All of them have been useful, and you solved my problem.

I understand the role of rle, but I think it is not essential for my
task.


I will pay more attention to searching the existing documentation.

Thank you again

Stefano


 (oo)
--oOO--( )--OOo--
Stefano Sofia PhD
Civil Protection - Marche Region - Italy
Meteo Section
Snow Section
Via del Colle Ameno 5
60126 Torrette di Ancona, Ancona (AN)
Uff: +39 071 806 7743
E-mail: stefano.so...@regione.marche.it
---Oo-oO



From: Gabor Grothendieck
Sent: Monday, 7 August 2023 20:30
To: Stefano Sofia
Cc: r-help@R-project.org
Subject: Re: [R] group consecutive dates in a row

It is best to use Date, rather than POSIXct, class if there are no times.

Use the cumsum expression shown to group the dates and then summarize
each group.

We assume that the dates are already sorted in ascending order.

  library(dplyr)

  mydf <- data.frame(date = as.Date(c("2012-02-05", "2012-02-06",
    "2012-02-07", "2012-02-13", "2012-02-21")))

  mydf %>%
    group_by(grp = cumsum(c(0, diff(date)) > 1)) %>%
    summarize(start = first(date), end = last(date)) %>%
    ungroup %>%
    select(-grp)
  ## # A tibble: 3 × 2
  ##   start      end
  ##   <date>     <date>
  ## 1 2012-02-05 2012-02-07
  ## 2 2012-02-13 2012-02-13
  ## 3 2012-02-21 2012-02-21

or with only base R:

  smrz <- function(x) with(x, data.frame(start = min(date), end = max(date)))
  do.call("rbind", by(mydf, cumsum(c(0, diff(mydf$date)) > 1), smrz))
  ##        start        end
  ## 0 2012-02-05 2012-02-07
  ## 1 2012-02-13 2012-02-13
  ## 2 2012-02-21 2012-02-21
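To make the cumsum expression easier to follow, here is a step-by-step sketch in base R. The last step, which sets the end of single-day runs to NA as in the layout requested in the original question, is an addition for illustration and not part of the code above; the variable names are also invented here.

```r
dates <- as.Date(c("2012-02-05", "2012-02-06", "2012-02-07",
                   "2012-02-13", "2012-02-21"))

gap     <- c(0, diff(dates))  # days since the previous date: 0 1 1 6 8
new_run <- gap > 1            # TRUE where a new run of dates starts
grp     <- cumsum(new_run)    # run id for each date: 0 0 0 1 2

start <- dates[!duplicated(grp)]                   # first date of each run
end   <- dates[!duplicated(grp, fromLast = TRUE)]  # last date of each run
end[end == start] <- NA                            # single-day runs get NA

result <- data.frame(start = start, end = end)
print(result)
```

As above, this assumes that the dates are already sorted in ascending order.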


On Mon, Aug 7, 2023 at 12:42 PM Stefano Sofia
 wrote:
>
> Dear R users,
>
> I have a data frame with a single column of POSIXct elements, like
>
>
> mydf <- data.frame(data_POSIX=as.POSIXct(c("2012-02-05", "2012-02-06", 
> "2012-02-07", "2012-02-13", "2012-02-21"), format = "%Y-%m-%d", 
> tz="Etc/GMT-1"))
>
>
> I need to transform it into a two-column data frame in which runs of
> consecutive dates are collapsed. It should look like
>
>
> data_POSIX_init data_POSIX_fin
>
> 2012-02-05 2012-02-07
>
> 2012-02-13 NA
>
> 2012-02-21 NA
>
>
> I started with two "while cycles" and so on, but this is not an efficient way 
> to do it.
>
> Could you please give me a hint on how to proceed?
>
>
> Thank you for your precious attention and help
>
> Stefano
>
>
>  (oo)
> --oOO--( )--OOo--
> Stefano Sofia PhD
> Civil Protection - Marche Region - Italy
> Meteo Section
> Snow Section
> Via del Colle Ameno 5
> 60126 Torrette di Ancona, Ancona (AN)
> Uff: +39 071 806 7743
> E-mail: stefano.so...@regione.marche.it
> ---Oo-oO
>



--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


[R] Different TFIDF settings in test set prevent testing model

2023-08-11 Thread James C Schopf
Hello, I'd be very grateful for your help.

I randomly separated a .csv file with 1287 documents 75%/25% into 2 csv files, 
one for training an algorithm and the other for testing the algorithm.  I 
applied similar preprocessing, including TFIDF transformation, to both sets, 
but R won't let me make predictions on the test set due to a different TFIDF 
matrix.
I get the error message:

Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type 
"nmatrix.27118" was supplied

I'd greatly appreciate a suggestion to overcome this problem.
Thanks!


Here is my R code:

library(tidyverse)
library(tidytext)
library(caret)
library(kernlab)
library(tokenizers)
library(tm)
library(e1071)

# LOAD TRAINING SET: 959 rows with text in column 1 and yes/no in column 2
# (labelled M2)
url <- "D:/test/M2_75.csv"
d <- read_csv(url)
# CREATE TEXT CORPUS FROM TEXT COLUMN
train_text_corpus <- Corpus(VectorSource(d$Text))
# DEFINE TOKENS FOR EACH DOCUMENT IN CORPUS AND COMBINE THEM
tokenize_document <- function(doc) {
  doc_tokens <- unlist(tokenize_words(doc))
  doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
  doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
  all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
  return(all_tokens)
}
# APPLY TOKENS TO DOCUMENTS
all_train_tokens <- lapply(train_text_corpus, tokenize_document)
# CREATE A DTM FROM THE TOKENS
train_text_dtm <- DocumentTermMatrix(Corpus(VectorSource(all_train_tokens)))
# TRANSFORM THE DTM INTO A TF-IDF MATRIX
train_text_tfidf <- weightTfIdf(train_text_dtm)
# CREATE A NEW DATA FRAME WITH M2 COLUMN FROM ORIGINAL DATA
trainData <- data.frame(M2 = d$M2)
# ADD NEW TFIDF transformed TEXT COLUMN NEXT TO DATA FRAME
trainData$text_tfidf <- I(as.matrix(train_text_tfidf))
# DEFINE THE ML MODEL
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2,
                     classProbs = TRUE)
# TRAIN SVM
model_svmRadial <- train(M2 ~ ., data = trainData, method = "svmRadial",
                         trControl = ctrl)
# SAVE SVM
saveRDS(model_svmRadial, file = "D:/SML/model_M23_svmRadial_UP.RDS")

R code for my test set, which failed at the last step:

# LOAD TEST SET: 309 rows with text in column 1 and yes/no in column 2
# (labelled M2)
url <- "D:/test/M2_25.csv"
d <- read_csv(url)
# CREATE TEXT CORPUS FROM TEXT COLUMN
test_text_corpus <- Corpus(VectorSource(d$Text))
# DEFINE TOKENS FOR EACH DOCUMENT IN CORPUS AND COMBINE THEM
tokenize_document <- function(doc) {
  doc_tokens <- unlist(tokenize_words(doc))
  doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
  doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
  all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
  return(all_tokens)
}
# APPLY TOKENS TO DOCUMENTS
all_test_tokens <- lapply(test_text_corpus, tokenize_document)
# CREATE A DTM FROM THE TOKENS
test_text_dtm <- DocumentTermMatrix(Corpus(VectorSource(all_test_tokens)))
# TRANSFORM THE DTM INTO A TF-IDF MATRIX
test_text_tfidf <- weightTfIdf(test_text_dtm)
# CREATE A NEW DATA FRAME WITH M2 COLUMN FROM ORIGINAL TEST DATA
testData <- data.frame(M2 = d$M2)
# ADD NEW TFIDF transformed TEXT COLUMN NEXT TO TEST DATA
testData$text_tfidf <- I(as.matrix(test_text_tfidf))
# LOAD OLD MODEL
model_svmRadial <- readRDS("D:/SML/model_M2_75_svmRadial.RDS")
# MAKE PREDICTIONS
predictions <- predict(model_svmRadial, newdata = testData)

This last line produces the error message:

Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type 
"nmatrix.27118" was supplied

Please help.  Thanks!


Re: [R] Different TFIDF settings in test set prevent testing model

2023-08-11 Thread Bert Gunter
I know nothing about tf, etc., but can you not simply read in the whole
file into R and then randomly split using R? The training and test sets
would simply be defined by a single random sample of subscripts which is
either chosen or not.

e.g. (simplified example -- you would be subsetting the rows of your full
dataset):

> x<- 1:10
> samp <- sort(sample(x,5))
> x[samp] ## training
[1] 3 4 6 7 8
> x[-samp] ## test
[1]  1  2  5  9 10

Apologies if my ignorance means this can't work.
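Applied to the rows of a data frame, the same idea could look like the sketch below. The toy data frame 'full' and the 75% proportion are placeholders standing in for the real dataset.

```r
set.seed(1)  # only to make the random split reproducible

# Toy stand-in for the full dataset.
full <- data.frame(id = 1:8, label = rep(c("yes", "no"), 4))

# One random sample of row indices defines both subsets.
train_idx <- sort(sample(nrow(full), size = floor(0.75 * nrow(full))))

train <- full[train_idx, ]   # rows that were chosen
test  <- full[-train_idx, ]  # all remaining rows
```

Any preprocessing done on 'full' before the split is then shared by both subsets.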

Cheers,
Bert



Re: [R] Different TFIDF settings in test set prevent testing model

2023-08-11 Thread Ivan Krylov
On Fri, 11 Aug 2023 10:20:27 +
James C Schopf wrote:

> > train_text_dtm <-
> > DocumentTermMatrix(Corpus(VectorSource(all_train_tokens)))  

> > test_text_dtm <-
> > DocumentTermMatrix(Corpus(VectorSource(all_test_tokens)))

I understand the need to prepare the test dataset separately
(e.g. in order to be able to work with texts that don't exist at the
time when the model is trained), but since the model has no representation
for tokens it (well, the tokeniser) hasn't seen during the training
process, you have to ensure that test_text_dtm references exactly the
same tokens as train_text_dtm, with the columns in the same order.

Also, it probably makes sense to reuse the term weights (in particular
the inverse document frequencies) learned on the training document set;
otherwise you may be importance-weighting different tokens than the ones
your SVM has learned as important if your test set has a significantly
different distribution from that of the training set.

Bert is probably right: with the API given by the tm package, it
seems easiest to tokenise and weight document-term matrices first, then
split them into the train and test subsets. It may be worth asking the
maintainer about applying previously "learned" transformations to new
corpora.
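To illustrate the two requirements (same columns in the same order, and weights learned on the training set only), here is a base R sketch on toy data. It deliberately does not use the tm API; the helper functions 'tokenize', 'count_terms' and 'tfidf' and the example documents are all made up for this illustration, and the weighting is a simple TF-IDF variant that need not match weightTfIdf() exactly.

```r
# Toy documents (hypothetical data, not from the thread).
train_docs <- c("red apple", "green apple", "red car")
test_docs  <- c("green car", "blue apple")  # "blue" is unseen in training

tokenize <- function(doc) strsplit(tolower(doc), "\\s+")[[1]]

# Count matrix restricted to a fixed vocabulary; unseen tokens are dropped.
count_terms <- function(docs, vocab) {
  m <- t(vapply(docs, function(d) {
    as.numeric(table(factor(tokenize(d), levels = vocab)))
  }, numeric(length(vocab))))
  dimnames(m) <- list(NULL, vocab)
  m
}

# The vocabulary and the IDF are learned from the training documents only.
vocab        <- sort(unique(unlist(lapply(train_docs, tokenize))))
train_counts <- count_terms(train_docs, vocab)
idf          <- log2(nrow(train_counts) / colSums(train_counts > 0))

# Term frequency is relative to the in-vocabulary tokens of each document.
tfidf <- function(counts, idf) {
  tf <- counts / pmax(rowSums(counts), 1)  # guard against empty documents
  sweep(tf, 2, idf, `*`)
}

train_tfidf <- tfidf(train_counts, idf)
test_tfidf  <- tfidf(count_terms(test_docs, vocab), idf)
```

Both matrices now have identical columns, so a model fitted on train_tfidf can be given test_tfidf without a dimension mismatch; tokens that never occurred in training simply contribute nothing.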

-- 
Best regards,
Ivan



Re: [R] Different TFIDF settings in test set prevent testing model

2023-08-11 Thread James C Schopf
Thank you Bert and Ivan,

I was building the SVM model in the hope of applying it to future cases,
and hoped that the model would be able to deal with new words it hadn't
encountered during training. I tried Bert's suggestion: I converted all of
the data to tokens, created a DTM, transformed the whole thing with TF-IDF,
and then split it 75%/25%. But when I began to train the SVM on the
training data, R said it needed 26 GB for a vector and crashed. I tried
again and it crashed again. I don't know why this would happen: I had just
trained 4 SVM models using my previous method without any memory trouble
on my 8 GB machine. I also tried, unsuccessfully, to remove the new words
from the new test data. Should I keep trying that? Is there a way to stop
my system from crashing with the new method?

Thank you for any ideas.

Here is the code I used when I split the data after converting to tokens
and applying TF-IDF:

url <- "D:/test/M2.csv"
data <- read_csv(url)
text_corpus <- Corpus(VectorSource(data$Text))
tokenize_document <- function(doc) {
  doc_tokens <- unlist(tokenize_words(doc))
  doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
  doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
  all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
  return(all_tokens)
}
all_tokens <- lapply(text_corpus, tokenize_document)
text_dtm <- DocumentTermMatrix(Corpus(VectorSource(all_tokens)))
text_tfidf <- weightTfIdf(text_dtm)
processed_data <- data.frame(M2 = data$M2, text_tfidf = as.matrix(text_tfidf))
indexes <- createDataPartition(processed_data$M2, p = 0.75, list = FALSE)
trainData <- processed_data[indexes,]
testData <- processed_data[-indexes,]
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2,
                     classProbs = TRUE)
model_svmRadial <- train(M2 ~ ., data = trainData, method = "svmRadial",
                         trControl = ctrl)









[R] geom_smooth

2023-08-11 Thread Thomas Subia via R-help
Colleagues,

Here is my reproducible code for a graph using geom_smooth
set.seed(55)
scatter_data <- tibble(x_var = runif(100, min = 0, max = 25)
   ,y_var = log2(x_var) + rnorm(100))

library(ggplot2)
library(cowplot)

ggplot(scatter_data,aes(x=x_var,y=y_var))+
  geom_point()+
  geom_smooth(se=TRUE,fill="blue",color="black",linetype="dashed")+
  theme_cowplot()

I'd like to add a black boundary around the shaded area. I suspect this can be 
done with geom_ribbon but I cannot figure this out. Some advice would be 
welcome.

Thanks!

Thomas Subia



Re: [R] geom_smooth

2023-08-11 Thread Rui Barradas

Hello,

Here is a solution. You must access the computed variables, which you
can do with ggplot_build (see ?ggplot_build).

Then pass them in the data argument.



p <- ggplot(scatter_data,aes(x=x_var,y=y_var)) +
  geom_point()+
  geom_smooth(se=TRUE,fill="blue",color="black",linetype="dashed")+
  theme_cowplot()

# this is a data.frame, relevant columns are x,  ymin and ymax
fit <- ggplot_build(p)$data[[2]]

p +
  geom_line(data = fit, aes(x, ymin), linetype = "dashed", linewidth = 1) +
  geom_line(data = fit, aes(x, ymax), linetype = "dashed", linewidth = 1)


Hope this helps,

Rui Barradas



Re: [R] geom_smooth

2023-08-11 Thread Berwin A Turlach
G'day Thomas,

On Sat, 12 Aug 2023 04:17:42 + (UTC)
Thomas Subia via R-help  wrote:

> Here is my reproducible code for a graph using geom_smooth

The call "library(tidyverse)" was missing. :)

> I'd like to add a black boundary around the shaded area. I suspect
> this can be done with geom_ribbon but I cannot figure this out. Some
> advice would be welcome.

This works for me:

ggplot(scatter_data,aes(x=x_var,y=y_var))+
  geom_point()+
  geom_smooth(se=TRUE,fill="blue",color="black",linetype="dashed") +
  geom_ribbon(stat="smooth", aes(ymin=after_stat(ymin), ymax=after_stat(ymax)),
              fill=NA, color="black")+
  theme_cowplot()
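For anyone who wants to run this without the tidyverse and cowplot, here is a self-contained variant that needs only ggplot2. Using data.frame() instead of tibble() and dropping theme_cowplot() are substitutions made for portability in this sketch, not part of the original code; after_stat() assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(55)
x_var <- runif(100, min = 0, max = 25)
scatter_data <- data.frame(x_var = x_var,
                           y_var = log2(x_var) + rnorm(100))

p <- ggplot(scatter_data, aes(x = x_var, y = y_var)) +
  geom_point() +
  # shaded confidence band with a dashed fitted line
  geom_smooth(se = TRUE, fill = "blue", color = "black",
              linetype = "dashed") +
  # unfilled ribbon on the same smoother: draws only the band's outline
  geom_ribbon(stat = "smooth",
              aes(ymin = after_stat(ymin), ymax = after_stat(ymax)),
              fill = NA, color = "black")

print(p)
```

The ribbon layer reuses the values computed by the smooth stat, so its outline should follow the band drawn by geom_smooth() as long as both layers use the same smoothing settings.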

Cheers,

Berwin



Re: [R] geom_smooth

2023-08-11 Thread CALUM POLWART
+ geom_ribbon(stat = "smooth",
              se = TRUE,
              alpha = 0, # or, use fill = NA
              colour = "black",
              linetype = "dotted")

Does that work?


