[R] boxplots and dotplots by color group 1 and shape group2
Dear R users, I generated a boxplots in combination of dotplots using the R code below for the attached test file. boxplot(value~score, data = test, outpch = NA, xlab="",ylab="",xaxt='n', cex.lab=1.2, cex.axis=1.2, main="Correlation of SCNV score and SCNV value ") mtext(side=2, "SCNV value", line=2.3, cex = 1.3 ) stripchart(value~score, data = test, vertical = TRUE, method = "jitter", pch = 16, col = c("green", "brown4","black", "red"), add = TRUE) now I hope to add a second group variable purity_cat in the test file, labeling it by four different shape, pch=c(3,4,16,8). Is it possible to do add the second group variable? Thank you, Ding -- -SECURITY/CONFIDENTIALITY WARNING- This message and any attachments are intended solely for the individual or entity to which they are addressed. This communication may contain information that is privileged, confidential, or exempt from disclosure under applicable law (e.g., personal health information, research data, financial information). Because this e-mail has been sent without encryption, individuals other than the intended recipient may be able to view the information, forward it to others or tamper with the information without the knowledge or consent of the sender. If you are not the intended recipient, or the employee or person responsible for delivering the message to the intended recipient, any dissemination, distribution or copying of the communication is strictly prohibited. If you received the communication in error, please notify the sender immediately by replying to this message and deleting the message and any accompanying files from your system. If, due to the security risks, you do not wish to receive further communications via e-mail, please reply to this message and inform the sender that you do not wish to receive further e-mail from the sender. (LCP301) __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] how to include if conditions in dplyr filter function
Hi R users, I thought the follow R code should work, but I got error, Can you fix my code? Thank you, Ding outlier_tcga_MAD3 <- outlier_tcga %>% filter(n_two >0) %>% mutate(freqMAD3_gain2ratio = N_MAD3_gain2/n_two )%>% if (N_MAD3 < 9) {filter(freqMAD3_gain >=1)} else if (N_MAD3 > n_two*2 ) {filter (freqMAD3_gain >= 0.8 & freqMAD3_gain2ratio >= 0.33)} else {filter(freqMAD3_gain2 >=0.3 )} %>% arrange(desc(N_MAD3)) Error in if (.) N_MAD3 < 9 else { : argument is not interpretable as logical In addition: Warning message: In if (.) N_MAD3 < 9 else { : the condition has length > 1 and only the first element will be used -- -SECURITY/CONFIDENTIALITY WARNING- This message and any attachments are intended solely for the individual or entity to which they are addressed. This communication may contain information that is privileged, confidential, or exempt from disclosure under applicable law (e.g., personal health information, research data, financial information). Because this e-mail has been sent without encryption, individuals other than the intended recipient may be able to view the information, forward it to others or tamper with the information without the knowledge or consent of the sender. If you are not the intended recipient, or the employee or person responsible for delivering the message to the intended recipient, any dissemination, distribution or copying of the communication is strictly prohibited. If you received the communication in error, please notify the sender immediately by replying to this message and deleting the message and any accompanying files from your system. If, due to the security risks, you do not wish to receive further communications via e-mail, please reply to this message and inform the sender that you do not wish to receive further e-mail from the sender. (LCP301) __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] my R code worked well when running the first 1000 lines of R code
Hi R users, The following code worked well to summarize four data groups in a dataframe for three variables (t_depth, t_alt_count, t_alt_ratio), 12 columns of summary, see attached. However, after running another 2000 lines of R codes using functions from more than 10 other R libraries, then it only generated one column of summary. Do you know why? Thank you, Yuan Chun Ding summary_anno1148ft <- anno1148ft %>% pivot_longer(c(t_depth, t_alt_count, t_alt_ratio), names_to = "measure") %>% group_by(dat, measure) %>% summarize(minimum = min(value,na.rm=T), q25 = quantile(value, probs = 0.25,na.rm=T), med = median(value,na.rm=T), q75 = quantile(value, probs = 0.75,na.rm=T), maximum = max(value,na.rm=T), average = mean(value,na.rm=T), #standard_deviation = sd(value), .groups = "drop" ) summary_anno1148ft <-t(summary_anno1148ft) -- -SECURITY/CONFIDENTIALITY WARNING- This message and any attachments are intended solely for the individual or entity to which they are addressed. This communication may contain information that is privileged, confidential, or exempt from disclosure under applicable law (e.g., personal health information, research data, financial information). Because this e-mail has been sent without encryption, individuals other than the intended recipient may be able to view the information, forward it to others or tamper with the information without the knowledge or consent of the sender. If you are not the intended recipient, or the employee or person responsible for delivering the message to the intended recipient, any dissemination, distribution or copying of the communication is strictly prohibited. If you received the communication in error, please notify the sender immediately by replying to this message and deleting the message and any accompanying files from your system. If, due to the security risks, you do not wish to receive further communications via e-mail, please reply to this message and inform the sender that you do not wish to receive further e-mail from the sender. (LCP301) __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] my R code worked well when running the first 1000 lines of R code
I am sorry that I know I should provide a dataset that allows to replicate my problem. It is a research dataset and quite large, so I can not share. Both Bert and Tim guessed my problem correctly. I also thought about the conflicting issue between different packages and function masking. I just hope to that someone has similar experience, so providing me suggestion. For conflicting issue, What I tried was to add dplyr::pivot_longer or tidyr:: pivot_longer, but still not resolved the problem. I will restart from the first line my code, it will work again and then I will track down. Thank you, Ding From: CALUM POLWART Sent: Wednesday, June 12, 2024 10:52 AM To: Yuan Chun Ding Cc: r-help@r-project.org Subject: Re: [R] my R code worked well when running the first 1000 lines of R code I sometimes think people on this list are quite rude to posters. I'm afraid I'm likely to join in with some rudeness? 1. "Here is some code that works but also doesn't" is probably not going to get you an answer 2. I provide I sometimes think people on this list are quite rude to posters. I'm afraid I'm likely to join in with some rudeness? 1. "Here is some code that works but also doesn't" is probably not going to get you an answer 2. I provide no information about the data it works on or doesn't 3. I tell you I'm using a load of dependencies, but don't tell you what 4. I refer to 2000 lines of code but probably means 2000 lines of data? So. Please post a question someone can actually answer. If the question is "why might code fail on a 2000 line dataset when it works on 1000 line dataset" then here are some thoughts: * Is the 1000 lines being run as dataset[1:1000,] or is it dataset1 and dataset2 ? * Is there a structural difference in the datasets - i.e. numbers, characters or factors as columns. Often import functions guess a column type by reading the first 500/1000 lines. If the data has numbers in column 1 for 1-1000 but on line 1999 has a letter... The data type may vary. On Wed, 12 Jun 2024, 17:28 Yuan Chun Ding via R-help, mailto:r-help@r-project.org>> wrote: Hi R users, The following code worked well to summarize four data groups in a dataframe for three variables (t_depth, t_alt_count, t_alt_ratio), 12 columns of summary, see attached. However, after running another 2000 lines of R codes using functions from more than 10 other R libraries, then it only generated one column of summary. Do you know why? Thank you, Yuan Chun Ding summary_anno1148ft <- anno1148ft %>% pivot_longer(c(t_depth, t_alt_count, t_alt_ratio), names_to = "measure") %>% group_by(dat, measure) %>% summarize(minimum = min(value,na.rm=T), q25 = quantile(value, probs = 0.25,na.rm=T), med = median(value,na.rm=T), q75 = quantile(value, probs = 0.75,na.rm=T), maximum = max(value,na.rm=T), average = mean(value,na.rm=T), #standard_deviation = sd(value), .groups = "drop" ) summary_anno1148ft <-t(summary_anno1148ft) -- -SECURITY/CONFIDENTIALITY WARNING- This message and any attachments are intended solely for the individual or entity to which they are addressed. This communication may contain information that is privileged, confidential, or exempt from disclosure under applicable law (e.g., personal health information, research data, financial information). Because this e-mail has been sent without encryption, individuals other than the intended recipient may be able to view the information, forward it to others or tamper with the information without the knowledge or consent of the sender. If you are not the intended recipient, or the employee or person responsible for delivering the message to the intended recipient, any dissemination, distribution or copying of the communication is strictly prohibited. If you received the communication in error, please notify the sender immediately by replying to this message and deleting the message and any accompanying files from your system. If, due to the security risks, you do not wish to rec eive further communications via e-mail, please reply to this message and inform the sender that you do not wish to receive further e-mail from the sender. (LCP301) __ R-help@r-project.org<mailto:R-help@r-project.org> mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help<https://urldefense.com/v3/__https:/stat.ethz.ch/mailman/listinfo/r-help__;!!Fou38LsQmgU!p3fE1cCl7_IxAOT0Fvr1vPWF3xDeYl1FCDaqXi4Z6HH7tOMmDULawS8DAa7XcG5s5PrfqmeMC0XA$> PLEASE do read the posting guide ht
Re: [R] my R code worked well when running the first 1000 lines of R code
Hi Rui, Thank you very much! Yes, I verified using real data, it worked correctly as expected after adding tidyr:: to the pivot_longer function and dplyr:: to the group_by and summarize Function. I did not know how to assign the tidyr and dplyr to the three functions because I do not really understand well the three functions and just got the code from a google search. I also tried your simplified code, but got the following error Error in `dplyr::summarize()`: ! Can't supply both `.by` and `.groups`. Run `rlang::last_trace()` to see where the error occurred. Ding From: Rui Barradas Sent: Wednesday, June 12, 2024 11:29 AM To: Yuan Chun Ding ; CALUM POLWART Cc: r-help@r-project.org Subject: Re: [R] my R code worked well when running the first 1000 lines of R code Hello, Inline. Às 19: 03 de 12/06/2024, Yuan Chun Ding via R-help escreveu: > I am sorry that I know I should provide a dataset that allows to replicate my problem. > > It is a research dataset and quite large, so I can not share. > Hello, Inline. Às 19:03 de 12/06/2024, Yuan Chun Ding via R-help escreveu: > I am sorry that I know I should provide a dataset that allows to replicate my > problem. > > It is a research dataset and quite large, so I can not share. > > Both Bert and Tim guessed my problem correctly. I also thought about the > conflicting issue between different packages and function masking. > I just hope to that someone has similar experience, so providing me > suggestion. > > For conflicting issue, > > What I tried was to add dplyr::pivot_longer or tidyr:: pivot_longer, Do that to all functions comming from contributed packages. At least to those. summary_anno1148ft <- anno1148ft %>% tidyr::pivot_longer(c(t_depth, t_alt_count, t_alt_ratio), names_to = "measure") %>% dplyr::group_by(dat, measure) %>% dplyr::summarize(minimum = min(value,na.rm=T), q25 = quantile(value, probs = 0.25,na.rm=T), med = median(value,na.rm=T), q75 = quantile(value, probs = 0.75,na.rm=T), maximum = max(value,na.rm=T), average = mean(value,na.rm=T), #standard_deviation = sd(value), .groups = "drop" ) Or, simpler, no need to group_by anymore. It can be done in summarise. summary_anno1148ft <- anno1148ft %>% tidyr::pivot_longer(c(t_depth, t_alt_count, t_alt_ratio), names_to = "measure") %>% dplyr::summarize(minimum = min(value,na.rm=T), q25 = quantile(value, probs = 0.25,na.rm=T), med = median(value,na.rm=T), q75 = quantile(value, probs = 0.75,na.rm=T), maximum = max(value,na.rm=T), average = mean(value,na.rm=T), #standard_deviation = sd(value), .by = c(dat, measure), .groups = "drop" ) This is only a guess, the question cannot really be answered. Hope this helps, Rui Barradas but still not resolved the problem. > > > > I will restart from the first line my code, it will work again and then I > will track down. > > > > Thank you, > > Ding > > > From: CALUM POLWART mailto:polc1...@gmail.com>> > Sent: Wednesday, June 12, 2024 10:52 AM > To: Yuan Chun Ding mailto:ycd...@coh.org>> > Cc: r-help@r-project.org<mailto:r-help@r-project.org> > Subject: Re: [R] my R code worked well when running the first 1000 lines of R > code > > I sometimes think people on this list are quite rude to posters. I'm afraid > I'm likely to join in with some rudeness? 1. "Here is some code that works > but also doesn't" is probably not going to get you an answer 2. I provide > > > I sometimes think people on this list are quite rude to posters. > > I'm afraid I'm likely to join in with some rudeness? > > 1. "Here is some code that works but also doesn't" is probably not going to > get you an answer > 2. I provide no information about the data it works on or doesn't > 3. I tell you I'm using a load of dependencies, but don't tell you what > 4. I refer to 2000 lines of code but probably means 2000 lines of data? > > So. Please post a question someone can actually answer. > > If the question is "why might code fail on a 2000 line dataset when it works > on 1000 line dataset" then here are some thoughts: > > * Is the 1000 lines being run as dataset[1:1000,] or is it dataset1 and > dataset2 ? > * Is there a structural difference in the datasets - i.e. numbers, characters > or factors as columns. Often im
[R] please help generate a square correlation matrix
Hi R users, I generated a square correlation matrix for the dat dataframe below; dat<-data.frame(g1=c(1,0,0,1,1,1,0,0,0), g2=c(0,1,0,1,0,1,1,0,0), g3=c(1,1,0,0,0,1,0,0,0), g4=c(0,1,0,1,1,1,1,1,0)) library("Hmisc") dat.rcorr = rcorr(as.matrix(dat)) dat.r <-round(dat.rcorr$r,2) however, I want to modify this correlation calculation; my dat has more than 1000 rows and 22 columns; in each column, less than 10% values are 1, most of them are 0; so I want to remove a row with value of zero in both columns when calculate correlation between two columns. I just want to check whether those values of 1 are correlated between two columns. Please look at my code in the following; cor.4gene <-matrix(0,nrow=4*4, ncol=4) for (i in 1:4){ #i=1 for (j in 1:4) { #j=1 d <-dat[,c(i,j)]%>% filter(eval(as.symbol(colnames(dat)[i]))!=0 | eval(as.symbol(colnames(dat)[j]))!=0) c <-cor.test(d[,1],d[,2]) cor.4gene[i*j,]<-c(colnames(dat)[i],colnames(dat)[j], c$estimate,c$p.value) } } cor.4gene<-as.data.frame(cor.4gene)%>%filter(V1 !=0) colnames(cor.4gene)<-c("gene1","gene2","cor","P") Can you tell me what mistakes I made? first, why cor is NA when calculation of correlation for g1 and g1, I though it should be 1. cor.4gene$cor[is.na(cor.4gene$cor)]<-1 cor.4gene$cor[is.na(cor.4gene$P)]<-0 cor.4gene.sq <-pivot_wider(cor.4gene, names_from = gene1, values_from = cor) Then this line of code above did not generate a square matrix as what the HMisc library did. How to fix my code? Thank you, Ding -- -SECURITY/CONFIDENTIALITY WARNING- This message and any attachments are intended solely for the individual or entity to which they are addressed. This communication may contain information that is privileged, confidential, or exempt from disclosure under applicable law (e.g., personal health information, research data, financial information). Because this e-mail has been sent without encryption, individuals other than the intended recipient may be able to view the information, forward it to others or tamper with the information without the knowledge or consent of the sender. If you are not the intended recipient, or the employee or person responsible for delivering the message to the intended recipient, any dissemination, distribution or copying of the communication is strictly prohibited. If you received the communication in error, please notify the sender immediately by replying to this message and deleting the message and any accompanying files from your system. If, due to the security risks, you do not wish to rec eive further communications via e-mail, please reply to this message and inform the sender that you do not wish to receive further e-mail from the sender. (LCP301) [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] please help generate a square correlation matrix
HI Rui, Thank you for the help! You did not remove a row if zero values exist in both column pair, right? Ding From: Rui Barradas Sent: Thursday, July 25, 2024 11:15 AM To: Yuan Chun Ding ; r-help@r-project.org Subject: Re: [R] please help generate a square correlation matrix Às 17: 39 de 25/07/2024, Yuan Chun Ding via R-help escreveu: > Hi R users, > > I generated a square correlation matrix for the dat dataframe below; > dat<-data. frame(g1=c(1,0,0,1,1,1,0,0,0), > g2=c(0,1,0,1,0,1,1,0,0), > g3=c(1,1,0,0,0,1,0,0,0), Às 17:39 de 25/07/2024, Yuan Chun Ding via R-help escreveu: > Hi R users, > > I generated a square correlation matrix for the dat dataframe below; > dat<-data.frame(g1=c(1,0,0,1,1,1,0,0,0), > g2=c(0,1,0,1,0,1,1,0,0), > g3=c(1,1,0,0,0,1,0,0,0), > g4=c(0,1,0,1,1,1,1,1,0)) > library("Hmisc") > dat.rcorr = rcorr(as.matrix(dat)) > dat.r <-round(dat.rcorr$r,2) > > however, I want to modify this correlation calculation; > my dat has more than 1000 rows and 22 columns; > in each column, less than 10% values are 1, most of them are 0; > so I want to remove a row with value of zero in both columns when calculate > correlation between two columns. > I just want to check whether those values of 1 are correlated between two > columns. > Please look at my code in the following; > > cor.4gene <-matrix(0,nrow=4*4, ncol=4) > for (i in 1:4){ >#i=1 >for (j in 1:4) { > #j=1 > d <-dat[,c(i,j)]%>% >filter(eval(as.symbol(colnames(dat)[i]))!=0 | > eval(as.symbol(colnames(dat)[j]))!=0) > c <-cor.test(d[,1],d[,2]) > cor.4gene[i*j,]<-c(colnames(dat)[i],colnames(dat)[j], > c$estimate,c$p.value) >} > } > cor.4gene<-as.data.frame(cor.4gene)%>%filter(V1 !=0) > colnames(cor.4gene)<-c("gene1","gene2","cor","P") > > Can you tell me what mistakes I made? > first, why cor is NA when calculation of correlation for g1 and g1, I though > it should be 1. > > cor.4gene$cor[is.na(cor.4gene$cor)]<-1 > cor.4gene$cor[is.na(cor.4gene$P)]<-0 > cor.4gene.sq <-pivot_wider(cor.4gene, names_from = gene1, values_from = cor) > > Then this line of code above did not generate a square matrix as what the > HMisc library did. > How to fix my code? > > Thank you, > > Ding > > > -- > > -SECURITY/CONFIDENTIALITY WARNING- > > This message and any attachments are intended solely for the individual or > entity to which they are addressed. This communication may contain > information that is privileged, confidential, or exempt from disclosure under > applicable law (e.g., personal health information, research data, financial > information). Because this e-mail has been sent without encryption, > individuals other than the intended recipient may be able to view the > information, forward it to others or tamper with the information without the > knowledge or consent of the sender. If you are not the intended recipient, or > the employee or person responsible for delivering the message to the intended > recipient, any dissemination, distribution or copying of the communication is > strictly prohibited. If you received the communication in error, please > notify the sender immediately by replying to this message and deleting the > message and any accompanying files from your system. If, due to the security > risks, you do not wish to rec > eive further communications via e-mail, please reply to this message and > inform the sender that you do not wish to receive further e-mail from the > sender. (LCP301) > > > [[alternative HTML version deleted]] > > __ > R-help@r-project.org<mailto:R-help@r-project.org> mailing list -- To > UNSUBSCRIBE and more, see > https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-help__;!!Fou38LsQmgU!tyykZkQmOKcwoWXEpV2ohbnr02thhHMabAcYLL_-7dteKHAabK-eo4rGDnwgSFjniAy8SO00L6Hb8338TBM$<https://urldefense.com/v3/__https:/stat.ethz.ch/mailman/listinfo/r-help__;!!Fou38LsQmgU!tyykZkQmOKcwoWXEpV2ohbnr02thhHMabAcYLL_-7dteKHAabK-eo4rGDnwgSFjniAy8SO00L6Hb8338TBM$> > PLEASE do read the posting guide > https://urldefense.com/v3/__http://www.R-project.org/posting-guide.html__;!!Fou38LsQmgU!tyykZkQmOKcwoWXEpV2ohbnr02thhHMabAcYLL_-7dteKHAabK-eo4rGDnwgSFjniAy8SO00L6Hb880tLw0$<https://urldefense.co
Re: [R] please help generate a square correlation matrix
Hi Rui, You are always very helpful!! Thank you, I just modified your R codes to remove a row with zero values in both column pair as below for my real data. Ding dat<-gene22mut.coded r <- P <- matrix(NA, nrow = 22L, ncol = 22L, dimnames = list(names(dat), names(dat))) for(i in 1:22) { #i=1 x <- dat[[i]] for(j in (1:22)) { #j=2 if(i == j) { # there's nothing to test, assign correlation 1 r[i, j] <- 1 } else { tmp <-cbind(x,dat[[j]]) row0 <-rowSums(tmp) tem2 <-tmp[row0!=0,] tmp3 <- cor.test(tem2[,1],tem2[,2]) r[i, j] <- tmp3$estimate P[i, j] <- tmp3$p.value } } } r<-as.data.frame(r) P<-as.data.frame(P) From: R-help On Behalf Of Yuan Chun Ding via R-help Sent: Thursday, July 25, 2024 11:26 AM To: Rui Barradas ; r-help@r-project.org Subject: Re: [R] please help generate a square correlation matrix HI Rui, Thank you for the help! You did not remove a row if zero values exist in both column pair, right? Ding From: Rui Barradas Sent: Thursday, July 25, 2024 11: 15 AM To: Yuan Chun Ding ; HI Rui, Thank you for the help! You did not remove a row if zero values exist in both column pair, right? Ding From: Rui Barradas mailto:ruipbarra...@sapo.pt>> Sent: Thursday, July 25, 2024 11:15 AM To: Yuan Chun Ding mailto:ycd...@coh.org>>; r-help@r-project.org<mailto:r-help@r-project.org> Subject: Re: [R] please help generate a square correlation matrix Às 17: 39 de 25/07/2024, Yuan Chun Ding via R-help escreveu: > Hi R users, > > I generated a square correlation matrix for the dat dataframe below; > dat<-data. frame(g1=c(1,0,0,1,1,1,0,0,0), > g2=c(0,1,0,1,0,1,1,0,0), > g3=c(1,1,0,0,0,1,0,0,0), Às 17:39 de 25/07/2024, Yuan Chun Ding via R-help escreveu: > Hi R users, > > I generated a square correlation matrix for the dat dataframe below; > dat<-data.frame(g1=c(1,0,0,1,1,1,0,0,0), > g2=c(0,1,0,1,0,1,1,0,0), > g3=c(1,1,0,0,0,1,0,0,0), > g4=c(0,1,0,1,1,1,1,1,0)) > library("Hmisc") > dat.rcorr = rcorr(as.matrix(dat)) > dat.r <-round(dat.rcorr$r,2) > > however, I want to modify this correlation calculation; > my dat has more than 1000 rows and 22 columns; > in each column, less than 10% values are 1, most of them are 0; > so I want to remove a row with value of zero in both columns when calculate > correlation between two columns. > I just want to check whether those values of 1 are correlated between two > columns. > Please look at my code in the following; > > cor.4gene <-matrix(0,nrow=4*4, ncol=4) > for (i in 1:4){ >#i=1 >for (j in 1:4) { > #j=1 > d <-dat[,c(i,j)]%>% >filter(eval(as.symbol(colnames(dat)[i]))!=0 | > eval(as.symbol(colnames(dat)[j]))!=0) > c <-cor.test(d[,1],d[,2]) > cor.4gene[i*j,]<-c(colnames(dat)[i],colnames(dat)[j], > c$estimate,c$p.value) >} > } > cor.4gene<-as.data.frame(cor.4gene)%>%filter(V1 !=0) > colnames(cor.4gene)<-c("gene1","gene2","cor","P") > > Can you tell me what mistakes I made? > first, why cor is NA when calculation of correlation for g1 and g1, I though > it should be 1. > > cor.4gene$cor[is.na(cor.4gene$cor)]<-1 > cor.4gene$cor[is.na(cor.4gene$P)]<-0 > cor.4gene.sq <-pivot_wider(cor.4gene, names_from = gene1, values_from = cor) > > Then this line of code above did not generate a square matrix as what the > HMisc library did. > How to fix my code? > > Thank you, > > Ding > > > -- > > -SECURITY/CONFIDENTIALITY WARNING- > > This message and any attachments are intended solely for the individual or > entity to which they are addressed. This communication may contain > information that is privileged, confidential, or exempt from disclosure under > applicable law (e.g., personal health information, research data, financial > information). Because this e-mail has been sent without encryption, > individuals other than the intended recipient may be able to view the > information, forward it to others or tamper with the information without the > knowledge or consent of the sender. If you are not the intended recipient, or > the employee or person responsible for delivering the message to the intended > recipient, any dissemination, distribution or copying of the communicat
Re: [R] please help generate a square correlation matrix
>> in each column, less than 10% values are 1, most of them are 0; > > > > > > > > > > > >> so I want to remove a row with value of zero in both columns when > > >> calculate correlation between two columns. > > > > > So we're talking about correlations between binary variables. > Suppose we have two 0-1-valued variables, x and y. > Let A <- sum(x*y) # number of cases where x and y are both 1. > Let B <- sum(x)-a # number of cases where x is 1 and y is 0 > Let C <- sum(y)-a # number of cases where y is 1 and x is 0 > Let D <- sum(!x * !y) # number of cases where x and y are both 0. > > N > > On Fri, 26 Jul 2024 at 12:07, Bert Gunter > mailto:bgunter.4...@gmail.com>> wrote: > > > > If I have understood the request, I'm not sure that omitting all 0 > > pairs for each pair of columns makes much sense, but be that as it > > may, here's another way to do it by using the 'FUN' argument of combn > > to encapsulate any calculations that you do. I just use cor() as the > > calculation -- you can use anything you like that takes two vectors of > > 0's and 1's and produces fixed length numeric results (or fromm which > > you can extract such). > > > > I encapsulated it all in a little function. Note that I first > > converted the data frame to a matrix. Because of their generality, > > data frames carry a lot of extra baggage that can slow purely numeric > > manipulations down. > > > > Anyway, here's the function, 'somecors' (I'm a bad name picker :( ! ) > > > >somecors <- function(dat, func = cor){ > > dat <- as.matrix(dat) > > indx <- seq_len(ncol(dat)) > > combn(indx, 2, FUN = \(z) { > > i <- z[1]; j <- z[2] > > k <- dat[, i ] | dat[, j ] > > c(z,func(dat[k,i ], dat[k,j ])) > > }) > >} > > > > Results come out as a matrix with combn(ncol(dat),2) columns, the > > first 2 rows giving the pair of column numbers for each column,and > > then 1 or more rows (possibly extracted) from whatever func you use. > > Here's the results for your data formatted to 2 decimal places: > > > > > round(somecors(dat),2) > > [,1] [,2] [,3] [,4] [,5] [,6] > > [1,] 1.0 1.00 1.00 2.002 3.00 > > [2,] 2.0 3.00 4.00 3.004 4.00 > > [3,] -0.5 -0.41 -0.35 -0.41 NA -0.47 > > Warning message: > > In func(dat[k, i], dat[k, j]) : the standard deviation is zero > > > > The NA and warning comes in the 2,4 pair of columns because after > > removing all zero rows in the pair, dat[,4] is all 1's, giving a zero > > in the denominator of the cor() calculation -- again, assuming I have > > correctly understood your request. If so, this might be something you > > need to worry about. > > > > Again, feel free to ignore if I have misinterpreterd or this does not suit. > > > > Cheers, > > Bert > > > > > > On Thu, Jul 25, 2024 at 2:01 PM Rui Barradas > > mailto:ruipbarra...@sapo.pt>> wrote: > > > > > > Às 20:47 de 25/07/2024, Yuan Chun Ding escreveu: > > > > Hi Rui, > > > > > > > > You are always very helpful!! Thank you, > > > > > > > > I just modified your R codes to remove a row with zero values in both > > > > column pair as below for my real data. > > > > > > > > Ding > > > > > > > > dat<-gene22mut.coded > > > > r <- P <- matrix(NA, nrow = 22L, ncol = 22L, > > > > dimnames = list(names(dat), names(dat))) > > > > > > > > for(i in 1:22) { > > > >#i=1 > > > >x <- dat[[i]] > > > >for(j in (1:22)) { > > > > #j=2 > > > > if(i == j) { > > > ># there's nothing to test, assign correlation 1 > > > >r[i, j] <- 1 > > > > } else { > > > >tmp <-cbind(x,dat[[j]]) > > > >row0 <-rowSums(tmp) > > > >tem2 <-tmp[row0!=0,] > > > >tmp3 <- cor.test(tem2[,1],tem2[,2]) > > > >r[i, j] <- tmp3$estimate > > > >P[i, j] <- tmp3$p.value > > > > } > > > >} > > > > } > > > > r<-as.data.frame(r) > > >
Re: [R] please help generate a square correlation matrix
and x is 0 > > > Let D <- sum(!x * !y) # number of cases where x and y are both 0. > > > > > > N > > > > > > On Fri, 26 Jul 2024 at 12:07, Bert Gunter > > mailto:bgunter.4...@gmail.com>> wrote: > > > > > > > > If I have understood the request, I'm not sure that omitting all 0 > > > > pairs for each pair of columns makes much sense, but be that as it > > > > may, here's another way to do it by using the 'FUN' argument of combn > > > > to encapsulate any calculations that you do. I just use cor() as the > > > > calculation -- you can use anything you like that takes two vectors of > > > > 0's and 1's and produces fixed length numeric results (or fromm which > > > > you can extract such). > > > > > > > > I encapsulated it all in a little function. Note that I first > > > > converted the data frame to a matrix. Because of their generality, > > > > data frames carry a lot of extra baggage that can slow purely numeric > > > > manipulations down. > > > > > > > > Anyway, here's the function, 'somecors' (I'm a bad name picker :( ! ) > > > > > > > >somecors <- function(dat, func = cor){ > > > > dat <- as.matrix(dat) > > > > indx <- seq_len(ncol(dat)) > > > > combn(indx, 2, FUN = \(z) { > > > > i <- z[1]; j <- z[2] > > > > k <- dat[, i ] | dat[, j ] > > > > c(z,func(dat[k,i ], dat[k,j ])) > > > > }) > > > >} > > > > > > > > Results come out as a matrix with combn(ncol(dat),2) columns, the > > > > first 2 rows giving the pair of column numbers for each column,and > > > > then 1 or more rows (possibly extracted) from whatever func you use. > > > > Here's the results for your data formatted to 2 decimal places: > > > > > > > > > round(somecors(dat),2) > > > > [,1] [,2] [,3] [,4] [,5] [,6] > > > > [1,] 1.0 1.00 1.00 2.002 3.00 > > > > [2,] 2.0 3.00 4.00 3.004 4.00 > > > > [3,] -0.5 -0.41 -0.35 -0.41 NA -0.47 > > > > Warning message: > > > > In func(dat[k, i], dat[k, j]) : the standard deviation is zero > > > > > > > > The NA and warning comes in the 2,4 pair of columns because after > > > > removing all zero rows in the pair, dat[,4] is all 1's, giving a zero > > > > in the denominator of the cor() calculation -- again, assuming I have > > > > correctly understood your request. If so, this might be something you > > > > need to worry about. > > > > > > > > Again, feel free to ignore if I have misinterpreterd or this does not > > > suit. > > > > > > > > Cheers, > > > > Bert > > > > > > > > > > > > On Thu, Jul 25, 2024 at 2:01 PM Rui Barradas > > > mailto:ruipbarra...@sapo.pt>> wrote: > > > > > > > > > > Às 20:47 de 25/07/2024, Yuan Chun Ding escreveu: > > > > > > Hi Rui, > > > > > > > > > > > > You are always very helpful!! Thank you, > > > > > > > > > > > > I just modified your R codes to remove a row with zero values in both > > > > > column pair as below for my real data. > > > > > > > > > > > > Ding > > > > > > > > > > > > dat<-gene22mut.coded > > > > > > r <- P <- matrix(NA, nrow = 22L, ncol = 22L, > > > > > > dimnames = list(names(dat), names(dat))) > > > > > > > > > > > > for(i in 1:22) { > > > > > >#i=1 > > > > > >x <- dat[[i]] > > > > > >for(j in (1:22)) { > > > > > > #j=2 > > > > > > if(i == j) { > > > > > ># there's nothing to test, assign correlation 1 > > > > > >r[i, j] <- 1 > > > > > > } else { > > > > > >tmp <-cbind(x,dat[[j]]) > > > > > >row0 <-rowSums(tmp) > > > > > > tem2 <-tmp[row0!
Re: [R] a fast way to do my job
Dear R users, I am running the following code below, the gem751be.rpkm is a dataframe with dim of 751 samples by 35164 variables, 73 phenotypic variables in the furst to 73rd column and 35091 genomic variables or genes in the 74th to 35164th columns. What I need to do is to calculate the residuals for each gene using the simple linear regression model of genelist[i] ~ purity2; The following code is running, it takes long time, but I have an expensive ThinkStation window computer. Can you provide a fast way to do it? Thank you, Ding - gem751be.rpkm <-merge(gem751be10, as.data.frame(t(rna849.fpkm2)), + by.x="id2",by.y=0) > row.names(gem751be.rpkm)<-gem751be.rpkm$id3 > > colnames(gem751be.rpkm)<-gsub(colnames(gem751be.rpkm),pattern="-",replacement="_") > genelist <- gem751be.rpkm %>% dplyr::select(74:35164) > residuals <- NULL > for (i in 1:length(genelist)) { + #i=1 + formula <- reformulate("purity2", response=names(genelist)[i]) + model <- lm(formula, data = gem751be.rpkm) + resi <- as.data.frame(residuals(model)) + colnames(resi)[1]<-names(genelist)[i] + resi <-as.data.frame(t(resi)) + residuals <- rbind(residuals, resi) + } -- -SECURITY/CONFIDENTIALITY WARNING- This message and any attachments are intended solely for the individual or entity to which they are addressed. This communication may contain information that is privileged, confidential, or exempt from disclosure under applicable law (e.g., personal health information, research data, financial information). Because this e-mail has been sent without encryption, individuals other than the intended recipient may be able to view the information, forward it to others or tamper with the information without the knowledge or consent of the sender. If you are not the intended recipient, or the employee or person responsible for delivering the message to the intended recipient, any dissemination, distribution or copying of the communication is strictly prohibited. If you received the communication in error, please notify the sender immediately by replying to this message and deleting the message and any accompanying files from your system. If, due to the security risks, you do not wish to rec eive further communications via e-mail, please reply to this message and inform the sender that you do not wish to receive further e-mail from the sender. (LCP301) [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] a fast way to do my job
HI Bert and Ben, Yes, running lm.fit using the matrix format is much faster. I read a couple of online comments why it is faster. However, the residual values for three tested variables or genes from lm function and lm.fit function are different, with Pearson correlation of 0.55, 0.89, and 0.99. I have not found the reason. Thanks, Ding From: Bert Gunter Sent: Friday, August 9, 2024 7:11 PM To: Ben Bolker Cc: Yuan Chun Ding ; r-help@r-project.org Subject: Re: [R] a fast way to do my job Better idea, Ben! It would work as you might expect it to to produce the same results as the above: ##first make sure your regressor is a matrix: pur2 <- matrix(purity2, ncol =1) ## convert the data frame variables into a matrix dat <- Better idea, Ben! It would work as you might expect it to to produce the same results as the above: ##first make sure your regressor is a matrix: pur2 <- matrix(purity2, ncol =1) ## convert the data frame variables into a matrix dat <- as.matrix(gem751be.rpkm[ , 74:35164]) ##then result <- residuals(lm.fit( x= pur2, y = dat)) Cheers, Bert On Fri, Aug 9, 2024 at 6:38 PM Ben Bolker mailto:bbol...@gmail.com>> wrote: > > You can also fit a linear model with a matrix-valued response > variable, which should be even faster (not sure off the top of my head > how to get the residuals and reshape them to the dimensions you want) > > On Fri, Aug 9, 2024 at 9:31 PM Bert Gunter > mailto:bgunter.4...@gmail.com>> wrote: > > > > See ?lm.fit. > > I must be missing something, because: > > > > results <- sapply(74:35164, \(i) residuals(lm.fit(purity2, > > gem751be.rpkm[, i] ))) > > > > would give you a 751 x 35091 matrix of the residuals from each of the > > regressions. > > I assume it will be considerably faster than all the overhead you are > > carrying in your current code, but of course you'll have to try it and > > see. ... Assuming that I have interpreted your request correctly. > > Ignore if not. > > > > Cheers, > > Bert > > > > On Fri, Aug 9, 2024 at 4:50 PM Yuan Chun Ding via R-help > > mailto:r-help@r-project.org>> wrote: > > > > > > Dear R users, > > > > > > I am running the following code below, the gem751be.rpkm is a dataframe > > > with dim of 751 samples by 35164 variables, 73 phenotypic variables in > > > the furst to 73rd column and 35091 genomic variables or genes in the 74th > > > to 35164th columns. What I need to do is to calculate the residuals for > > > each gene using the simple linear regression model of genelist[i] ~ > > > purity2; > > > > > > The following code is running, it takes long time, but I have an > > > expensive ThinkStation window computer. > > > Can you provide a fast way to do it? > > > > > > Thank you, > > > > > > Ding > > > > > > - > > > > > > > > > gem751be.rpkm <-merge(gem751be10, as.data.frame(t(rna849.fpkm2)), > > > + by.x="id2",by.y=0) > > > > row.names(gem751be.rpkm)<-gem751be.rpkm$id3 > > > > > > > > colnames(gem751be.rpkm)<-gsub(colnames(gem751be.rpkm),pattern="-",replacement="_") > > > > genelist <- gem751be.rpkm %>% dplyr::select(74:35164) > > > > residuals <- NULL > > > > for (i in 1:length(genelist)) { > > > + #i=1 > > > + formula <- reformulate("purity2", response=names(genelist)[i]) > > > + model <- lm(formula, data = gem751be.rpkm) > > > + resi <- as.data.frame(residuals(model)) > > > + colnames(resi)[1]<-names(genelist)[i] > > > + resi <-as.data.frame(t(resi)) > > > + residuals <- rbind(residuals, resi) > > > + } > > > > > > > > > > > > -- > > > > > > -SECURITY/CONFIDENTIALITY WARNING- > > > > > > This message and any attachments are intended solely for the individual > > > or entity to which they are addressed. This communication may contain > > > information that is privileged, confidential, or exempt from disclosure > > > under applicable law (e.g., personal health information, research data, > > > financial information). Because this e-mail has been sent withou
Re: [R] a fast way to do my job
after add intercept, all residuals are the same from lm or lm.fit. Ding From: Bert Gunter Sent: Saturday, August 10, 2024 1:00 PM To: Yuan Chun Ding Cc: Ben Bolker ; r-help@r-project.org Subject: Re: [R] a fast way to do my job Probably because you inadvertently ran different models. Without your code, I haven't a clue. On Sat, Aug 10, 2024, 12: 29 Yuan Chun Ding wrote: HI Bert and Ben, Yes, running lm. fit using the matrix format is much faster. Probably because you inadvertently ran different models. Without your code, I haven't a clue. On Sat, Aug 10, 2024, 12:29 Yuan Chun Ding mailto:ycd...@coh.org>> wrote: HI Bert and Ben, Yes, running lm.fit using the matrix format is much faster. I read a couple of online comments why it is faster. However, the residual values for three tested variables or genes from lm function and lm.fit function are different, with Pearson correlation of 0.55, 0.89, and 0.99. I have not found the reason. Thanks, Ding From: Bert Gunter mailto:bgunter.4...@gmail.com>> Sent: Friday, August 9, 2024 7:11 PM To: Ben Bolker mailto:bbol...@gmail.com>> Cc: Yuan Chun Ding mailto:ycd...@coh.org>>; r-help@r-project.org<mailto:r-help@r-project.org> Subject: Re: [R] a fast way to do my job Better idea, Ben! It would work as you might expect it to to produce the same results as the above: ##first make sure your regressor is a matrix: pur2 <- matrix(purity2, ncol =1) ## convert the data frame variables into a matrix dat <- Better idea, Ben! It would work as you might expect it to to produce the same results as the above: ##first make sure your regressor is a matrix: pur2 <- matrix(purity2, ncol =1) ## convert the data frame variables into a matrix dat <- as.matrix(gem751be.rpkm[ , 74:35164]) ##then result <- residuals(lm.fit( x= pur2, y = dat)) Cheers, Bert On Fri, Aug 9, 2024 at 6:38 PM Ben Bolker mailto:bbol...@gmail.com>> wrote: > > You can also fit a linear model with a matrix-valued response > variable, which should be even faster (not sure off the top of my head > how to get the residuals and reshape them to the dimensions you want) > > On Fri, Aug 9, 2024 at 9:31 PM Bert Gunter > mailto:bgunter.4...@gmail.com>> wrote: > > > > See ?lm.fit. > > I must be missing something, because: > > > > results <- sapply(74:35164, \(i) residuals(lm.fit(purity2, > > gem751be.rpkm[, i] ))) > > > > would give you a 751 x 35091 matrix of the residuals from each of the > > regressions. > > I assume it will be considerably faster than all the overhead you are > > carrying in your current code, but of course you'll have to try it and > > see. ... Assuming that I have interpreted your request correctly. > > Ignore if not. > > > > Cheers, > > Bert > > > > On Fri, Aug 9, 2024 at 4:50 PM Yuan Chun Ding via R-help > > mailto:r-help@r-project.org>> wrote: > > > > > > Dear R users, > > > > > > I am running the following code below, the gem751be.rpkm is a dataframe > > > with dim of 751 samples by 35164 variables, 73 phenotypic variables in > > > the furst to 73rd column and 35091 genomic variables or genes in the 74th > > > to 35164th columns. What I need to do is to calculate the residuals for > > > each gene using the simple linear regression model of genelist[i] ~ > > > purity2; > > > > > > The following code is running, it takes long time, but I have an > > > expensive ThinkStation window computer. > > > Can you provide a fast way to do it? > > > > > > Thank you, > > > > > > Ding > > > > > > - > > > > > > > > > gem751be.rpkm <-merge(gem751be10, as.data.frame(t(rna849.fpkm2)), > > > + by.x="id2",by.y=0) > > > > row.names(gem751be.rpkm)<-gem751be.rpkm$id3 > > > > > > > > colnames(gem751be.rpkm)<-gsub(colnames(gem751be.rpkm),pattern="-",replacement="_") > > > > genelist <- gem751be.rpkm %>% dplyr::select(74:35164) > > > > residuals <- NULL > > > > for (i in 1:length(genelist)) { > > > + #i=1 > > > + formula <- reformulate("purity2", response=names(genelist)[i]) > > > + model <- lm(formula, data = gem751be.rpkm) > > > + resi <- as.data.frame(residuals(model)) > > > + colnames(resi)[1]<-names(genelist)[i] > > > + resi <-as.data.frame(t(resi)) > > >
Re: [R] a fast way to do my job
You are right. I also just thought about that, no intercept is not applicable to my case. Ding From: Bert Gunter Sent: Saturday, August 10, 2024 1:06 PM To: Yuan Chun Ding Cc: Ben Bolker ; r-help@r-project.org Subject: Re: [R] a fast way to do my job Ah, messages crossed. A no-intercept model **assumes** the straight line fit must pass through the origin. Unless there is a strong justification for such an assumption, you should include an intercept. -- Bert On Sat, Aug 10, 2024 at 1: 02 PM Ah, messages crossed. A no-intercept model **assumes** the straight line fit must pass through the origin. Unless there is a strong justification for such an assumption, you should include an intercept. -- Bert On Sat, Aug 10, 2024 at 1:02 PM Bert Gunter mailto:bgunter.4...@gmail.com>> wrote: > > Is it because I failed to to add a column of ones for an intercept to > the x matrix? TRhat would be my bad. > > -- Bert > > > On Sat, Aug 10, 2024 at 12:59 PM Bert Gunter > mailto:bgunter.4...@gmail.com>> wrote: > > > > Probably because you inadvertently ran different models. Without your code, > > I haven't a clue. > > > > > > On Sat, Aug 10, 2024, 12:29 Yuan Chun Ding > > mailto:ycd...@coh.org>> wrote: > >> > >> HI Bert and Ben, > >> > >> > >> > >> Yes, running lm.fit using the matrix format is much faster. I read a > >> couple of online comments why it is faster. > >> > >> > >> > >> However, the residual values for three tested variables or genes from lm > >> function and lm.fit function are different, with Pearson correlation of > >> 0.55, 0.89, and 0.99. > >> > >> > >> > >> I have not found the reason. > >> > >> > >> > >> Thanks, > >> > >> > >> Ding > >> > >> > >> > >> From: Bert Gunter mailto:bgunter.4...@gmail.com>> > >> Sent: Friday, August 9, 2024 7:11 PM > >> To: Ben Bolker mailto:bbol...@gmail.com>> > >> Cc: Yuan Chun Ding mailto:ycd...@coh.org>>; > >> r-help@r-project.org<mailto:r-help@r-project.org> > >> Subject: Re: [R] a fast way to do my job > >> > >> > >> > >> Better idea, Ben! It would work as you might expect it to to produce the > >> same results as the above: ##first make sure your regressor is a matrix: > >> pur2 <- matrix(purity2, ncol =1) ## convert the data frame variables into > >> a matrix dat <- > >> > >> Better idea, Ben! > >> > >> > >> > >> It would work as you might expect it to to produce the same results as > >> > >> the above: > >> > >> > >> > >> ##first make sure your regressor is a matrix: > >> > >> pur2 <- matrix(purity2, ncol =1) > >> > >> ## convert the data frame variables into a matrix > >> > >> dat <- as.matrix(gem751be.rpkm[ , 74:35164]) > >> > >> ##then > >> > >> result <- residuals(lm.fit( x= pur2, y = dat)) > >> > >> > >> > >> Cheers, > >> > >> Bert > >> > >> > >> > >> On Fri, Aug 9, 2024 at 6:38 PM Ben Bolker > >> mailto:bbol...@gmail.com>> wrote: > >> > >> > > >> > >> > You can also fit a linear model with a matrix-valued response > >> > >> > variable, which should be even faster (not sure off the top of my head > >> > >> > how to get the residuals and reshape them to the dimensions you want) > >> > >> > > >> > >> > On Fri, Aug 9, 2024 at 9:31 PM Bert Gunter > >> > mailto:bgunter.4...@gmail.com>> wrote: > >> > >> > > > >> > >> > > See ?lm.fit. > >> > >> > > I must be missing something, because: > >> > >> > > > >> > >> > > results <- sapply(74:35164, \(i) residuals(lm.fit(purity2, > >> > >> > > gem751be.rpkm[, i] ))) > >> > >> > > > >> > >> > > would give you a 751 x 35091 matrix of the residuals from each of the > >> > >> > > regressions. > >> > >> > > I assume it will be considerably faster than all the overhead you are > >> > >> >
Re: [R] a fast way to do my job
Hi Bert and Ben, Thanks a lot for your suggestion!!. About the different residuals between lm function and lm.fit, from online search, lt seems like that I need to add an intercept in the design matrix x; pur2 <- matrix(gem751be.rpkm$purity2, ncol =1) pur2.1 <- cbind(1,gem751be.rpkm$purity2) then running result2 <- residuals(lm.fit( x= pur2.1, y = dat)); now I am thinking whether an intercept is required or not. Ding From: R-help On Behalf Of Yuan Chun Ding via R-help Sent: Saturday, August 10, 2024 12:30 PM To: Bert Gunter ; Ben Bolker Cc: r-help@r-project.org Subject: Re: [R] a fast way to do my job HI Bert and Ben, Yes, running lm. fit using the matrix format is much faster. I read a couple of online comments why it is faster. However, the residual values for three tested variables or genes from lm function and lm. fit function are different, HI Bert and Ben, Yes, running lm.fit using the matrix format is much faster. I read a couple of online comments why it is faster. However, the residual values for three tested variables or genes from lm function and lm.fit function are different, with Pearson correlation of 0.55, 0.89, and 0.99. I have not found the reason. Thanks, Ding From: Bert Gunter mailto:bgunter.4...@gmail.com>> Sent: Friday, August 9, 2024 7:11 PM To: Ben Bolker mailto:bbol...@gmail.com>> Cc: Yuan Chun Ding mailto:ycd...@coh.org>>; r-help@r-project.org<mailto:r-help@r-project.org> Subject: Re: [R] a fast way to do my job Better idea, Ben! It would work as you might expect it to to produce the same results as the above: ##first make sure your regressor is a matrix: pur2 <- matrix(purity2, ncol =1) ## convert the data frame variables into a matrix dat <- Better idea, Ben! It would work as you might expect it to to produce the same results as the above: ##first make sure your regressor is a matrix: pur2 <- matrix(purity2, ncol =1) ## convert the data frame variables into a matrix dat <- as.matrix(gem751be.rpkm[ , 74:35164]) ##then result <- residuals(lm.fit( x= pur2, y = dat)) Cheers, Bert On Fri, Aug 9, 2024 at 6:38 PM Ben Bolker mailto:bbol...@gmail.com<mailto:bbol...@gmail.com%3cmailto:bbol...@gmail.com>>> wrote: > > You can also fit a linear model with a matrix-valued response > variable, which should be even faster (not sure off the top of my head > how to get the residuals and reshape them to the dimensions you want) > > On Fri, Aug 9, 2024 at 9:31 PM Bert Gunter > mailto:bgunter.4...@gmail.com<mailto:bgunter.4...@gmail.com%3cmailto:bgunter.4...@gmail.com>>> > wrote: > > > > See ?lm.fit. > > I must be missing something, because: > > > > results <- sapply(74:35164, \(i) residuals(lm.fit(purity2, > > gem751be.rpkm[, i] ))) > > > > would give you a 751 x 35091 matrix of the residuals from each of the > > regressions. > > I assume it will be considerably faster than all the overhead you are > > carrying in your current code, but of course you'll have to try it and > > see. ... Assuming that I have interpreted your request correctly. > > Ignore if not. > > > > Cheers, > > Bert > > > > On Fri, Aug 9, 2024 at 4:50 PM Yuan Chun Ding via R-help > > mailto:r-help@r-project.org<mailto:r-help@r-project.org%3cmailto:r-help@r-project.org>>> > > wrote: > > > > > > Dear R users, > > > > > > I am running the following code below, the gem751be.rpkm is a dataframe > > > with dim of 751 samples by 35164 variables, 73 phenotypic variables in > > > the furst to 73rd column and 35091 genomic variables or genes in the 74th > > > to 35164th columns. What I need to do is to calculate the residuals for > > > each gene using the simple linear regression model of genelist[i] ~ > > > purity2; > > > > > > The following code is running, it takes long time, but I have an > > > expensive ThinkStation window computer. > > > Can you provide a fast way to do it? > > > > > > Thank you, > > > > > > Ding > > > > > > - > > > > > > > > > gem751be.rpkm <-merge(gem751be10, as.data.frame(t(rna849.fpkm2)), > > > + by.x="id2",by.y=0) > > > > row.names(gem751be.rpkm)<-gem751be.rpkm$id3 > > > > > > > > colnames(gem751be.rpkm)<-gsub(colnames(gem751be.rpkm),pattern="-",replacemen