[R] boxplots and dotplots by color group 1 and shape group2

2021-09-16 Thread Yuan Chun Ding via R-help
Dear R users,

I generated a boxplots in combination of dotplots using the R code below for 
the attached test file.
boxplot(value~score, data = test, outpch = NA, xlab="",ylab="",xaxt='n',
cex.lab=1.2, cex.axis=1.2, main="Correlation of SCNV score and SCNV 
value ")
mtext(side=2, "SCNV value", line=2.3, cex = 1.3 )
stripchart(value~score, data = test,
   vertical = TRUE, method = "jitter",
   pch = 16, col = c("green", "brown4","black", "red"),
   add = TRUE)

now I hope to add a second group variable purity_cat in the test file, labeling 
it by four different shape, pch=c(3,4,16,8).

Is it possible to do add the second group variable?

Thank you,

Ding

--

-SECURITY/CONFIDENTIALITY WARNING-  

This message and any attachments are intended solely for the individual or 
entity to which they are addressed. This communication may contain information 
that is privileged, confidential, or exempt from disclosure under applicable 
law (e.g., personal health information, research data, financial information). 
Because this e-mail has been sent without encryption, individuals other than 
the intended recipient may be able to view the information, forward it to 
others or tamper with the information without the knowledge or consent of the 
sender. If you are not the intended recipient, or the employee or person 
responsible for delivering the message to the intended recipient, any 
dissemination, distribution or copying of the communication is strictly 
prohibited. If you received the communication in error, please notify the 
sender immediately by replying to this message and deleting the message and any 
accompanying files from your system. If, due to the security risks, you do not 
wish to receive further communications via e-mail, please reply to this message 
and inform the sender that you do not wish to receive further e-mail from the 
sender. (LCP301)

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] how to include if conditions in dplyr filter function

2021-10-26 Thread Yuan Chun Ding via R-help
Hi R users,

I thought the follow R code should work, but I got error, Can you fix my code?

Thank you,

Ding


outlier_tcga_MAD3 <- outlier_tcga %>% filter(n_two >0) %>% 
  mutate(freqMAD3_gain2ratio = N_MAD3_gain2/n_two )%>% 
  if (N_MAD3 < 9) {filter(freqMAD3_gain >=1)} else if (N_MAD3 > n_two*2 )
  {filter (freqMAD3_gain >= 0.8 & freqMAD3_gain2ratio >= 0.33)} 
else 
  {filter(freqMAD3_gain2 >=0.3 )} %>% 
  arrange(desc(N_MAD3))


Error in if (.) N_MAD3 < 9 else { : 
  argument is not interpretable as logical
In addition: Warning message:
In if (.) N_MAD3 < 9 else { :
  the condition has length > 1 and only the first element will be used

--

-SECURITY/CONFIDENTIALITY WARNING-  

This message and any attachments are intended solely for the individual or 
entity to which they are addressed. This communication may contain information 
that is privileged, confidential, or exempt from disclosure under applicable 
law (e.g., personal health information, research data, financial information). 
Because this e-mail has been sent without encryption, individuals other than 
the intended recipient may be able to view the information, forward it to 
others or tamper with the information without the knowledge or consent of the 
sender. If you are not the intended recipient, or the employee or person 
responsible for delivering the message to the intended recipient, any 
dissemination, distribution or copying of the communication is strictly 
prohibited. If you received the communication in error, please notify the 
sender immediately by replying to this message and deleting the message and any 
accompanying files from your system. If, due to the security risks, you do not 
wish to receive further communications via e-mail, please reply to this message 
and inform the sender that you do not wish to receive further e-mail from the 
sender. (LCP301)

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] my R code worked well when running the first 1000 lines of R code

2024-06-12 Thread Yuan Chun Ding via R-help
Hi R users,

The following code worked well to summarize four data groups in a dataframe for 
three variables (t_depth, t_alt_count, t_alt_ratio), 12 columns of summary, see 
attached.
However, after running another 2000 lines of R codes using functions from more 
than 10 other R  libraries, then it only generated one column of summary.
Do you know why?

Thank you,

Yuan Chun Ding

summary_anno1148ft <- anno1148ft %>%
  pivot_longer(c(t_depth, t_alt_count, t_alt_ratio), names_to = "measure") %>%
  group_by(dat, measure) %>%
  summarize(minimum = min(value,na.rm=T),
q25 = quantile(value, probs = 0.25,na.rm=T),
med = median(value,na.rm=T),
q75 = quantile(value, probs = 0.75,na.rm=T),
maximum = max(value,na.rm=T),
average = mean(value,na.rm=T),
#standard_deviation = sd(value),
.groups = "drop"
  )
summary_anno1148ft <-t(summary_anno1148ft)



--

-SECURITY/CONFIDENTIALITY WARNING-  

This message and any attachments are intended solely for the individual or 
entity to which they are addressed. This communication may contain information 
that is privileged, confidential, or exempt from disclosure under applicable 
law (e.g., personal health information, research data, financial information). 
Because this e-mail has been sent without encryption, individuals other than 
the intended recipient may be able to view the information, forward it to 
others or tamper with the information without the knowledge or consent of the 
sender. If you are not the intended recipient, or the employee or person 
responsible for delivering the message to the intended recipient, any 
dissemination, distribution or copying of the communication is strictly 
prohibited. If you received the communication in error, please notify the 
sender immediately by replying to this message and deleting the message and any 
accompanying files from your system. If, due to the security risks, you do not 
wish to receive further communications via e-mail, please reply to this message 
and inform the sender that you do not wish to receive further e-mail from the 
sender. (LCP301)

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] my R code worked well when running the first 1000 lines of R code

2024-06-12 Thread Yuan Chun Ding via R-help
I am sorry that I know I should provide a dataset that allows to replicate my 
problem.

It is a research dataset and quite large, so I can not share.

Both Bert and Tim guessed my problem correctly.  I also thought about the 
conflicting issue between different packages and function masking.
I just hope to that someone has similar experience, so providing me suggestion.

For conflicting issue,

What I tried  was to add dplyr::pivot_longer or tidyr:: pivot_longer, but still 
not resolved the problem.



I will restart from the first line my code, it will work again and then I will 
track down.



Thank you,

Ding


From: CALUM POLWART 
Sent: Wednesday, June 12, 2024 10:52 AM
To: Yuan Chun Ding 
Cc: r-help@r-project.org
Subject: Re: [R] my R code worked well when running the first 1000 lines of R 
code

I sometimes think people on this list are quite rude to posters. I'm afraid I'm 
likely to join in with some rudeness? 1. "Here is some code that works but also 
doesn't" is probably not going to get you an answer 2. I provide


I sometimes think people on this list are quite rude to posters.

I'm afraid I'm likely to join in with some rudeness?

1. "Here is some code that works but also doesn't" is probably not going to get 
you an answer
2. I provide no information about the data it works on or doesn't
3. I tell you I'm using a load of dependencies, but don't tell you what
4. I refer to 2000 lines of code but probably means 2000 lines of data?

So. Please post a question someone can actually answer.

If the question is "why might code fail on a 2000 line dataset when it works on 
1000 line dataset" then here are some thoughts:

* Is the 1000 lines being run as dataset[1:1000,] or is it dataset1 and 
dataset2 ?
* Is there a structural difference in the datasets - i.e. numbers, characters 
or factors as columns. Often import functions guess a column type by reading 
the first 500/1000 lines. If the data has numbers in column 1 for 1-1000 but on 
line 1999 has a letter... The data type may vary.

On Wed, 12 Jun 2024, 17:28 Yuan Chun Ding via R-help, 
mailto:r-help@r-project.org>> wrote:
Hi R users,

The following code worked well to summarize four data groups in a dataframe for 
three variables (t_depth, t_alt_count, t_alt_ratio), 12 columns of summary, see 
attached.
However, after running another 2000 lines of R codes using functions from more 
than 10 other R  libraries, then it only generated one column of summary.
Do you know why?

Thank you,

Yuan Chun Ding

summary_anno1148ft <- anno1148ft %>%
  pivot_longer(c(t_depth, t_alt_count, t_alt_ratio), names_to = "measure") %>%
  group_by(dat, measure) %>%
  summarize(minimum = min(value,na.rm=T),
q25 = quantile(value, probs = 0.25,na.rm=T),
med = median(value,na.rm=T),
q75 = quantile(value, probs = 0.75,na.rm=T),
maximum = max(value,na.rm=T),
average = mean(value,na.rm=T),
#standard_deviation = sd(value),
.groups = "drop"
  )
summary_anno1148ft <-t(summary_anno1148ft)



--

-SECURITY/CONFIDENTIALITY WARNING-

This message and any attachments are intended solely for the individual or 
entity to which they are addressed. This communication may contain information 
that is privileged, confidential, or exempt from disclosure under applicable 
law (e.g., personal health information, research data, financial information). 
Because this e-mail has been sent without encryption, individuals other than 
the intended recipient may be able to view the information, forward it to 
others or tamper with the information without the knowledge or consent of the 
sender. If you are not the intended recipient, or the employee or person 
responsible for delivering the message to the intended recipient, any 
dissemination, distribution or copying of the communication is strictly 
prohibited. If you received the communication in error, please notify the 
sender immediately by replying to this message and deleting the message and any 
accompanying files from your system. If, due to the security risks, you do not 
wish to rec
 eive further communications via e-mail, please reply to this message and 
inform the sender that you do not wish to receive further e-mail from the 
sender. (LCP301)

__
R-help@r-project.org<mailto:R-help@r-project.org> mailing list -- To 
UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help<https://urldefense.com/v3/__https:/stat.ethz.ch/mailman/listinfo/r-help__;!!Fou38LsQmgU!p3fE1cCl7_IxAOT0Fvr1vPWF3xDeYl1FCDaqXi4Z6HH7tOMmDULawS8DAa7XcG5s5PrfqmeMC0XA$>
PLEASE do read the posting guide 
ht

Re: [R] my R code worked well when running the first 1000 lines of R code

2024-06-12 Thread Yuan Chun Ding via R-help
Hi Rui,

Thank you very much!


Yes, I verified using real data, it worked correctly as expected after adding 
tidyr:: to the pivot_longer function and dplyr:: to the group_by and summarize
Function.

I did not know how to assign the tidyr and dplyr to the three functions because 
I do not really understand well the three functions and just got the code from 
a google search.

I also tried your simplified code, but got the following error
Error in `dplyr::summarize()`:
! Can't supply both `.by` and `.groups`.
Run `rlang::last_trace()` to see where the error occurred.

Ding

From: Rui Barradas 
Sent: Wednesday, June 12, 2024 11:29 AM
To: Yuan Chun Ding ; CALUM POLWART 
Cc: r-help@r-project.org
Subject: Re: [R] my R code worked well when running the first 1000 lines of R 
code

Hello, Inline. Às 19: 03 de 12/06/2024, Yuan Chun Ding via R-help escreveu: > I 
am sorry that I know I should provide a dataset that allows to replicate my 
problem. > > It is a research dataset and quite large, so I can not share. >


Hello,



Inline.



Às 19:03 de 12/06/2024, Yuan Chun Ding via R-help escreveu:

> I am sorry that I know I should provide a dataset that allows to replicate my 
> problem.

>

> It is a research dataset and quite large, so I can not share.

>

> Both Bert and Tim guessed my problem correctly.  I also thought about the 
> conflicting issue between different packages and function masking.

> I just hope to that someone has similar experience, so providing me 
> suggestion.

>

> For conflicting issue,

>

> What I tried  was to add dplyr::pivot_longer or tidyr:: pivot_longer,





Do that to all functions comming from contributed packages. At least to

those.





summary_anno1148ft <- anno1148ft %>%

   tidyr::pivot_longer(c(t_depth, t_alt_count, t_alt_ratio), names_to =

"measure") %>%

   dplyr::group_by(dat, measure) %>%

   dplyr::summarize(minimum = min(value,na.rm=T),

q25 = quantile(value, probs = 0.25,na.rm=T),

med = median(value,na.rm=T),

q75 = quantile(value, probs = 0.75,na.rm=T),

maximum = max(value,na.rm=T),

average = mean(value,na.rm=T),

#standard_deviation = sd(value),

.groups = "drop"

   )





Or, simpler, no need to group_by anymore. It can be done in summarise.





summary_anno1148ft <- anno1148ft %>%

   tidyr::pivot_longer(c(t_depth, t_alt_count, t_alt_ratio), names_to =

"measure") %>%

   dplyr::summarize(minimum = min(value,na.rm=T),

q25 = quantile(value, probs = 0.25,na.rm=T),

med = median(value,na.rm=T),

q75 = quantile(value, probs = 0.75,na.rm=T),

maximum = max(value,na.rm=T),

average = mean(value,na.rm=T),

#standard_deviation = sd(value),

.by = c(dat, measure),

.groups = "drop"

   )







This is only a guess, the question cannot really be answered.





Hope this helps,



Rui Barradas



but still not resolved the problem.

>

>

>

> I will restart from the first line my code, it will work again and then I 
> will track down.

>

>

>

> Thank you,

>

> Ding

>

>

> From: CALUM POLWART mailto:polc1...@gmail.com>>

> Sent: Wednesday, June 12, 2024 10:52 AM

> To: Yuan Chun Ding mailto:ycd...@coh.org>>

> Cc: r-help@r-project.org<mailto:r-help@r-project.org>

> Subject: Re: [R] my R code worked well when running the first 1000 lines of R 
> code

>

> I sometimes think people on this list are quite rude to posters. I'm afraid 
> I'm likely to join in with some rudeness? 1. "Here is some code that works 
> but also doesn't" is probably not going to get you an answer 2. I provide

>

>

> I sometimes think people on this list are quite rude to posters.

>

> I'm afraid I'm likely to join in with some rudeness?

>

> 1. "Here is some code that works but also doesn't" is probably not going to 
> get you an answer

> 2. I provide no information about the data it works on or doesn't

> 3. I tell you I'm using a load of dependencies, but don't tell you what

> 4. I refer to 2000 lines of code but probably means 2000 lines of data?

>

> So. Please post a question someone can actually answer.

>

> If the question is "why might code fail on a 2000 line dataset when it works 
> on 1000 line dataset" then here are some thoughts:

>

> * Is the 1000 lines being run as dataset[1:1000,] or is it dataset1 and 
> dataset2 ?

> * Is there a structural difference in the datasets - i.e. numbers, characters 
> or factors as columns. Often im

[R] please help generate a square correlation matrix

2024-07-25 Thread Yuan Chun Ding via R-help
Hi R users,

I generated a square correlation matrix for the dat dataframe below;
dat<-data.frame(g1=c(1,0,0,1,1,1,0,0,0),
g2=c(0,1,0,1,0,1,1,0,0),
g3=c(1,1,0,0,0,1,0,0,0),
g4=c(0,1,0,1,1,1,1,1,0))
library("Hmisc")
dat.rcorr = rcorr(as.matrix(dat))
dat.r <-round(dat.rcorr$r,2)

however, I want to modify this correlation calculation;
my dat has more than 1000 rows and 22 columns;
in each column, less than 10% values are 1, most of them are 0;
so I want to remove a  row with value of zero in both columns when calculate 
correlation between two columns.
I just want to check whether those values of 1 are correlated between two 
columns.
Please look at my code in the following;

cor.4gene <-matrix(0,nrow=4*4, ncol=4)
for (i in 1:4){
  #i=1
  for (j in 1:4) {
#j=1
d <-dat[,c(i,j)]%>%
  filter(eval(as.symbol(colnames(dat)[i]))!=0 |
   eval(as.symbol(colnames(dat)[j]))!=0)
c <-cor.test(d[,1],d[,2])
cor.4gene[i*j,]<-c(colnames(dat)[i],colnames(dat)[j],
c$estimate,c$p.value)
  }
}
cor.4gene<-as.data.frame(cor.4gene)%>%filter(V1 !=0)
colnames(cor.4gene)<-c("gene1","gene2","cor","P")

Can you tell me what mistakes I made?
first, why cor is NA when calculation of correlation for g1 and g1, I though it 
should be 1.

cor.4gene$cor[is.na(cor.4gene$cor)]<-1
cor.4gene$cor[is.na(cor.4gene$P)]<-0
cor.4gene.sq <-pivot_wider(cor.4gene, names_from = gene1, values_from = cor)

Then this line of code above did not generate a square matrix as what the HMisc 
library did.
How to fix my code?

Thank you,

Ding


--

-SECURITY/CONFIDENTIALITY WARNING-  

This message and any attachments are intended solely for the individual or 
entity to which they are addressed. This communication may contain information 
that is privileged, confidential, or exempt from disclosure under applicable 
law (e.g., personal health information, research data, financial information). 
Because this e-mail has been sent without encryption, individuals other than 
the intended recipient may be able to view the information, forward it to 
others or tamper with the information without the knowledge or consent of the 
sender. If you are not the intended recipient, or the employee or person 
responsible for delivering the message to the intended recipient, any 
dissemination, distribution or copying of the communication is strictly 
prohibited. If you received the communication in error, please notify the 
sender immediately by replying to this message and deleting the message and any 
accompanying files from your system. If, due to the security risks, you do not 
wish to rec
 eive further communications via e-mail, please reply to this message and 
inform the sender that you do not wish to receive further e-mail from the 
sender. (LCP301)


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] please help generate a square correlation matrix

2024-07-25 Thread Yuan Chun Ding via R-help
HI Rui,

Thank you for the  help!

You did not remove a row if zero values exist in both column pair, right?

Ding

From: Rui Barradas 
Sent: Thursday, July 25, 2024 11:15 AM
To: Yuan Chun Ding ; r-help@r-project.org
Subject: Re: [R] please help generate a square correlation matrix

Às 17: 39 de 25/07/2024, Yuan Chun Ding via R-help escreveu: > Hi R users, > > 
I generated a square correlation matrix for the dat dataframe below; > 
dat<-data. frame(g1=c(1,0,0,1,1,1,0,0,0), > g2=c(0,1,0,1,0,1,1,0,0), > 
g3=c(1,1,0,0,0,1,0,0,0),


Às 17:39 de 25/07/2024, Yuan Chun Ding via R-help escreveu:

> Hi R users,

>

> I generated a square correlation matrix for the dat dataframe below;

> dat<-data.frame(g1=c(1,0,0,1,1,1,0,0,0),

>  g2=c(0,1,0,1,0,1,1,0,0),

>  g3=c(1,1,0,0,0,1,0,0,0),

>  g4=c(0,1,0,1,1,1,1,1,0))

> library("Hmisc")

> dat.rcorr = rcorr(as.matrix(dat))

> dat.r <-round(dat.rcorr$r,2)

>

> however, I want to modify this correlation calculation;

> my dat has more than 1000 rows and 22 columns;

> in each column, less than 10% values are 1, most of them are 0;

> so I want to remove a  row with value of zero in both columns when calculate 
> correlation between two columns.

> I just want to check whether those values of 1 are correlated between two 
> columns.

> Please look at my code in the following;

>

> cor.4gene <-matrix(0,nrow=4*4, ncol=4)

> for (i in 1:4){

>#i=1

>for (j in 1:4) {

>  #j=1

>  d <-dat[,c(i,j)]%>%

>filter(eval(as.symbol(colnames(dat)[i]))!=0 |

> eval(as.symbol(colnames(dat)[j]))!=0)

>  c <-cor.test(d[,1],d[,2])

>  cor.4gene[i*j,]<-c(colnames(dat)[i],colnames(dat)[j],

>  c$estimate,c$p.value)

>}

> }

> cor.4gene<-as.data.frame(cor.4gene)%>%filter(V1 !=0)

> colnames(cor.4gene)<-c("gene1","gene2","cor","P")

>

> Can you tell me what mistakes I made?

> first, why cor is NA when calculation of correlation for g1 and g1, I though 
> it should be 1.

>

> cor.4gene$cor[is.na(cor.4gene$cor)]<-1

> cor.4gene$cor[is.na(cor.4gene$P)]<-0

> cor.4gene.sq <-pivot_wider(cor.4gene, names_from = gene1, values_from = cor)

>

> Then this line of code above did not generate a square matrix as what the 
> HMisc library did.

> How to fix my code?

>

> Thank you,

>

> Ding

>

>

> --

> 

> -SECURITY/CONFIDENTIALITY WARNING-

>

> This message and any attachments are intended solely for the individual or 
> entity to which they are addressed. This communication may contain 
> information that is privileged, confidential, or exempt from disclosure under 
> applicable law (e.g., personal health information, research data, financial 
> information). Because this e-mail has been sent without encryption, 
> individuals other than the intended recipient may be able to view the 
> information, forward it to others or tamper with the information without the 
> knowledge or consent of the sender. If you are not the intended recipient, or 
> the employee or person responsible for delivering the message to the intended 
> recipient, any dissemination, distribution or copying of the communication is 
> strictly prohibited. If you received the communication in error, please 
> notify the sender immediately by replying to this message and deleting the 
> message and any accompanying files from your system. If, due to the security 
> risks, you do not wish to rec

>   eive further communications via e-mail, please reply to this message and 
> inform the sender that you do not wish to receive further e-mail from the 
> sender. (LCP301)

> 

>

> [[alternative HTML version deleted]]

>

> __

> R-help@r-project.org<mailto:R-help@r-project.org> mailing list -- To 
> UNSUBSCRIBE and more, see

> https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-help__;!!Fou38LsQmgU!tyykZkQmOKcwoWXEpV2ohbnr02thhHMabAcYLL_-7dteKHAabK-eo4rGDnwgSFjniAy8SO00L6Hb8338TBM$<https://urldefense.com/v3/__https:/stat.ethz.ch/mailman/listinfo/r-help__;!!Fou38LsQmgU!tyykZkQmOKcwoWXEpV2ohbnr02thhHMabAcYLL_-7dteKHAabK-eo4rGDnwgSFjniAy8SO00L6Hb8338TBM$>

> PLEASE do read the posting guide 
> https://urldefense.com/v3/__http://www.R-project.org/posting-guide.html__;!!Fou38LsQmgU!tyykZkQmOKcwoWXEpV2ohbnr02thhHMabAcYLL_-7dteKHAabK-eo4rGDnwgSFjniAy8SO00L6Hb880tLw0$<https://urldefense.co

Re: [R] please help generate a square correlation matrix

2024-07-25 Thread Yuan Chun Ding via R-help
Hi Rui,

You are always very helpful!! Thank you,

I just modified your R codes to remove a row with zero values in both column 
pair as below for my real data.

Ding

dat<-gene22mut.coded
r <- P <- matrix(NA, nrow = 22L, ncol = 22L,
 dimnames = list(names(dat), names(dat)))

for(i in 1:22) {
  #i=1
  x <- dat[[i]]
  for(j in (1:22)) {
#j=2
if(i == j) {
  # there's nothing to test, assign correlation 1
  r[i, j] <- 1
} else {
  tmp <-cbind(x,dat[[j]])
  row0 <-rowSums(tmp)
  tem2 <-tmp[row0!=0,]
  tmp3 <- cor.test(tem2[,1],tem2[,2])
  r[i, j] <- tmp3$estimate
  P[i, j] <- tmp3$p.value
}
  }
}
r<-as.data.frame(r)
P<-as.data.frame(P)

From: R-help  On Behalf Of Yuan Chun Ding via 
R-help
Sent: Thursday, July 25, 2024 11:26 AM
To: Rui Barradas ; r-help@r-project.org
Subject: Re: [R] please help generate a square correlation matrix

HI Rui, Thank you for the help! You did not remove a row if zero values exist 
in both column pair, right? Ding From: Rui Barradas  
Sent: Thursday, July 25, 2024 11: 15 AM To: Yuan Chun Ding ;


HI Rui,



Thank you for the  help!



You did not remove a row if zero values exist in both column pair, right?



Ding



From: Rui Barradas mailto:ruipbarra...@sapo.pt>>

Sent: Thursday, July 25, 2024 11:15 AM

To: Yuan Chun Ding mailto:ycd...@coh.org>>; 
r-help@r-project.org<mailto:r-help@r-project.org>

Subject: Re: [R] please help generate a square correlation matrix



Às 17: 39 de 25/07/2024, Yuan Chun Ding via R-help escreveu: > Hi R users, > > 
I generated a square correlation matrix for the dat dataframe below; > 
dat<-data. frame(g1=c(1,0,0,1,1,1,0,0,0), > g2=c(0,1,0,1,0,1,1,0,0), > 
g3=c(1,1,0,0,0,1,0,0,0),





Às 17:39 de 25/07/2024, Yuan Chun Ding via R-help escreveu:



> Hi R users,



>



> I generated a square correlation matrix for the dat dataframe below;



> dat<-data.frame(g1=c(1,0,0,1,1,1,0,0,0),



>  g2=c(0,1,0,1,0,1,1,0,0),



>  g3=c(1,1,0,0,0,1,0,0,0),



>  g4=c(0,1,0,1,1,1,1,1,0))



> library("Hmisc")



> dat.rcorr = rcorr(as.matrix(dat))



> dat.r <-round(dat.rcorr$r,2)



>



> however, I want to modify this correlation calculation;



> my dat has more than 1000 rows and 22 columns;



> in each column, less than 10% values are 1, most of them are 0;



> so I want to remove a  row with value of zero in both columns when calculate 
> correlation between two columns.



> I just want to check whether those values of 1 are correlated between two 
> columns.



> Please look at my code in the following;



>



> cor.4gene <-matrix(0,nrow=4*4, ncol=4)



> for (i in 1:4){



>#i=1



>for (j in 1:4) {



>  #j=1



>  d <-dat[,c(i,j)]%>%



>filter(eval(as.symbol(colnames(dat)[i]))!=0 |



> eval(as.symbol(colnames(dat)[j]))!=0)



>  c <-cor.test(d[,1],d[,2])



>  cor.4gene[i*j,]<-c(colnames(dat)[i],colnames(dat)[j],



>  c$estimate,c$p.value)



>}



> }



> cor.4gene<-as.data.frame(cor.4gene)%>%filter(V1 !=0)



> colnames(cor.4gene)<-c("gene1","gene2","cor","P")



>



> Can you tell me what mistakes I made?



> first, why cor is NA when calculation of correlation for g1 and g1, I though 
> it should be 1.



>



> cor.4gene$cor[is.na(cor.4gene$cor)]<-1



> cor.4gene$cor[is.na(cor.4gene$P)]<-0



> cor.4gene.sq <-pivot_wider(cor.4gene, names_from = gene1, values_from = cor)



>



> Then this line of code above did not generate a square matrix as what the 
> HMisc library did.



> How to fix my code?



>



> Thank you,



>



> Ding



>



>



> --



> 



> -SECURITY/CONFIDENTIALITY WARNING-



>



> This message and any attachments are intended solely for the individual or 
> entity to which they are addressed. This communication may contain 
> information that is privileged, confidential, or exempt from disclosure under 
> applicable law (e.g., personal health information, research data, financial 
> information). Because this e-mail has been sent without encryption, 
> individuals other than the intended recipient may be able to view the 
> information, forward it to others or tamper with the information without the 
> knowledge or consent of the sender. If you are not the intended recipient, or 
> the employee or person responsible for delivering the message to the intended 
> recipient, any dissemination, distribution or copying of the communicat

Re: [R] please help generate a square correlation matrix

2024-07-27 Thread Yuan Chun Ding via R-help
>> in each column, less than 10% values are 1, most of them are 0;

> > >

> > >

> > >

> > >> so I want to remove a  row with value of zero in both columns when 
> > >> calculate correlation between two columns.

> > >

>

> So we're talking about correlations between binary variables.

> Suppose we have two 0-1-valued variables, x and y.

> Let A <- sum(x*y)  # number of cases where x and y are both 1.

> Let B <- sum(x)-a  # number of cases where x is 1 and y is 0

> Let C <- sum(y)-a # number of cases where y is 1 and x is 0

> Let D <- sum(!x * !y) # number of cases where x and y are both 0.

>

> N

>

> On Fri, 26 Jul 2024 at 12:07, Bert Gunter 
> mailto:bgunter.4...@gmail.com>> wrote:

> >

> > If I have understood the request, I'm not sure that omitting all 0

> > pairs for each pair of columns makes much sense, but be that as it

> > may, here's another way to do it by using the 'FUN' argument of combn

> > to encapsulate any calculations that you do. I just use cor() as the

> > calculation -- you can use anything you like that takes two vectors of

> > 0's and 1's and produces fixed length numeric results (or fromm which

> > you can extract such).

> >

> > I encapsulated it all in a little function. Note that I first

> > converted the data frame to a matrix. Because of their generality,

> > data frames carry a lot of extra baggage that can slow purely numeric

> > manipulations down.

> >

> > Anyway, here's the function, 'somecors' (I'm a bad name picker :(  ! )

> >

> >somecors <- function(dat, func = cor){

> >   dat <- as.matrix(dat)

> >   indx <- seq_len(ncol(dat))

> >  combn(indx, 2, FUN = \(z) {

> > i <- z[1]; j <- z[2]

> > k <- dat[, i ] | dat[, j ]

> > c(z,func(dat[k,i ], dat[k,j ]))

> >  })

> >}

> >

> > Results come out as a matrix with combn(ncol(dat),2) columns, the

> > first 2 rows giving the pair of column numbers for each column,and

> > then 1 or more rows (possibly extracted) from whatever func you use.

> > Here's the results for your data formatted to 2 decimal places:

> >

> > > round(somecors(dat),2)

> >  [,1]  [,2]  [,3]  [,4] [,5]  [,6]

> > [1,]  1.0  1.00  1.00  2.002  3.00

> > [2,]  2.0  3.00  4.00  3.004  4.00

> > [3,] -0.5 -0.41 -0.35 -0.41   NA -0.47

> > Warning message:

> > In func(dat[k, i], dat[k, j]) : the standard deviation is zero

> >

> > The NA and warning comes in the 2,4 pair of columns because after

> > removing all zero rows in the pair, dat[,4] is all 1's, giving a zero

> > in the denominator of the cor() calculation -- again, assuming I have

> > correctly understood your request. If so, this might be something you

> > need to worry about.

> >

> > Again, feel free to ignore if  I have misinterpreterd or this does not suit.

> >

> > Cheers,

> > Bert

> >

> >

> > On Thu, Jul 25, 2024 at 2:01 PM Rui Barradas 
> > mailto:ruipbarra...@sapo.pt>> wrote:

> > >

> > > Às 20:47 de 25/07/2024, Yuan Chun Ding escreveu:

> > > > Hi Rui,

> > > >

> > > > You are always very helpful!! Thank you,

> > > >

> > > > I just modified your R codes to remove a row with zero values in both 
> > > > column pair as below for my real data.

> > > >

> > > > Ding

> > > >

> > > > dat<-gene22mut.coded

> > > > r <- P <- matrix(NA, nrow = 22L, ncol = 22L,

> > > >   dimnames = list(names(dat), names(dat)))

> > > >

> > > > for(i in 1:22) {

> > > >#i=1

> > > >x <- dat[[i]]

> > > >for(j in (1:22)) {

> > > >  #j=2

> > > >  if(i == j) {

> > > ># there's nothing to test, assign correlation 1

> > > >r[i, j] <- 1

> > > >  } else {

> > > >tmp <-cbind(x,dat[[j]])

> > > >row0 <-rowSums(tmp)

> > > >tem2 <-tmp[row0!=0,]

> > > >tmp3 <- cor.test(tem2[,1],tem2[,2])

> > > >r[i, j] <- tmp3$estimate

> > > >P[i, j] <- tmp3$p.value

> > > >  }

> > > >}

> > > > }

> > > > r<-as.data.frame(r)

> > > 

Re: [R] please help generate a square correlation matrix

2024-07-27 Thread Yuan Chun Ding via R-help
and x is 0

>

> > Let D <- sum(!x * !y) # number of cases where x and y are both 0.

>

> >

>

> > N

>

> >

>

> > On Fri, 26 Jul 2024 at 12:07, Bert Gunter 
> > mailto:bgunter.4...@gmail.com>> wrote:

>

> > >

>

> > > If I have understood the request, I'm not sure that omitting all 0

>

> > > pairs for each pair of columns makes much sense, but be that as it

>

> > > may, here's another way to do it by using the 'FUN' argument of combn

>

> > > to encapsulate any calculations that you do. I just use cor() as the

>

> > > calculation -- you can use anything you like that takes two vectors of

>

> > > 0's and 1's and produces fixed length numeric results (or fromm which

>

> > > you can extract such).

>

> > >

>

> > > I encapsulated it all in a little function. Note that I first

>

> > > converted the data frame to a matrix. Because of their generality,

>

> > > data frames carry a lot of extra baggage that can slow purely numeric

>

> > > manipulations down.

>

> > >

>

> > > Anyway, here's the function, 'somecors' (I'm a bad name picker :(  ! )

>

> > >

>

> > >somecors <- function(dat, func = cor){

>

> > >   dat <- as.matrix(dat)

>

> > >   indx <- seq_len(ncol(dat))

>

> > >  combn(indx, 2, FUN = \(z) {

>

> > > i <- z[1]; j <- z[2]

>

> > > k <- dat[, i ] | dat[, j ]

>

> > > c(z,func(dat[k,i ], dat[k,j ]))

>

> > >  })

>

> > >}

>

> > >

>

> > > Results come out as a matrix with combn(ncol(dat),2) columns, the

>

> > > first 2 rows giving the pair of column numbers for each column,and

>

> > > then 1 or more rows (possibly extracted) from whatever func you use.

>

> > > Here's the results for your data formatted to 2 decimal places:

>

> > >

>

> > > > round(somecors(dat),2)

>

> > >  [,1]  [,2]  [,3]  [,4] [,5]  [,6]

>

> > > [1,]  1.0  1.00  1.00  2.002  3.00

>

> > > [2,]  2.0  3.00  4.00  3.004  4.00

>

> > > [3,] -0.5 -0.41 -0.35 -0.41   NA -0.47

>

> > > Warning message:

>

> > > In func(dat[k, i], dat[k, j]) : the standard deviation is zero

>

> > >

>

> > > The NA and warning comes in the 2,4 pair of columns because after

>

> > > removing all zero rows in the pair, dat[,4] is all 1's, giving a zero

>

> > > in the denominator of the cor() calculation -- again, assuming I have

>

> > > correctly understood your request. If so, this might be something you

>

> > > need to worry about.

>

> > >

>

> > > Again, feel free to ignore if  I have misinterpreterd or this does not 
> > > suit.

>

> > >

>

> > > Cheers,

>

> > > Bert

>

> > >

>

> > >

>

> > > On Thu, Jul 25, 2024 at 2:01 PM Rui Barradas 
> > > mailto:ruipbarra...@sapo.pt>> wrote:

>

> > > >

>

> > > > Às 20:47 de 25/07/2024, Yuan Chun Ding escreveu:

>

> > > > > Hi Rui,

>

> > > > >

>

> > > > > You are always very helpful!! Thank you,

>

> > > > >

>

> > > > > I just modified your R codes to remove a row with zero values in both 
> > > > > column pair as below for my real data.

>

> > > > >

>

> > > > > Ding

>

> > > > >

>

> > > > > dat<-gene22mut.coded

>

> > > > > r <- P <- matrix(NA, nrow = 22L, ncol = 22L,

>

> > > > >   dimnames = list(names(dat), names(dat)))

>

> > > > >

>

> > > > > for(i in 1:22) {

>

> > > > >#i=1

>

> > > > >x <- dat[[i]]

>

> > > > >for(j in (1:22)) {

>

> > > > >  #j=2

>

> > > > >  if(i == j) {

>

> > > > ># there's nothing to test, assign correlation 1

>

> > > > >r[i, j] <- 1

>

> > > > >  } else {

>

> > > > >tmp <-cbind(x,dat[[j]])

>

> > > > >row0 <-rowSums(tmp)

>

> > > > >    tem2 <-tmp[row0!

Re: [R] a fast way to do my job

2024-08-09 Thread Yuan Chun Ding via R-help
Dear R users,

I am running the following code below,  the gem751be.rpkm is a dataframe with 
dim of 751 samples by 35164 variables,  73 phenotypic variables in the furst to 
73rd column and 35091 genomic variables or genes in the 74th to 35164th 
columns.  What I need to do is to calculate the residuals for each gene using 
the simple linear regression model of genelist[i] ~ purity2;

The following code is running,  it takes long time, but I have an expensive 
ThinkStation window computer.
Can you provide a fast way to do it?

Thank you,

Ding

-


gem751be.rpkm <-merge(gem751be10, as.data.frame(t(rna849.fpkm2)),
+   by.x="id2",by.y=0)
>   row.names(gem751be.rpkm)<-gem751be.rpkm$id3
>   
> colnames(gem751be.rpkm)<-gsub(colnames(gem751be.rpkm),pattern="-",replacement="_")
>   genelist <- gem751be.rpkm %>% dplyr::select(74:35164)
>   residuals <- NULL
>   for (i in 1:length(genelist)) {
+ #i=1
+ formula <- reformulate("purity2", response=names(genelist)[i])
+ model <- lm(formula, data = gem751be.rpkm)
+ resi <- as.data.frame(residuals(model))
+ colnames(resi)[1]<-names(genelist)[i]
+ resi <-as.data.frame(t(resi))
+ residuals <- rbind(residuals, resi)
+   }



--

-SECURITY/CONFIDENTIALITY WARNING-  

This message and any attachments are intended solely for the individual or 
entity to which they are addressed. This communication may contain information 
that is privileged, confidential, or exempt from disclosure under applicable 
law (e.g., personal health information, research data, financial information). 
Because this e-mail has been sent without encryption, individuals other than 
the intended recipient may be able to view the information, forward it to 
others or tamper with the information without the knowledge or consent of the 
sender. If you are not the intended recipient, or the employee or person 
responsible for delivering the message to the intended recipient, any 
dissemination, distribution or copying of the communication is strictly 
prohibited. If you received the communication in error, please notify the 
sender immediately by replying to this message and deleting the message and any 
accompanying files from your system. If, due to the security risks, you do not 
wish to rec
 eive further communications via e-mail, please reply to this message and 
inform the sender that you do not wish to receive further e-mail from the 
sender. (LCP301)


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] a fast way to do my job

2024-08-10 Thread Yuan Chun Ding via R-help
HI Bert and Ben,

Yes, running lm.fit using the matrix format is much faster. I read a couple of 
online comments why it is faster.

However, the residual values for three tested variables or genes from lm 
function and lm.fit function are different, with Pearson correlation of 0.55, 
0.89, and 0.99.

I have not found the reason.

Thanks,

Ding

From: Bert Gunter 
Sent: Friday, August 9, 2024 7:11 PM
To: Ben Bolker 
Cc: Yuan Chun Ding ; r-help@r-project.org
Subject: Re: [R] a fast way to do my job

Better idea, Ben! It would work as you might expect it to to produce the same 
results as the above: ##first make sure your regressor is a matrix: pur2 <- 
matrix(purity2, ncol =1) ## convert the data frame variables into a matrix dat 
<-


Better idea, Ben!



It would work as you might expect it to to produce the same results as

the above:



##first make sure your regressor is a matrix:

pur2 <- matrix(purity2, ncol =1)

## convert the data frame variables into a matrix

dat <- as.matrix(gem751be.rpkm[ , 74:35164])

##then

result <- residuals(lm.fit( x= pur2, y = dat))



Cheers,

Bert



On Fri, Aug 9, 2024 at 6:38 PM Ben Bolker 
mailto:bbol...@gmail.com>> wrote:

>

> You can also fit a linear model with a matrix-valued response

> variable, which should be even faster (not sure off the top of my head

> how to get the residuals and reshape them to the dimensions you want)

>

> On Fri, Aug 9, 2024 at 9:31 PM Bert Gunter 
> mailto:bgunter.4...@gmail.com>> wrote:

> >

> > See ?lm.fit.

> > I must be missing something, because:

> >

> > results <- sapply(74:35164, \(i) residuals(lm.fit(purity2,

> > gem751be.rpkm[, i] )))

> >

> > would give you a 751 x 35091 matrix of the residuals from each of the

> > regressions.

> > I assume it will be considerably faster than all the overhead you are

> > carrying in your current code, but of course you'll have to try it and

> > see. ... Assuming that I have interpreted your request correctly.

> > Ignore if not.

> >

> > Cheers,

> > Bert

> >

> > On Fri, Aug 9, 2024 at 4:50 PM Yuan Chun Ding via R-help

> > mailto:r-help@r-project.org>> wrote:

> > >

> > > Dear R users,

> > >

> > > I am running the following code below,  the gem751be.rpkm is a dataframe 
> > > with dim of 751 samples by 35164 variables,  73 phenotypic variables in 
> > > the furst to 73rd column and 35091 genomic variables or genes in the 74th 
> > > to 35164th columns.  What I need to do is to calculate the residuals for 
> > > each gene using the simple linear regression model of genelist[i] ~ 
> > > purity2;

> > >

> > > The following code is running,  it takes long time, but I have an 
> > > expensive ThinkStation window computer.

> > > Can you provide a fast way to do it?

> > >

> > > Thank you,

> > >

> > > Ding

> > >

> > > -

> > >

> > >

> > > gem751be.rpkm <-merge(gem751be10, as.data.frame(t(rna849.fpkm2)),

> > > +   by.x="id2",by.y=0)

> > > >   row.names(gem751be.rpkm)<-gem751be.rpkm$id3

> > > >   
> > > > colnames(gem751be.rpkm)<-gsub(colnames(gem751be.rpkm),pattern="-",replacement="_")

> > > >   genelist <- gem751be.rpkm %>% dplyr::select(74:35164)

> > > >   residuals <- NULL

> > > >   for (i in 1:length(genelist)) {

> > > + #i=1

> > > + formula <- reformulate("purity2", response=names(genelist)[i])

> > > + model <- lm(formula, data = gem751be.rpkm)

> > > + resi <- as.data.frame(residuals(model))

> > > + colnames(resi)[1]<-names(genelist)[i]

> > > + resi <-as.data.frame(t(resi))

> > > + residuals <- rbind(residuals, resi)

> > > +   }

> > >

> > >

> > >

> > > --

> > > 

> > > -SECURITY/CONFIDENTIALITY WARNING-

> > >

> > > This message and any attachments are intended solely for the individual 
> > > or entity to which they are addressed. This communication may contain 
> > > information that is privileged, confidential, or exempt from disclosure 
> > > under applicable law (e.g., personal health information, research data, 
> > > financial information). Because this e-mail has been sent withou

Re: [R] a fast way to do my job

2024-08-10 Thread Yuan Chun Ding via R-help
after add intercept, all residuals are the same from lm or lm.fit.

Ding

From: Bert Gunter 
Sent: Saturday, August 10, 2024 1:00 PM
To: Yuan Chun Ding 
Cc: Ben Bolker ; r-help@r-project.org
Subject: Re: [R] a fast way to do my job

Probably because you inadvertently ran different models. Without your code, I 
haven't a clue. On Sat, Aug 10, 2024, 12: 29 Yuan Chun Ding  
wrote: HI Bert and Ben, Yes, running lm. fit using the matrix format is much 
faster. 


Probably because you inadvertently ran different models. Without your code, I 
haven't a clue.

On Sat, Aug 10, 2024, 12:29 Yuan Chun Ding 
mailto:ycd...@coh.org>> wrote:
HI Bert and Ben,

Yes, running lm.fit using the matrix format is much faster. I read a couple of 
online comments why it is faster.

However, the residual values for three tested variables or genes from lm 
function and lm.fit function are different, with Pearson correlation of 0.55, 
0.89, and 0.99.

I have not found the reason.

Thanks,

Ding

From: Bert Gunter mailto:bgunter.4...@gmail.com>>
Sent: Friday, August 9, 2024 7:11 PM
To: Ben Bolker mailto:bbol...@gmail.com>>
Cc: Yuan Chun Ding mailto:ycd...@coh.org>>; 
r-help@r-project.org<mailto:r-help@r-project.org>
Subject: Re: [R] a fast way to do my job

Better idea, Ben! It would work as you might expect it to to produce the same 
results as the above: ##first make sure your regressor is a matrix: pur2 <- 
matrix(purity2, ncol =1) ## convert the data frame variables into a matrix dat 
<-

Better idea, Ben!



It would work as you might expect it to to produce the same results as

the above:



##first make sure your regressor is a matrix:

pur2 <- matrix(purity2, ncol =1)

## convert the data frame variables into a matrix

dat <- as.matrix(gem751be.rpkm[ , 74:35164])

##then

result <- residuals(lm.fit( x= pur2, y = dat))



Cheers,

Bert



On Fri, Aug 9, 2024 at 6:38 PM Ben Bolker 
mailto:bbol...@gmail.com>> wrote:

>

> You can also fit a linear model with a matrix-valued response

> variable, which should be even faster (not sure off the top of my head

> how to get the residuals and reshape them to the dimensions you want)

>

> On Fri, Aug 9, 2024 at 9:31 PM Bert Gunter 
> mailto:bgunter.4...@gmail.com>> wrote:

> >

> > See ?lm.fit.

> > I must be missing something, because:

> >

> > results <- sapply(74:35164, \(i) residuals(lm.fit(purity2,

> > gem751be.rpkm[, i] )))

> >

> > would give you a 751 x 35091 matrix of the residuals from each of the

> > regressions.

> > I assume it will be considerably faster than all the overhead you are

> > carrying in your current code, but of course you'll have to try it and

> > see. ... Assuming that I have interpreted your request correctly.

> > Ignore if not.

> >

> > Cheers,

> > Bert

> >

> > On Fri, Aug 9, 2024 at 4:50 PM Yuan Chun Ding via R-help

> > mailto:r-help@r-project.org>> wrote:

> > >

> > > Dear R users,

> > >

> > > I am running the following code below,  the gem751be.rpkm is a dataframe 
> > > with dim of 751 samples by 35164 variables,  73 phenotypic variables in 
> > > the furst to 73rd column and 35091 genomic variables or genes in the 74th 
> > > to 35164th columns.  What I need to do is to calculate the residuals for 
> > > each gene using the simple linear regression model of genelist[i] ~ 
> > > purity2;

> > >

> > > The following code is running,  it takes long time, but I have an 
> > > expensive ThinkStation window computer.

> > > Can you provide a fast way to do it?

> > >

> > > Thank you,

> > >

> > > Ding

> > >

> > > -

> > >

> > >

> > > gem751be.rpkm <-merge(gem751be10, as.data.frame(t(rna849.fpkm2)),

> > > +   by.x="id2",by.y=0)

> > > >   row.names(gem751be.rpkm)<-gem751be.rpkm$id3

> > > >   
> > > > colnames(gem751be.rpkm)<-gsub(colnames(gem751be.rpkm),pattern="-",replacement="_")

> > > >   genelist <- gem751be.rpkm %>% dplyr::select(74:35164)

> > > >   residuals <- NULL

> > > >   for (i in 1:length(genelist)) {

> > > + #i=1

> > > + formula <- reformulate("purity2", response=names(genelist)[i])

> > > + model <- lm(formula, data = gem751be.rpkm)

> > > + resi <- as.data.frame(residuals(model))

> > > + colnames(resi)[1]<-names(genelist)[i]

> > > + resi <-as.data.frame(t(resi))

> > >

Re: [R] a fast way to do my job

2024-08-10 Thread Yuan Chun Ding via R-help
You are right.  I also just thought about that, no intercept is not applicable 
to my case.

Ding

From: Bert Gunter 
Sent: Saturday, August 10, 2024 1:06 PM
To: Yuan Chun Ding 
Cc: Ben Bolker ; r-help@r-project.org
Subject: Re: [R] a fast way to do my job

Ah, messages crossed. A no-intercept model **assumes** the straight line fit 
must pass through the origin. Unless there is a strong justification for such 
an assumption, you should include an intercept. -- Bert On Sat, Aug 10, 2024 at 
1: 02 PM


Ah, messages crossed.

A no-intercept model **assumes** the straight line fit must pass

through the origin. Unless there is a strong justification for such an

assumption, you should include an intercept.



-- Bert



On Sat, Aug 10, 2024 at 1:02 PM Bert Gunter 
mailto:bgunter.4...@gmail.com>> wrote:

>

> Is it because I failed to to add a column of ones for an intercept to

> the x matrix? TRhat would be my bad.

>

> -- Bert

>

>

> On Sat, Aug 10, 2024 at 12:59 PM Bert Gunter 
> mailto:bgunter.4...@gmail.com>> wrote:

> >

> > Probably because you inadvertently ran different models. Without your code, 
> > I haven't a clue.

> >

> >

> > On Sat, Aug 10, 2024, 12:29 Yuan Chun Ding 
> > mailto:ycd...@coh.org>> wrote:

> >>

> >> HI Bert and Ben,

> >>

> >>

> >>

> >> Yes, running lm.fit using the matrix format is much faster. I read a 
> >> couple of online comments why it is faster.

> >>

> >>

> >>

> >> However, the residual values for three tested variables or genes from lm 
> >> function and lm.fit function are different, with Pearson correlation of 
> >> 0.55, 0.89, and 0.99.

> >>

> >>

> >>

> >> I have not found the reason.

> >>

> >>

> >>

> >> Thanks,

> >>

> >>

> >> Ding

> >>

> >>

> >>

> >> From: Bert Gunter mailto:bgunter.4...@gmail.com>>

> >> Sent: Friday, August 9, 2024 7:11 PM

> >> To: Ben Bolker mailto:bbol...@gmail.com>>

> >> Cc: Yuan Chun Ding mailto:ycd...@coh.org>>; 
> >> r-help@r-project.org<mailto:r-help@r-project.org>

> >> Subject: Re: [R] a fast way to do my job

> >>

> >>

> >>

> >> Better idea, Ben! It would work as you might expect it to to produce the 
> >> same results as the above: ##first make sure your regressor is a matrix: 
> >> pur2 <- matrix(purity2, ncol =1) ## convert the data frame variables into 
> >> a matrix dat <-

> >>

> >> Better idea, Ben!

> >>

> >>

> >>

> >> It would work as you might expect it to to produce the same results as

> >>

> >> the above:

> >>

> >>

> >>

> >> ##first make sure your regressor is a matrix:

> >>

> >> pur2 <- matrix(purity2, ncol =1)

> >>

> >> ## convert the data frame variables into a matrix

> >>

> >> dat <- as.matrix(gem751be.rpkm[ , 74:35164])

> >>

> >> ##then

> >>

> >> result <- residuals(lm.fit( x= pur2, y = dat))

> >>

> >>

> >>

> >> Cheers,

> >>

> >> Bert

> >>

> >>

> >>

> >> On Fri, Aug 9, 2024 at 6:38 PM Ben Bolker 
> >> mailto:bbol...@gmail.com>> wrote:

> >>

> >> >

> >>

> >> > You can also fit a linear model with a matrix-valued response

> >>

> >> > variable, which should be even faster (not sure off the top of my head

> >>

> >> > how to get the residuals and reshape them to the dimensions you want)

> >>

> >> >

> >>

> >> > On Fri, Aug 9, 2024 at 9:31 PM Bert Gunter 
> >> > mailto:bgunter.4...@gmail.com>> wrote:

> >>

> >> > >

> >>

> >> > > See ?lm.fit.

> >>

> >> > > I must be missing something, because:

> >>

> >> > >

> >>

> >> > > results <- sapply(74:35164, \(i) residuals(lm.fit(purity2,

> >>

> >> > > gem751be.rpkm[, i] )))

> >>

> >> > >

> >>

> >> > > would give you a 751 x 35091 matrix of the residuals from each of the

> >>

> >> > > regressions.

> >>

> >> > > I assume it will be considerably faster than all the overhead you are

> >>

> >> >

Re: [R] a fast way to do my job

2024-08-10 Thread Yuan Chun Ding via R-help
Hi Bert and Ben,

Thanks a lot for your suggestion!!.

About the different residuals between lm function and lm.fit,  from online 
search,  lt seems like that I need to add an intercept in the design matrix x;

pur2 <- matrix(gem751be.rpkm$purity2, ncol =1)
  pur2.1 <- cbind(1,gem751be.rpkm$purity2)

then running result2 <- residuals(lm.fit( x= pur2.1, y = dat));
now I am thinking whether an intercept is required or not.

Ding
From: R-help  On Behalf Of Yuan Chun Ding via 
R-help
Sent: Saturday, August 10, 2024 12:30 PM
To: Bert Gunter ; Ben Bolker 
Cc: r-help@r-project.org
Subject: Re: [R] a fast way to do my job

HI Bert and Ben, Yes, running lm. fit using the matrix format is much faster. I 
read a couple of online comments why it is faster. However, the residual values 
for three tested variables or genes from lm function and lm. fit function are 
different,


HI Bert and Ben,



Yes, running lm.fit using the matrix format is much faster. I read a couple of 
online comments why it is faster.



However, the residual values for three tested variables or genes from lm 
function and lm.fit function are different, with Pearson correlation of 0.55, 
0.89, and 0.99.



I have not found the reason.



Thanks,



Ding



From: Bert Gunter mailto:bgunter.4...@gmail.com>>

Sent: Friday, August 9, 2024 7:11 PM

To: Ben Bolker mailto:bbol...@gmail.com>>

Cc: Yuan Chun Ding mailto:ycd...@coh.org>>; 
r-help@r-project.org<mailto:r-help@r-project.org>

Subject: Re: [R] a fast way to do my job



Better idea, Ben! It would work as you might expect it to to produce the same 
results as the above: ##first make sure your regressor is a matrix: pur2 <- 
matrix(purity2, ncol =1) ## convert the data frame variables into a matrix dat 
<-





Better idea, Ben!







It would work as you might expect it to to produce the same results as



the above:







##first make sure your regressor is a matrix:



pur2 <- matrix(purity2, ncol =1)



## convert the data frame variables into a matrix



dat <- as.matrix(gem751be.rpkm[ , 74:35164])



##then



result <- residuals(lm.fit( x= pur2, y = dat))







Cheers,



Bert







On Fri, Aug 9, 2024 at 6:38 PM Ben Bolker 
mailto:bbol...@gmail.com<mailto:bbol...@gmail.com%3cmailto:bbol...@gmail.com>>>
 wrote:



>



> You can also fit a linear model with a matrix-valued response



> variable, which should be even faster (not sure off the top of my head



> how to get the residuals and reshape them to the dimensions you want)



>



> On Fri, Aug 9, 2024 at 9:31 PM Bert Gunter 
> mailto:bgunter.4...@gmail.com<mailto:bgunter.4...@gmail.com%3cmailto:bgunter.4...@gmail.com>>>
>  wrote:



> >



> > See ?lm.fit.



> > I must be missing something, because:



> >



> > results <- sapply(74:35164, \(i) residuals(lm.fit(purity2,



> > gem751be.rpkm[, i] )))



> >



> > would give you a 751 x 35091 matrix of the residuals from each of the



> > regressions.



> > I assume it will be considerably faster than all the overhead you are



> > carrying in your current code, but of course you'll have to try it and



> > see. ... Assuming that I have interpreted your request correctly.



> > Ignore if not.



> >



> > Cheers,



> > Bert



> >



> > On Fri, Aug 9, 2024 at 4:50 PM Yuan Chun Ding via R-help



> > mailto:r-help@r-project.org<mailto:r-help@r-project.org%3cmailto:r-help@r-project.org>>>
> >  wrote:



> > >



> > > Dear R users,



> > >



> > > I am running the following code below,  the gem751be.rpkm is a dataframe 
> > > with dim of 751 samples by 35164 variables,  73 phenotypic variables in 
> > > the furst to 73rd column and 35091 genomic variables or genes in the 74th 
> > > to 35164th columns.  What I need to do is to calculate the residuals for 
> > > each gene using the simple linear regression model of genelist[i] ~ 
> > > purity2;



> > >



> > > The following code is running,  it takes long time, but I have an 
> > > expensive ThinkStation window computer.



> > > Can you provide a fast way to do it?



> > >



> > > Thank you,



> > >



> > > Ding



> > >



> > > -



> > >



> > >



> > > gem751be.rpkm <-merge(gem751be10, as.data.frame(t(rna849.fpkm2)),



> > > +   by.x="id2",by.y=0)



> > > >   row.names(gem751be.rpkm)<-gem751be.rpkm$id3



> > > >   
> > > > colnames(gem751be.rpkm)<-gsub(colnames(gem751be.rpkm),pattern="-",replacemen