Hi John and Bert, Thank you so much for your replies. Both of your scripts worked well, so now I've learnt two ways to do it. :)
Bert: I was not very clear about what I wanted to do. I would just like to calculate the ratios for the residues shown in the table, not for all residues. The *apply* functions are amazing!

John: I am still digesting the code, and I am not sure I fully understand the argument .(variable, value) in the *ddply* line. The description of *ddply* says that .variables gives the variables to split the data frame by, as quoted variables, a formula or a character vector. So does .(variable, value) tell R to split the data frame by value, that is, by the type of amino acid residue?

Thank you all again.

Cheers,
Zhao

2012/7/24 Bert Gunter <gunter.ber...@gene.com>

> ... and I neglected to mention that f = myfile[,2]
>
> Sigh.... More coffee needed.
>
> -- Bert
>
> On Tue, Jul 24, 2012 at 9:43 AM, Bert Gunter <bgun...@gene.com> wrote:
> > Sorry. Typo in my previous. Should be:
> >
> >> sapply(myfile[,-c(1,2)],function(x)prop.table(tapply(f,x,sum)))
> > $X1
> >          L          R          T
> > 0.91491320 0.03675651 0.04833030
> >
> > $X2
> >         E         M
> > 0.9827278 0.0172722
> >
> > $X3
> >         N         Y
> > 0.0483303 0.9516697
> >
> > $X4
> >         I         L         Q
> > 0.8976410 0.0850868 0.0172722
> >
> > $X5
> >         I         V
> > 0.9516697 0.0483303
> >
> > $X6
> >          P          S
> > 0.96324349 0.03675651
> >
> > $X7
> >         D         E         G
> > 0.8976410 0.0540287 0.0483303
> >
> > $X8
> >         A         C
> > 0.9827278 0.0172722
> >
> >
> > On Tue, Jul 24, 2012 at 9:37 AM, Bert Gunter <bgun...@gene.com> wrote:
> >> OK, I admit it: I re-read what you wrote and now I'm confused. Is:
> >>
> >>> sapply(myfile[,-c(1,2)],function(x)prop.table(tapply(f,x)))
> >>
> >>             X1  X2        X3    X4  X5  X6    X7  X8
> >> [1,] 0.1428571 0.2 0.2857143 0.125 0.2 0.2 0.125 0.2
> >> [2,] 0.4285714 0.2 0.1428571 0.250 0.4 0.2 0.375 0.2
> >> [3,] 0.1428571 0.4 0.2857143 0.375 0.2 0.2 0.250 0.4
> >> [4,] 0.2857143 0.2 0.2857143 0.250 0.2 0.4 0.250 0.2
> >>
> >> what you want?
> >>
> >> -- Bert
> >> On Tue, Jul 24, 2012 at 9:17 AM, Bert Gunter <bgun...@gene.com> wrote:
> >>> The OP's request is a bit ambiguous to me: at a given residue, do you
> >>> wish to calculate the proportions for only those amino acids that
> >>> appear at that residue, or do you wish to include the proportions for
> >>> all amino acids, some of which might then be 0?
> >>>
> >>> Assuming the former, then I don't think one needs to go to the lengths
> >>> described by John below.
> >>>
> >>> Using your example (thanks!), the following seems to suffice:
> >>>
> >>>> sapply(myfile[,-c(1,2)],function(x)prop.table(table(x)))
> >>>
> >>> $X1
> >>> x
> >>>    L    R    T
> >>> 0.50 0.25 0.25
> >>>
> >>> $X2
> >>> x
> >>>    E    M
> >>> 0.75 0.25
> >>>
> >>> $X3
> >>> x
> >>>    N    Y
> >>> 0.25 0.75
> >>>
> >>> $X4
> >>> x
> >>>    I    L    Q
> >>> 0.25 0.50 0.25
> >>>
> >>> $X5
> >>> x
> >>>    I    V
> >>> 0.75 0.25
> >>>
> >>> $X6
> >>> x
> >>>    P    S
> >>> 0.75 0.25
> >>>
> >>> $X7
> >>> x
> >>>    D    E    G
> >>> 0.25 0.50 0.25
> >>>
> >>> $X8
> >>> x
> >>>    A    C
> >>> 0.75 0.25
> >>>
> >>>
> >>> This could, of course, then be modified to add zero proportions for
> >>> all non-appearing amino acids.
> >>>
> >>> -- Cheers,
> >>> Bert
> >>>
> >>> On Tue, Jul 24, 2012 at 8:18 AM, John Kane <jrkrid...@inbox.com> wrote:
> >>>>
> >>>> I think this does what you want using two packages, plyr and reshape2,
> >>>> that you may have to install. If so,
> >>>> install.packages(c("plyr", "reshape2")) should do the trick.
> >>>> library(plyr)
> >>>> library(reshape2)
> >>>> # using supplied file 'myfile' from below
> >>>> time0total = sum(myfile[,2])
> >>>> mydata <- myfile[, 2:10]
> >>>> md1 <- melt(mydata, id = "Time_zero")
> >>>> ddply(md1, .(variable, value), summarise, sum = sum(Time_zero)/time0total)
> >>>>
> >>>>
> >>>> John Kane
> >>>> Kingston ON Canada
> >>>>
> >>>> -----Original Message-----
> >>>> From: z...@cornell.edu
> >>>> Sent: Tue, 24 Jul 2012 10:25:21 -0400
> >>>> To: jrkrid...@inbox.com
> >>>> Subject: Re: [R] How to do the same thing for all levels of a column?
> >>>>
> >>>> Hi John,
> >>>> Thank you for the tips. My apologies about the unreadable sample data...
> >>>> So here is the output of the sample data, and hopefully it works this
> >>>> time :)
> >>>> myfile <- structure(list(Proteins = structure(1:4, .Label = c("p1", "p2",
> >>>> "p3", "p4"), class = "factor"), Time_zero = c(0.0050723, 0.0002731,
> >>>> 9.76e-05, 0.0002077), X1 = structure(c(1L, 3L, 1L, 2L), .Label = c("L",
> >>>> "R", "T"), class = "factor"), X2 = structure(c(1L, 1L, 2L, 1L
> >>>> ), .Label = c("E", "M"), class = "factor"), X3 = structure(c(2L,
> >>>> 1L, 2L, 2L), .Label = c("N", "Y"), class = "factor"), X4 = structure(c(1L,
> >>>> 2L, 3L, 2L), .Label = c("I", "L", "Q"), class = "factor"), X5 = structure(c(1L,
> >>>> 2L, 1L, 1L), .Label = c("I", "V"), class = "factor"), X6 = structure(c(1L,
> >>>> 1L, 1L, 2L), .Label = c("P", "S"), class = "factor"), X7 = structure(c(1L,
> >>>> 3L, 2L, 2L), .Label = c("D", "E", "G"), class = "factor"), X8 = structure(c(1L,
> >>>> 1L, 2L, 1L), .Label = c("A", "C"), class = "factor")), .Names = c("Proteins",
> >>>> "Time_zero", "X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"), row.names = c(NA,
> >>>> 4L), class = "data.frame")
> >>>> And here is my original question:
> >>>> Basically, I have a bunch of protein sequences composed of different
> >>>> amino acid residues, and each residue is represented by an uppercase
> >>>> letter. I want to calculate the ratio of different amino acid residues
> >>>> at each position of the proteins.
> >>>>
> >>>> If I name this table as myfile.txt, I have the following scripts to
> >>>> calculate the ratio of each amino acid residue at position 1:
> >>>>
> >>>> # showing levels of the 3rd column, which means the types of residues
> >>>> >myfile[,3]
> >>>>
> >>>> # calculating the ratio of L
> >>>> >list=c(which(myfile[,3]=="L"))
> >>>> >time0total=sum(myfile[,2])
> >>>> >AA_L=0
> >>>> >for (i in 1:length(list)){AA_L=sum(myfile[list[[i]],2]+AA_L)}
> >>>> >ratio_L=AA_L/time0total
> >>>>
> >>>> So how can I write a script to do the same thing for the other two
> >>>> levels (T and R) in column 3, and also do this for every column that
> >>>> contains amino acid residues?
> >>>>
> >>>> Thanks a lot!
> >>>>
> >>>> Regards,
> >>>>
> >>>> Zhao
> >>>> 2012/7/24 John Kane <[1]jrkrid...@inbox.com>
> >>>>
> >>>> First thing is to supply the data in a usable format. As is, it is
> >>>> essentially unreadable. All R-beginners do this. :)
> >>>> Have a look at the dput function (?dput) for a good way to supply
> >>>> sample data in an email.
> >>>> If you have a large dataset probably a few dozen lines of data would
> >>>> be fine.
> >>>> Something like dput(head(mydata)) should be fine. Just copy and paste
> >>>> the output into your email.
> >>>> Welcome to R. I think you will like it.
> >>>> John Kane
> >>>> Kingston ON Canada
> >>>>
> >>>> > -----Original Message-----
> >>>> > From: [2]z...@cornell.edu
> >>>> > Sent: Mon, 23 Jul 2012 18:01:11 -0400
> >>>> > To: [3]r-help@r-project.org
> >>>> > Subject: [R] How to do the same thing for all levels of a column?
> >>>> >
> >>>> > Dear all,
> >>>> >
> >>>> > I am a R beginner, and I am looking for a way to do the same thing
> >>>> > for all levels of a column in a table.
> >>>> >
> >>>> > Basically, I have a bunch of protein sequences composed of different
> >>>> > amino acid residues, and each residue is represented by an uppercase
> >>>> > letter. I want to calculate the ratio of different amino acid
> >>>> > residues at each position of the proteins. Here is an example table:
> >>>> >
> >>>> > Proteins  Time_zero  1  2  3  4  5  6  7  8
> >>>> > p1        0.0050723  L  E  Y  I  I  P  D  A
> >>>> > p2        0.0002731  T  E  N  L  V  P  G  A
> >>>> > p3        9.757E-05  L  M  Y  Q  I  P  E  C
> >>>> > p4        0.0002077  R  E  Y  L  I  S  E  A
> >>>> >
> >>>> > If I name this table as myfile.txt, I have the following scripts to
> >>>> > calculate the ratio of each amino acid residue at position 1:
> >>>> >
> >>>> > # showing levels of the 3rd column, which means the types of residues
> >>>> > >myfile[,3]
> >>>> >
> >>>> > # calculating the ratio of L
> >>>> > >list=c(which(myfile[,3]=="L"))
> >>>> > >time0total=sum(myfile[,2])
> >>>> > >AA_L=0
> >>>> > >for (i in 1:length(list)){AA_L=sum(myfile[list[[i]],2]+AA_L)}
> >>>> > >ratio_L=AA_L/time0total
> >>>> >
> >>>> > So how can I write a script to do the same thing for the other two
> >>>> > levels (T and R) in column 3, and also do this for every column that
> >>>> > contains amino acid residues?
> >>>> >
> >>>> > Many thanks for any help you could give me on this topic! :)
> >>>> >
> >>>> > Regards,
> >>>> >
> >>>> > Zhao
> >>>> > --
> >>>> > Zhao JIN
> >>>> > Ph.D. Candidate
> >>>> > Ruth Ley Lab
> >>>> > 467 Biotech
> >>>> > Field of Microbiology, Cornell University
> >>>> > Lab: 607.255.4954
> >>>> > Cell: 412.889.3675
> >>>> >
> >>>> > [[alternative HTML version deleted]]
> >>>> >
> >>>> > ______________________________________________
> >>>> > [4]R-help@r-project.org mailing list
> >>>> > [5]https://stat.ethz.ch/mailman/listinfo/r-help
> >>>> > PLEASE do read the posting guide
> >>>> > [6]http://www.R-project.org/posting-guide.html
> >>>> > and provide commented, minimal, self-contained, reproducible code.
> >>>>
> >>>> References
> >>>>
> >>>> 1. mailto:jrkrid...@inbox.com
> >>>> 2. mailto:z...@cornell.edu
> >>>> 3. mailto:r-help@r-project.org
> >>>> 4. mailto:R-help@r-project.org
> >>>> 5. https://stat.ethz.ch/mailman/listinfo/r-help
> >>>> 6. http://www.R-project.org/posting-guide.html
> >>>
> >>> --
> >>>
> >>> Bert Gunter
> >>> Genentech Nonclinical Biostatistics
> >>>
> >>> Internal Contact Info:
> >>> Phone: 467-7374
> >>> Website:
> >>> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
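
On the .(variable, value) question raised at the top of the thread: in plyr, .() only quotes the names, and ddply then makes one group for every combination of the listed columns that occurs in the data. After melt(), "variable" holds the position (X1 to X8) and "value" holds the residue letter, so the split is by position and residue together, not by residue alone. The sketch below tries to show the same grouping in base R so that it runs without plyr or reshape2. It is only an illustration under assumptions: the data frame is retyped from the dput() output above with character columns instead of factors, and the names "positions", "long" and "sums" are made up for this example.

```r
# Sample data retyped from the dput() output in the thread
# (assumption: plain character columns instead of factors).
myfile <- data.frame(
  Proteins  = c("p1", "p2", "p3", "p4"),
  Time_zero = c(0.0050723, 0.0002731, 9.76e-05, 0.0002077),
  X1 = c("L", "T", "L", "R"),
  X2 = c("E", "E", "M", "E"),
  X3 = c("Y", "N", "Y", "Y"),
  X4 = c("I", "L", "Q", "L"),
  X5 = c("I", "V", "I", "I"),
  X6 = c("P", "P", "P", "S"),
  X7 = c("D", "G", "E", "E"),
  X8 = c("A", "A", "C", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper names, not from the thread.
positions <- paste0("X", 1:8)

# Long format built by hand: one row per (protein, position),
# which is roughly what melt(mydata, id = "Time_zero") produces.
long <- data.frame(
  variable  = rep(positions, each = nrow(myfile)),
  value     = unlist(myfile[positions], use.names = FALSE),
  Time_zero = rep(myfile$Time_zero, times = length(positions)),
  stringsAsFactors = FALSE
)

# Group by BOTH columns: one row per (position, residue) pair that occurs.
sums <- aggregate(Time_zero ~ variable + value, data = long, FUN = sum)

# Divide each group's Time_zero sum by the grand total, as in John's code.
sums$ratio <- sums$Time_zero / sum(myfile$Time_zero)
sums <- sums[order(sums$variable, sums$value), ]
print(sums)

# Cross-check for position X1 against the tapply() approach from Bert.
x1_check <- prop.table(tapply(myfile$Time_zero, myfile$X1, sum))
print(x1_check)
```

If this is right, the X1/L row should come out at about 0.915, which agrees with the $X1 output Bert posted, and the ratios within each position should add up to 1. Grouping by "value" alone would instead pool a residue across positions (for example L at X1 and L at X4), which is not what the question asks for.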