Ben, I think you would find lists a helpful way to arrange your data. They do not require equal lengths of data in each element. Check out the code below for a smaller version of the example you provided (with only 5 individuals rather than 500).
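First, on the error you report: strsplit() always returns a list, so Ind_Effectors is a list of length 1, and sample() is being asked to draw n items from a population of one. That is what the "larger than the population" message is telling you. Here is a minimal sketch of the fix, assuming the genes for one individual are held as a single comma-separated string. The object gene_string is a made-up stand-in for one cell of your data frame, not your actual data, and the cap on n is my own addition.

```r
# Hypothetical stand-in for one "Effectors" cell: gene values stored as one
# comma-separated string (built here with paste(), purely for illustration).
set.seed(1)
gene_string <- paste(sample(1:10000, 200, replace = TRUE), collapse = ",")

# strsplit() returns a list with one element per input string.
Ind_Effectors <- strsplit(gene_string, ",")
length(Ind_Effectors)        # a list holding a single character vector
length(Ind_Effectors[[1]])   # the genes themselves

# Pull the vector out with [[1]] and convert back to numbers before sampling.
genes <- as.numeric(Ind_Effectors[[1]])

# Number of genes to express, capped so it never exceeds the number available
# (the cap is an assumption, added to avoid the same error for small genomes).
n <- min(round(runif(1, min = 10, max = 50)), length(genes))

# Sample positions rather than values: sample(x, n) treats a single number
# x > 1 as 1:x, so indexing stays correct even for a one-gene individual.
Expressed_Genes <- genes[sample.int(length(genes), n)]
```

Sampling from Ind_Effectors[[1]] (or unlist(Ind_Effectors)) rather than from Ind_Effectors is the essential change. The list layout that follows sidesteps the string round trip altogether.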
# An alternative way to arrange your data, as a list
# Each element of the list is an individual, with all its effector genes
ID.unique <- formatC(0001:0005, width=4, flag=0)
No_of_Effectors <- sample(1:550, length(ID.unique), replace=TRUE)
Effectors <- split(sample(1:10000, sum(No_of_Effectors), replace=TRUE),
    rep(ID.unique, No_of_Effectors))
Effectors

# Now take a random sample of effectors from each individual
Expressed_Genes <- lapply(Effectors, function(x) sample(x, sample(1:length(x), 1)))
Expressed_Genes

Jean


"Benjamin Ward (ENV)" <b.w...@uea.ac.uk> wrote on 11/08/2012 10:00:57 AM:
>
> Hi,
>
> First my apologies for a non-working piece of code in a previous
> submission, I have corrected this error.
>
> What I'm doing is individual-based modelling of a pathogen and its
> host. The way I've thought of doing this is with two dataframes, one
> of the pathogen and its genes and effector genes, and one of the host
> and its resistance genes. During the simulation, these things can
> be pulled out of the dataframes and operated on, before being stored
> again in the dataframes.
>
> Below is how I've created my dataframe and stored my effector genes.
> In this model, effector genes are numerical values between 1 and 10000.
>
> Path_Number <- 0500
> inds <- data.frame(ID=formatC(0001:Path_Number,width=4,flag=0),No_of_Effectors="",No_Expressed_Effectors="")
> inds$No_of_Effectors <- round(as.numeric(lapply(1:nrow(inds),function(x) runif(1, min=1, max=550))))
> Effectors <- lapply(1:nrow(inds),function(x) sample(1:10000,inds$No_of_Effectors,replace=TRUE))
> inds <- data.frame(inds,Effectors=as.character(Effectors))
> Ind_Genes <- strsplit(as.character(inds[1,4]),",")
>
> What I'm trying to do is:
> 1). For each individual (row) in my database, extract the values in
> the "Effectors" cell to an object.
> 2). Sample a number of those values and assign them to a new object
> called "Expressed_Effectors"
> 3). Store it in the Expressed_Effectors cell, in much the same
> manner as I stored the Effectors object in the "Effectors" cell.
>
> My example attempt (for the first row/individual in my dataset) is below:
>
> (step by step, I didn't put this in a loop until I know it works for 1 row)
>
> Extract the values (effector genes) for the first individual, from
> the Effectors Cell in the dataframe, to "Ind_Effectors" object.
> Ind_Effectors <- strsplit(as.character(inds[1,4]),",")
>
> Randomly dictate how many values (effectors) will be sampled
> n<-round(runif(1, min=10, max=50))
>
> Sample n values (effector genes) from "Ind_Effectors", not replacing
> Expressed_Genes <- sample(Ind_Effectors,n,replace=F)
>
> If I run this I receive the error:
> Error in sample(Ind_Effectors, n, replace = F) :
>   cannot take a sample larger than the population when 'replace = FALSE'
>
> What I think this means is rather than picking out n values from the
> whole set of values in "Ind_Effectors" it's trying to sample the
> whole lot n times, which it cannot do because replace=F. This is not
> what I need, what I need is n values sampled from "Ind_Effectors",
> not all values from Ind_Effectors sampled n times.
>
> I hope this clears up the confusion with what I'm trying to do. It
> may very well be I'm not instructing R to sample as I require
> properly. Sadly my previous experience with R amounts to loading in
> dataframes from experiments and doing stat analysis & model fitting,
> not simulations or individual-based models.
>
> Best wishes,
>
> Ben W.
> UEA (ENV) & The Sainsbury Laboratory.
>
> P.S. As an aside I've been thinking about doing this model in an
> alternative way to the one I described in the first bit of my email
> (based on dataframes).
> Instead I would use multi-dimensional ragged array(s):
> The format would be a 2D layout, where every line is an effector
> gene and every column an aspect of the effector gene (value,
> expression state, fitness contribution etc.)
> This 2D layout of rows and columns is then repeated in the 3rd
> dimension (the z of x,y,z) of the array for each individual. It is
> ragged in the sense each individual, each slice through the array in
> the z direction, would have different numbers of rows - different
> numbers of effectors. This may be easier to work on, but I've not
> worked with multidimensional arrays, I'm used to data in dataframes
> (usually from spreadsheets from experiments).
>
> From: Jean V Adams [jvad...@usgs.gov]
> Sent: 08 November 2012 13:35
> To: Benjamin Ward (ENV)
> Cc: r-help@r-project.org
> Subject: RE: [R] sample from list
>
> Ben,
>
> You have still not supplied reproducible code for me (and any other
> r-help reader) to run, which makes it very difficult to help you. I
> can run your first 5 lines of code with no problem.
>
> Path_Number <- 0500
> inds <- data.frame(ID=formatC(0001:Path_Number,width=4,flag=0),No_of_Effectors="",No_Expressed_Effectors="")
> inds$No_of_Effectors <- round(as.numeric(lapply(1:nrow(inds),function(x) runif(1, min=1, max=550))))
> Effectors <- lapply(1:nrow(inds),function(x) sample(1:10000,inds$No_of_Effectors,replace=TRUE))
> inds <- data.frame(inds,Effectors=as.character(Effectors))
>
> But your 6th line of code doesn't work ... there is no object inds2.
>
> Ind_Genes<-strsplit(as.character(inds2[1,4]),",")
>
> If I use code that you provided in your earlier e-mail to create
> inds2, I get errors because inds doesn't have a variable No_of_Genes.
>
> Genes <- lapply(1:nrow(inds),function(x) sample(1:10000,inds$No_of_Genes,replace=TRUE))
> inds2 <- data.frame(inds, Genes=I(Genes))
> inds2$No_Expressed_Genes <- round(as.numeric(lapply(1:nrow(inds2),function(x) runif(1, min=10, max=50))))
>
> So, before you hit the send button on your next e-mail, start a
> clean R session with none of your objects in the working directory
> or the search path, and test your code to see if it runs.
>
> You will find many more willing helpers if you supply reproducible code.
>
> You might want to start with a new posting, too, to give more people
> a fresh look.
>
> Jean
>
>
> "Benjamin Ward (ENV)" <b.w...@uea.ac.uk> wrote on 11/08/2012 05:04:20 AM:
> >
> > Hi,
> >
> > Thanks for the reply.
> >
> > I should explain more, I'll be as brief as I can, the code for
> > generating the dataframe is below.
> >
> > What I'm doing is individual-based modelling of a pathogen and its
> > host. The way I've thought of doing this is with two dataframes, one
> > of the pathogen and its genes and effectors, and one of the host
> > and its resistance genes. During the processes of the model these
> > things can be pulled out of the dataframes and operated on, before
> > being stored again in the dataframes.
> >
> > I have generated my dataset as below, it was suggested by "arun" in
> > a reply to a previous email I wrote with the subject "Trouble with
> > data structures".
> >
> > Path_Number <- 0500 # The number of pathogen individuals in the population.
> > # Create the initial dataframe, with initial number of effectors and
> > # initial number of expressed effectors.
> > inds <- data.frame(ID=formatC(0001:Path_Number,width=4,flag=0),No_of_Effectors="",No_Expressed_Effectors="")
> > # Generate the number of effector genes each individual has.
> > inds$No_of_Effectors <- round(as.numeric(lapply(1:nrow(inds),function(x) runif(1, min=1, max=550))))
> > # Generate the actual effector genes.
> > Effectors <- lapply(1:nrow(inds),function(x) sample(1:10000,inds$No_of_Effectors,replace=TRUE))
> > # Add them to the dataframe
> > inds <- data.frame(inds,Effectors=as.character(Effectors))
> >
> > What I'm trying to do is for each individual, extract the values in
> > the Effector genes cell to an object.
> > As far as I can tell,
> >
> > Ind_Genes<-strsplit(as.character(inds2[1,4]),",")
> >
> > Will do this for the first individual or I can get all of them with
> >
> > All_Genes<-strsplit(as.character(inds2[,4]),",")
> >
> > What I then want to do is according to a generated number for each
> > individual...
> >
> > round(as.numeric(lapply(1:nrow(inds2),function(x) runif(1, min=10,max=50))))
> >
> > ... sample that many genes from Ind_Genes and make a new object
> > called Expressed_Genes, which can be stored in the dataframe. My
> > attempt at doing this is:
> >
> > Expressed_Genes<-lapply(First_Ind_Genes,function(x) sample(First_Ind_Genes,round(runif(1, min=10, max=50)),replace=F))
> >
> > to get Expressed genes for each individual, this might be part of a
> > for loop, or to the whole list of every individuals genes like so:
> >
> > Expressed_Genes<-lapply(All_Genes,function(x) sample(All_Genes,3,replace=F))
> >
> > What usually happens however is I get errors:
> > Error in sample(First_Ind_Genes, round(runif(1, min = 10, max = 50)), :
> >   cannot take a sample larger than the population when 'replace = FALSE'
> >
> > or it will rather than sample 3 values, sample all the values, 3
> > times if I allow replacement (which I don't want).
> >
> > So it's not sampling 3 values for me, but the whole lot of values 3 times.
> >
> > I do not know of another way to extract these gene values and then
> > do things with them.
> > For my model it is essential I can pull the genes or expressed genes
> > out of the dataframe, work functions or operations on them and then
> > store them back again. For example if an individual turns a gene on
> > that was not before, then the genes would need to be pulled from the
> > database, as would the expressed genes, and a random value from the
> > genes object added to the expressed genes object, and then they
> > could both be put back. A similar thing would happen when I wanted
> > to mutate the genes.
> >
> > In short my aim is to pull genes or expressed genes out, work
> > functions or operations on them and then store them back again.
> >
> > Hopefully I've explained better, I have been thinking of changing my
> > approach from datasets of pathogen and host, from which values are
> > pulled to objects and operated on, to multi-dimensional ragged
> > arrays. I've been told this may be simpler for me.
> >
> > Where every line is an effector gene and there can be columns for
> > the gene value, expression state (1 or 0/T or F), fitness
> > contribution etc. This 2D layout of rows and columns is then
> > repeated in the z dimension of the array for each individual. It is
> > ragged in the sense each individual, each slice through the array in
> > the z direction, would have different numbers of rows - different
> > numbers of effectors. I can then simulate mutations by changing the
> > gene values, cause duplications by adding rows of duplicated genes,
> > or even cause deletions by removing rows.
> > Once I have this set up for the pathogen I may make a similar array
> > for the host plants, then perhaps with indexing or some such thing I
> > can write functions to do the interactions and immunology and such.
> >
> > Best,
> >
> > Ben W.
> >
> > UEA (ENV) & The Sainsbury Laboratory.
> >
> > From: Jean V Adams [jvad...@usgs.gov]
> > Sent: 07 November 2012 21:12
> > To: Benjamin Ward (ENV)
> > Cc: r-help@r-project.org
> > Subject: Re: [R] sample from list
> >
> > Ben,
> >
> > Can you provide a small example data set for
> >         inds
> > so that we can run the code you have supplied?
> > It's difficult for me to follow what you've got and where you're
> > trying to go.
> >
> > Jean
> >
> >
> > "Benjamin Ward (ENV)" <b.w...@uea.ac.uk> wrote on 11/06/2012 03:29:52 PM:
> > >
> > > Hi all,
> > >
> > > I have a list of genes present in 500 individuals, the individuals
> > > are the elements:
> > > Genes <- lapply(1:nrow(inds),function(x) sample(1:10000,inds$No_of_Genes,replace=TRUE))
> > >
> > > (This was later written to a dataframe as well as kept as the list
> > > object: inds2 <- data.frame(inds,Genes=I(Genes)))
> > >
> > > I also have a vector of how many of those genes are expressed in
> > > the individuals, this can also be kept as a vector object or
> > > written to a data frame:
> > >
> > > inds2$No_Expressed_Genes <- round(as.numeric(lapply(1:nrow(inds2),function(x) runif(1, min=10, max=50))))
> > >
> > > I want to create another list which consists of each individual's
> > > expressed genes - essentially a subset of the total genes the
> > > individuals have in the "Genes" list, by sampling from the Genes
> > > list for each individual, the number of genes (values) in the
> > > Num_Expressed_Genes vector. i.e. if Num_Expressed_Genes = 3 then
> > > sample 3 values from the element in the Genes list. I can't quite
> > > figure it out though. So far I have the following:
> > >
> > > #Defines The number of expressed genes for each individual in my
> > > # data frame.
> > > Num_Expressed_Genes <- round(as.numeric(lapply(1:nrow(inds2),function(x) runif(1, min=10, max=50))))
> > >
> > >
> > > #My attempts to apply the sample function to every element
> > > # (individual organism) of the "Genes" list, to subset the genes expressed.
> > > Expressed_Genes <- lapply(1:nrow(inds),function(x) sample(Genes,Num_Expressed_Genes, replace=FALSE))
> > > Expressed_Genes <- lapply(Genes,function(x) sample(Genes,Num_Expressed_Genes, replace=FALSE))
> > >
> > > So far though I'm getting results like this:
> > >
> > > [[49]]
> > > [[49]][[1]]
> > > [1] 3540 27 5344 7278 9758 8077 ............................... [217]
> > >
> > >
> > > [[49]][[2]]
> > > [1] 740 3362 8588 8574 4371 1447 .............................. [340]
> > >
> > >
> > > When what I need is more:
> > >
> > > [[49]]
> > > [1] 6070 1106 6275
> > > In a case where Num_Expressed_Genes = 3 and the values are taken
> > > from the much larger set of values for element (individual) 49 in my
> > > Genes list.
> > >
> > > I'm not sure what I'm doing wrong but it seems what is happening is
> > > instead of picking out a few values according to the
> > > Num_Expressed_Genes vector - as an example say 3 again, it's drawing
> > > a large number of values, if not all of them, from elements in the
> > > list, 3 times.
> > >
> > > Any help is greatly appreciated,
> > > I've thought of using loops to achieve the same task, but I'm trying
> > > to get my individual/genes/expressed genes data.frame set up for my
> > > individual-based model and get it running using vectors and as
> > > few loops as possible.
> > >
> > > Thanks,
> > > Ben.

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.