Re: [R] Extract cell of many values from dataframe cells and sample from them.

Jean V Adams Thu, 08 Nov 2012 12:01:24 -0800

Ben,

I think you would find lists a helpful way to arrange your data.  They do 
not require equal lengths of data in each element.  Check out the code 
below for a smaller version of the example you provided (with only 5 
individuals rather than 500).


# An alternative way to arrange your data, as a list
# Each element of the list is an individual, with all its effector genes
ID.unique <- formatC(0001:0005, width=4, flag="0")
No_of_Effectors <- sample(1:550, length(ID.unique), replace=TRUE)
Effectors <- split(sample(1:10000, sum(No_of_Effectors), replace=TRUE),
    rep(ID.unique, No_of_Effectors))
Effectors

# Now take a random sample of effectors from each individual
# (sampling positions avoids sample() treating a single gene value x as 1:x)
Expressed_Genes <- lapply(Effectors, function(x)
    x[sample.int(length(x), sample.int(length(x), 1))])
Expressed_Genes
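
If the results do need to live in a data frame afterwards, one option is a
list column built with I(). This is only a sketch, not something taken from
your code: each cell then holds a numeric vector, so nothing has to be pasted
into a string and split apart again. The column names Expressed_Effectors and
No_Expressed are just illustrative choices.

```r
set.seed(1)  # only so the sketch is repeatable
ID.unique <- formatC(1:5, width=4, flag="0")
No_of_Effectors <- sample(1:550, length(ID.unique), replace=TRUE)
Effectors <- split(sample(1:10000, sum(No_of_Effectors), replace=TRUE),
    rep(ID.unique, No_of_Effectors))
Expressed_Genes <- lapply(Effectors, function(x)
    x[sample.int(length(x), sample.int(length(x), 1))])

# One row per individual; I() keeps each list element as a single cell
inds <- data.frame(ID=names(Effectors), No_of_Effectors=lengths(Effectors))
inds$Effectors <- I(Effectors)
inds$Expressed_Effectors <- I(Expressed_Genes)
inds$No_Expressed <- lengths(Expressed_Genes)

# Pull one individual's genes back out as a plain numeric vector
inds$Effectors[[1]]
```

Assigning a modified vector back with inds$Effectors[[1]] <- new_values
should then complete the pull out / operate / store cycle.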

Jean



"Benjamin Ward (ENV)" <b.w...@uea.ac.uk> wrote on 11/08/2012 10:00:57 AM:
> 
> Hi,
> 
> First, my apologies for the non-working piece of code in a previous 
> submission; I have corrected this error.
> 
> What I'm doing is individual-based modelling of a pathogen and its host.
> The way I've thought of doing this is with two dataframes, one of 
> the pathogen and its genes and effector genes, and one of the host 
> and its resistance genes. During the simulation, these things can 
> be pulled out of the dataframes and operated on, before being stored
> again in the dataframes. 
> 
> Below is how I've created my dataframe and stored my effector genes.
> In this model, effector genes are numerical values between 1 and 10000.
> 
> Path_Number <- 0500 
> inds <- data.frame(ID=formatC(0001:Path_Number,width=4,flag=0),
>     No_of_Effectors="",No_Expressed_Effectors="")
> inds$No_of_Effectors <- round(as.numeric(lapply(1:nrow(inds),
>     function(x) runif(1, min=1, max=550))))
> Effectors <- lapply(1:nrow(inds),function(x) sample(1:10000,
>     inds$No_of_Effectors,replace=TRUE))
> inds <- data.frame(inds,Effectors=as.character(Effectors)) 
> Ind_Genes <- strsplit(as.character(inds[1,4]),",")
> 
> What I'm trying to do is: 
> 1). For each individual (row) in my dataframe, extract the values in 
> the "Effectors" cell to an object. 
> 2). Sample a number of those values and assign them to a new object 
> called "Expressed_Effectors".
> 3). Store it in the Expressed_Effectors cell, in much the same 
> manner as I stored the Effectors object in the "Effectors" cell.
> 
> My example attempt (for the first row/individual in my dataset) is below:
> 
> (step by step, I didn't put this in a loop until I know it works for 1 row)
> 
> Extract the values (effector genes) for the first individual, from 
> the Effectors Cell in the dataframe, to "Ind_Effectors" object.
> Ind_Effectors <- strsplit(as.character(inds[1,4]),",")
> 
> Randomly dictate how many values (effectors) will be sampled 
> n<-round(runif(1, min=10, max=50))
> 
> Sample n values (effector genes) from "Ind_Effectors", not replacing
> Expressed_Genes <- sample(Ind_Effectors,n,replace=F)
> 
> If I run this I receive the error:
> Error in sample(Ind_Effectors, n, replace = F) : 
>   cannot take a sample larger than the population when 'replace = FALSE'
> 
> What I think this means is rather than picking out n values from the
> whole set of values in "Ind_Effectors" it's trying to sample the 
> whole lot n times, which it cannot do because replace=F. This is not
> what I need, what I need is n values sampled from "Ind_Effectors", 
> not all values from Ind_Effectors sampled n times.
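
A likely explanation for this error, offered as a reading of the code rather
than a certainty: strsplit() returns a list with one element per input
string, so Ind_Effectors is a list of length 1 and sample() sees a population
of size 1. In addition, as.character() applied to a list deparses each
element, so the cell probably holds text along the lines of "c(3540, 27, ...)"
rather than bare numbers. The sketch below uses a made-up string, cell, in
place of inds[1,4], and a smaller range for n than in your code.

```r
# Made-up stand-in for the contents of inds[1,4]
cell <- "c(3540, 27, 5344, 7278, 9758, 8077, 740, 3362, 8588, 8574, 4371, 1447)"

pieces <- strsplit(cell, ",")
class(pieces)    # a list ...
length(pieces)   # ... of length 1, so sample(pieces, n) fails for any n > 1

# Keep only the runs of digits, then take the first (only) list element
genes <- as.numeric(regmatches(cell, gregexpr("[0-9]+", cell))[[1]])
length(genes)

# n must not exceed the number of genes when replace=FALSE
n <- min(length(genes), round(runif(1, min=3, max=8)))
Expressed_Genes <- genes[sample.int(length(genes), n)]
Expressed_Genes
```

Capping n at length(genes) matters in the real model too, since an
individual can have fewer than 10 effectors while n is drawn from 10 to 50.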
> 
> I hope this clears up the confusion with what I'm trying to do. It 
> may very well be I'm not instructing R to sample as I require 
> properly. Sadly my previous experience with R amounts to loading in 
> dataframes from experiments and doing stat analysis & model fitting, 
> not simulations or individual-based models.
> 
> Best wishes,
> 
> Ben W.
> UEA (ENV) & The Sainsbury Laboratory.
> 
> P.S. As an aside, I've been thinking about building this model in a 
> different way from the one I described in the first part of my email 
> (based on dataframes).
> Instead I would use multi-dimensional ragged array(s):
> The format would be a 2D layout, Where every line is an effector 
> gene and every column an aspect of the effector gene(value, 
> expression state, fitness contribution etc.) This 2D layout of rows 
> and columns is then repeated in the 3rd dimension (the z of x,y,z) 
> of the array for each individual. It is ragged in the sense each 
> individual, each slice through the array in the z direction, would 
> have different numbers of rows - different numbers of effectors. 
> This may be easier to work on, but I've not worked with 
> multidimensional arrays, I'm used to data in dataframes (usually 
> from spreadsheets from experiments). 
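
On the ragged array idea: R arrays have to be rectangular, so a truly ragged
3D array is not available, but a list of data frames (one per individual)
behaves much like the layout described. The following is only an
illustration; the helper make_individual and the column names value,
expressed and fitness are invented for the sketch.

```r
set.seed(42)  # only so the sketch is repeatable
ID.unique <- formatC(1:5, width=4, flag="0")

# One data frame per individual; rows are effector genes, columns are
# aspects of each gene (helper and column names are invented here)
make_individual <- function(n) {
    data.frame(value=sample(1:10000, n, replace=TRUE),
        expressed=sample(c(TRUE, FALSE), n, replace=TRUE),
        fitness=runif(n))
}
pathogens <- lapply(sample(1:550, length(ID.unique), replace=TRUE),
    make_individual)
names(pathogens) <- ID.unique

sapply(pathogens, nrow)      # different numbers of rows per individual
head(pathogens[["0003"]])    # first few genes of individual 0003
```

Each "slice" is then an ordinary data frame, so the usual row and column
indexing applies within an individual.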
> 
> From: Jean V Adams [jvad...@usgs.gov]
> Sent: 08 November 2012 13:35
> To: Benjamin Ward (ENV)
> Cc: r-help@r-project.org
> Subject: RE: [R] sample from list

> Ben, 
> 
> You have still not supplied reproducible code for me (and any other 
> r-help reader) to run, which makes it very difficult to help you.  I
> can run your first 5 lines of code with no problem. 
> 
> Path_Number <- 0500 
> inds <- data.frame(ID=formatC(0001:Path_Number,width=4,flag=0),
>     No_of_Effectors="",No_Expressed_Effectors="")
> inds$No_of_Effectors <- round(as.numeric(lapply(1:nrow(inds),
>     function(x) runif(1, min=1, max=550))))
> Effectors <- lapply(1:nrow(inds),function(x) sample(1:10000,
>     inds$No_of_Effectors,replace=TRUE))
> inds <- data.frame(inds,Effectors=as.character(Effectors)) 
> 
> But your 6th line of code doesn't work ... there is no object inds2. 
> 
> Ind_Genes<-strsplit(as.character(inds2[1,4]),",") 
> 
> If I use code that you provided in your earlier e-mail to create 
> inds2, I get errors because inds doesn't have a variable No_of_Genes. 
> 
> Genes <- lapply(1:nrow(inds),function(x) sample(1:10000,
>     inds$No_of_Genes,replace=TRUE))
> inds2 <- data.frame(inds, Genes=I(Genes)) 
> inds2$No_Expressed_Genes <- round(as.numeric(lapply(1:nrow(inds2),
>     function(x) runif(1, min=10, max=50))))
> 
> So, before you hit the send button on your next e-mail, start a 
> clean R session with none of your objects in the workspace 
> or the search path, and test your code to see if it runs. 
> 
> You will find many more willing helpers if you supply reproducible code. 

> 
> You might want to start with a new posting, too, to give more people
> a fresh look. 
> 
> Jean 
> 
> 
> 
> "Benjamin Ward (ENV)" <b.w...@uea.ac.uk> wrote on 11/08/2012 05:04:20 AM:
> > 
> > Hi, 
> > 
> > Thanks, for the reply. 
> > 
> > I should explain more, I'll be as brief as I can, the code for 
> > generating the dataframe is below. 
> > 
> > What I'm doing is individual-based modelling of a pathogen and its 
> > host. The way I've thought of doing this is with two dataframes, one
> > of the pathogen and its genes and effectors, and one of the host 
> > and its resistance genes. During the processes of the model these 
> > things can be pulled out of the dataframes and operated on, before 
> > being stored again in the dataframes. 
> > 
> > I have generated my dataset as below, it was suggested by "arun" in 
> > a reply to a previous email I wrote with the subject "Trouble with 
> > data structures". 
> > 
> > Path_Number <- 0500 # The number of pathogen individuals in the population.
> > # Create the initial dataframe, with initial number of effectors and
> > # initial number of expressed effectors. 
> > inds <- data.frame(ID=formatC(0001:Path_Number,width=4,flag=0),
> >     No_of_Effectors="",No_Expressed_Effectors="")
> > # Generate the number of effector genes each individual has. 
> > inds$No_of_Effectors <- round(as.numeric(lapply(1:nrow(inds),
> >     function(x) runif(1, min=1, max=550))))
> > # Generate the actual effector genes. 
> > Effectors <- lapply(1:nrow(inds),function(x) sample(1:10000,
> >     inds$No_of_Effectors,replace=TRUE))
> > # Add them to the dataframe 
> > inds <- data.frame(inds,Effectors=as.character(Effectors)) 
> > 
> > What I'm trying to do is for each individual, extract the values in 
> > the Effector genes cell to an object. As far as I can tell, 
> > 
> > Ind_Genes<-strsplit(as.character(inds2[1,4]),",") 
> > 
> > Will do this for the first individual or I can get all of them with 
> > 
> > All_Genes<-strsplit(as.character(inds2[,4]),",") 
> > 
> > What I then want to do is according to a generated number for each 
> > individual... 
> > 
> > round(as.numeric(lapply(1:nrow(inds2),function(x) runif(1, min=10,max=50))))
> > 
> > ... sample that many genes from Ind_Genes and make a new object 
> > called Expressed_Genes, which can be stored in the dataframe. My 
> > attempt at doing this is: 
> > 
> > Expressed_Genes<-lapply(First_Ind_Genes,function(x) sample(First_Ind_Genes,
> >     round(runif(1, min=10, max=50)),replace=F))
> > 
> > to get Expressed genes for each individual, this might be part of a 
> > for loop, or to the whole list of every individuals genes like so: 
> > 
> > Expressed_Genes<-lapply(All_Genes,function(x) sample(All_Genes,3,replace=F))
> > 
> > What usually happens however is I get errors: 
> > Error in sample(First_Ind_Genes, round(runif(1, min = 10, max = 50)),  : 
> >   cannot take a sample larger than the population when 'replace = FALSE' 
> > 
> > or it will rather than sample 3 values, sample all the values, 3 
> > times if I allow replacement (which I don't want). 
> > 
> > So it's not sampling 3 values for me, but the whole lot of values 3 times. 
> > 
> > I do not know of another way to extract these gene values and then 
> > do things with them. 
> > For my model it is essential I can pull the genes or expressed genes
> > out of the dataframe, work functions or operations on them and then 
> > store them back again. For example if an individual turns a gene on 
> > that was not before, then the genes would need to be pulled from the
> > database, as would the expressed genes, and a random value from the 
> > genes object added to the expressed genes object, and then they 
> > could both be put back. A similar thing would happen when I wanted 
> > to mutate the genes. 
> > 
> > In short my aim is pull genes or expressed genes out, work functions
> > or operations on them and then store them back again. 
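
With the genes held as plain vectors in lists, this pull out / operate /
store back cycle becomes an ordinary assignment. A rough sketch of turning
one gene on, using toy data and an invented helper, turn_on_gene; it tracks
expressed genes by position rather than by value, which is an assumption made
here so that duplicated gene values are handled sensibly.

```r
set.seed(7)  # only so the sketch is repeatable
# Toy data: two individuals with their gene values
Genes <- list("0001"=c(3540, 27, 5344, 7278, 9758), "0002"=c(740, 3362, 8588))
# Positions (indices into Genes) of the genes currently expressed
Expressed_Idx <- list("0001"=c(1, 4), "0002"=c(2))

# Hypothetical helper: switch on one gene that is not yet expressed
turn_on_gene <- function(n_genes, expressed_idx) {
    off <- setdiff(seq_len(n_genes), expressed_idx)
    if (length(off) == 0) return(expressed_idx)   # everything already on
    c(expressed_idx, off[sample.int(length(off), 1)])
}

# Pull out, operate, store back for individual 0001
Expressed_Idx[["0001"]] <- turn_on_gene(length(Genes[["0001"]]),
    Expressed_Idx[["0001"]])
Genes[["0001"]][Expressed_Idx[["0001"]]]   # the expressed gene values
```

A mutation would follow the same pattern: replace one element of
Genes[["0001"]] and assign the vector back.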
> > 
> > Hopefully I've explained this better. I have been thinking of changing my
> > approach from datasets of pathogen and host, from which values are 
> > pulled into objects and operated on, to multi-dimensional ragged 
> > arrays. I've been told this may be simpler for me. 
> > 
> > Where every line is an effector gene and there can be columns for 
> > the gene value, expression state (1 or 0/T or F), fitness 
> > contribution etc. This 2D layout of rows and columns is then 
> > repeated in the z dimension of the array for each individual. It is 
> > ragged in the sense each individual, each slice through the array in
> > the z direction, would have different numbers of rows - different 
> > numbers of effectors. I can then simulate mutations by changing the 
> > gene values, cause duplications by adding rows of duplicated genes, 
> > or even cause deletions by removing rows. 
> > Once I have this set up for the pathogen I may make a similar array 
> > for the host plants, then perhaps with indexing or some such thing I
> > can write functions to do the interactions and immunology and such. 
> > 
> > Best, 
> > 
> > Ben W. 
> > 
> > UEA (ENV) & The Sainsbury Laboratory. 
> > 
> > From: Jean V Adams [jvad...@usgs.gov]
> > Sent: 07 November 2012 21:12
> > To: Benjamin Ward (ENV)
> > Cc: r-help@r-project.org
> > Subject: Re: [R] sample from list
> 
> > Ben, 
> > 
> > Can you provide a small example data set for 
> >         inds 
> > so that we can run the code you have supplied? 
> > It's difficult for me to follow what you've got and where you're trying to go.
> > 
> > Jean 
> > 
> > 
> > 
> > "Benjamin Ward (ENV)" <b.w...@uea.ac.uk> wrote on 11/06/2012 03:29:52 PM:
> > > 
> > > Hi all,
> > > 
> > > I have a list of genes present in 500 individuals, the individuals 
> > > are the elements:
> > > Genes <- lapply(1:nrow(inds),function(x) sample(1:10000,
> > >     inds$No_of_Genes,replace=TRUE))
> > > 
> > > (This was later written to a dataframe as well as kept as the list 
> > > object: inds2 <- data.frame(inds,Genes=I(Genes)))
> > > 
> > > I also have a vector of how many of those genes are expressed in 
> > > the individuals; this can also be kept as a vector object or written to
> > > a data frame:
> > > 
> > > inds2$No_Expressed_Genes <- round(as.numeric(lapply(1:nrow(inds2),
> > >     function(x) runif(1, min=10, max=50))))
> > > 
> > > I want to create another list which consists of each individual's 
> > > expressed genes - essentially a subset of the total genes the 
> > > individuals have in the "Genes" list, by sampling from the Genes 
> > > list for each individual, the number of genes (values) in the 
> > > Num_Expressed_Genes vector, i.e. if Num_Expressed_Genes = 3 then 
> > > sample 3 values from the element in the Genes list. I can't quite 
> > > figure it out though. So far I have the following:
> > > 
> > > # Defines the number of expressed genes for each individual in my data frame.
> > > Num_Expressed_Genes <- round(as.numeric(lapply(1:nrow(inds2),
> > >     function(x) runif(1, min=10, max=50))))
> > > 
> > > 
> > > # My attempts to apply the sample function to every element
> > > # (individual organism) of the "Genes" list, to subset the genes expressed.
> > > Expressed_Genes <- lapply(1:nrow(inds),function(x) sample(Genes,
> > >     Num_Expressed_Genes, replace=FALSE))
> > > Expressed_Genes <- lapply(Genes,function(x) sample(Genes,
> > >     Num_Expressed_Genes, replace=FALSE))
> > > 
> > > So far though I'm getting results like this:
> > > 
> > > [[49]]
> > > [[49]][[1]]
> > >   [1] 3540   27 5344 7278 9758 8077 ............................... [217]
> > > 
> > > 
> > > [[49]][[2]]
> > >   [1]  740 3362 8588 8574 4371 1447 .............................. [340]
> > > 
> > > 
> > > When what I need is more:
> > > 
> > > [[49]]
> > > [1] 6070 1106 6275
> > > In a case where Num_Expressed_Genes = 3 and the values are taken 
> > > from the much larger set of values for element (individual) 49 in my
> > > Genes list.
> > > 
> > > I'm not sure what I'm doing wrong but it seems what is happening is 
> > > instead of picking out a few values according to the 
> > > Num_Expressed_Genes vector - as an example say 3 again, It's drawing
> > > a large number of values, if not all of them, from elements in the 
> > > list, 3 times.
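
What seems to go wrong in both attempts is that sample() is handed the whole
Genes list rather than the single element x, and the whole
Num_Expressed_Genes vector rather than one count, so whole elements of the
list are drawn instead of values within one element. One way to pair each
individual's genes with that individual's own count is mapply(). A small
sketch with toy data standing in for the real objects:

```r
set.seed(3)  # only so the sketch is repeatable
# Toy stand-ins for the real objects: 4 individuals with 20 to 60 genes each
Genes <- lapply(sample(20:60, 4, replace=TRUE),
    function(n) sample(1:10000, n, replace=TRUE))
Num_Expressed_Genes <- round(runif(length(Genes), min=10, max=50))

# Pair each individual's genes (g) with that individual's count (n);
# min() guards against asking for more genes than the individual has
Expressed_Genes <- mapply(
    function(g, n) g[sample.int(length(g), min(n, length(g)))],
    Genes, Num_Expressed_Genes, SIMPLIFY=FALSE)

lengths(Expressed_Genes)
```

SIMPLIFY=FALSE keeps the result as a list even if every individual happens
to get the same number of expressed genes.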
> > > 
> > > Any help is greatly appreciated,
> > > I've thought of using loops to achieve the same task, but I'm trying
> > > to get my individual/genes/expressed genes data.frame set up for my 
> > > individual-based model and get it running using vectors and as 
> > > few loops as possible.
> > > 
> > > Thanks,
> > > Ben.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
