Re: [R] improving/speeding up a very large, slow simulation

Benjamin Caldwell Tue, 12 Feb 2013 12:40:13 -0800

Anindya,

Completely understand why you wouldn't want to look into the code. I don't
understand your suggestion well enough to implement it, I lost you at
"unlist your function variable . . ."; I don't understand why one would
want to do that. Are there examples you'd suggest? If not I'll take a look
at the help files for mclapply.


Best

*Ben Caldwell*

Graduate Fellow
University of California, Berkeley
130 Mulford Hall #3114
Berkeley, CA 94720
Office 223 Mulford Hall
(510)859-3358


On Mon, Feb 11, 2013 at 10:01 PM, Anindya Sankar Dey <anindy...@gmail.com>wrote:

> Hi,
>
> I've just read your intro, not the full code. But one way first divide
> your input data into parts and make a list where each element of the list
> is one part of your input data. Rest of code put in a function, which will
> work on a part of your data. Build the function such that it takes list as
> input. At the beginning of the function unlist your function variable, and
> then convert to required data frame. Then use mclapply of
> multicore/parallel package and you'll get results much faster.
>
> Well this is the way I'd have tried, but there are other experts who can
> give a better way to do it.
>
> Regards,
> Anindya
>
> On Tue, Feb 12, 2013 at 10:52 AM, Benjamin Caldwell <
> btcaldw...@berkeley.edu> wrote:
>
>> Dear R help;
>>
>> I'll preface this by saying that the example I've provided below is pretty
>> long, turgid, and otherwise a deep dive into a series of functions I wrote
>> for a simulation study. It is, however, reproducible and self-contained.
>>
>> I'm trying to do my first simulation study that's quite big, and so I'll
>> say that the output of this simulation as I'd like it to be is 9.375
>> million rows (9375 combinations of the inputs * 1000). So, first thing,
>> maybe I should try to cut down on the number of dimensions and do it a
>> piece at a time. If that's a conclusion people come to here, I'll look
>> into
>> paring down/doing it a chunk at a time.
>>
>> I'm hoping, though, that this is an opportunity for me to learn some more
>> efficient, elegant ways to a. vectorize and b. use the apply functions,
>> which still seem to elude me when it comes to their use with more complex,
>> custom functions that have multiple inputs.
>>
>> If having looked the below over you can give me suggestions to help make
>> this and future code I write run more quickly/efficiently, I will be
>> most grateful, as as this is currently written it would take what looks
>> like days to run a thousand simulations of each possible combination of
>> variables of interest.
>>
>> Best
>>
>> Ben Caldwell
>>
>> -----------------------------------------------
>> #unpaired
>> verification.plots<-rnorm(30, 100, 10)
>> # writeClipboard(as.character(verification.plots))
>>
>> unpaired.test<- function(verification.plots, project.n, project.mean,
>> project.sd, allowed.deviation, project.acres, alpha=.05){
>>
>> verification.plots <-as.numeric(as.character(verification.plots))
>> a <- qnorm(alpha/2)
>> d <- alpha*project.mean
>>
>> # verifier plot number
>>
>> n<-length(verification.plots)
>> verifier.plot.number <- c(1:n)
>>
>>
>> #all plots (verifier and project)
>>
>> all.plots.n <- rep(0,n)
>> for(i in 1:n){
>> all.plots.n[i] <- project.n + verifier.plot.number[i]
>> }
>>
>> #running mean
>> X_bar <- rep(1,n)
>>
>> for(i in 1:n){
>> X_bar[i]<- mean(verification.plots[1:i])
>> }
>>  #running sd
>> sd2 <- NULL
>>
>> for(i in 2:n){
>> sd2[i]<-sd(verification.plots[1:i])
>>  }
>> #Tn
>> Tn<-NULL
>>
>> for(i in 2:n){
>> Tn[i] <- project.mean-X_bar[i]
>>  }
>> #n_Threshold
>> n_thresh<-NULL
>>
>> for(i in 2:n){
>> n_thresh[i] <- (((a^2)/(d^2))*((project.sd^2)+(sd2[i]^2)))
>>  }
>> #Upper
>> upper <- NULL
>> for (i in 2:n){
>> upper[i] <- Tn[i]-d
>> }
>> #Lower
>> lower<- NULL
>> for(i in 2:n){
>> lower[i] <- Tn[i]+d
>> }
>>
>> #Success.fail
>> success.fail <- NULL
>>
>>
>>
>> success.fail.num <- rep(0,n)
>>
>> if(abs(verification.plots[1]-project.mean)<(allowed.deviation*project.mean)){
>>  success.fail[1] <- "Pass"
>> success.fail.num[1] <- 1
>> } else {
>>  success.fail[1] <- "Fail"
>> success.fail.num[1] <-0
>> }
>>  for(i in 2:n){
>> if(all.plots.n[i]>=n_thresh[i]){
>> if(upper[i] <= 0){
>>  if(lower[i] >= 0){
>> success.fail[i]<-"Pass"
>> success.fail.num[i]<-1
>>  } else {
>> success.fail[i] <- "Fail"
>> success.fail.num[i] <-0}
>>  } else {
>> success.fail[i] <- "Fail"
>> success.fail.num[i] <- 0
>>  }
>>
>> } else {
>> success.fail[i] <- "Inconclusive"
>>  success.fail.num[i] <- 0
>> }
>> }
>>
>> out.unpaired.test <- data.frame(success.fail, success.fail.num)
>> }
>>
>>
>> out.unpaired.test<-unpaired.test(verification.plots=verification.plots,
>> project.n=30, project.mean=100, project.sd=10, allowed.deviation=.05,
>> project.acres=99)
>> out.unpaired.test
>> #
>> sequential.unpaired<- function(number.strata, project.n, project.mean,
>> project.sd, verification.plots,allowed.deviation, project.acres,
>> min.plots="default", alpha=.05){
>>
>> if(min.plots=="default"){
>> min.plots<-NULL # minimum passing
>>  if(number.strata == 1 & project.acres <= 100){
>>     min.plots = 8
>> } else if(number.strata == 1 & project.acres > 100 & project.acres < 500){
>> min.plots = 12
>>  } else if(number.strata == 1 & project.acres > 500 & project.acres <
>> 5000){
>>     min.plots = 16
>>     } else if(number.strata == 1 & project.acres > 5000 & project.acres <
>> 10000){
>>     min.plots = 20
>> } else if(number.strata == 1 & project.acres > 10000){
>>     min.plots = 24
>>     } else if(number.strata == 2 & project.acres < 100){
>>     min.plots = 4
>> } else if(number.strata == 2 & project.acres > 100 & project.acres < 500){
>>     min.plots = 6
>> } else if(number.strata == 2 & project.acres > 501 & project.acres <
>> 5000){
>>     min.plots = 8
>> } else if(number.strata == 2 & project.acres > 5001 & project.acres <
>> 10000){
>>     min.plots = 10
>>     } else if(number.strata == 2 & project.acres > 10000){
>>     min.plots = 12
>> } else if(number.strata == 3 & project.acres < 100){
>>     min.plots = 2
>>     } else if(number.strata == 3 & project.acres > 100 & project.acres <
>> 500){
>>  min.plots = 3
>> } else if(number.strata == 3 & project.acres > 501 & project.acres <
>> 5000){
>>  min.plots = 4
>> } else if(number.strata == 3 & project.acres > 5001 & project.acres <
>> 10000){
>> min.plots = 5
>>  } else if(number.strata == 3 & project.acres > 10000){
>> min.plots = 6
>> } else {min.plots=NULL}
>> } else (min.plots = min.plots)
>>
>>
>> if(number.strata > 3){print("max three strata")}
>> if(number.strata > 3){break}
>>
>>
>> p <- length(verification.plots)
>>  mp <- min.plots
>> run <- unpaired.test(verification.plots, project.n, project.mean,
>> project.sd,
>> allowed.deviation, project.acres, alpha=.05)
>>  number.passing.plots <- function(verification.plots, success.fail.num){
>>  n<-length(verification.plots)
>>  number.passing<-rep(0,n)
>> number.passing[1]<-ifelse(sum(success.fail.num[1]> mp),1,0)
>> for(i in 2:n){
>>  if(success.fail.num[i] == 1){
>> v <- i-1
>> number.passing[i]<-number.passing[v]+1
>>  } else{number.passing[i] <- 0}
>> }
>> number.passing
>>  }
>>
>> number.passing<-number.passing.plots(verification.plots=verification.plots,
>> success.fail.num=run$success.fail.num)
>>  sucessful.unsucessful<-rep("blank",p)
>> one.zero <- rep(0,p)
>> result <- "blank"
>> resultL <- 0
>> n <-length(verification.plots)
>> for(i in 1:n){
>> if(number.passing[i] >= mp) {
>> sucessful.unsucessful[i] <- "successful"
>>  one.zero[i] <- 1
>> } else {
>> sucessful.unsucessful[i]<- "unsuccessful"
>>  one.zero[i] <- 0
>> }
>> }
>>
>> if(sum(one.zero) > 0 ) {
>>  result <- "successful"
>> resultL <- 1
>> }
>>
>> plots.success <- function(one.zero){
>>
>> plots.to.success<-NULL
>>
>> for(i in 1:n){
>> if(sum(one.zero)==0){plots.to.success= NA
>>  } else if(one.zero[i]==1){plots.to.success<-append(plots.to.success, i)}
>> }
>> plots.to.success<-min(plots.to.success)
>> plots.to.success
>> }
>>
>> plots.to.success <- plots.success(one.zero=one.zero)
>>
>> complete <-data.frame (min.plots, number.passing, sucessful.unsucessful,
>> one.zero)
>> collapsed <- data.frame(result, resultL, plots.to.success)
>> out <- c(as.list(complete),as.list(collapsed))
>> }
>>
>>
>> unpaired.out<-sequential.unpaired(number.strata=2,
>> verification.plots=verification.plots, project.n=15, project.mean=100,
>> project.sd=10, allowed.deviation=.05, project.acres=99,
>> min.plots="default")
>>
>> str(unpaired.out)
>> #unpaired.out[[7]][1]
>>
>> ##
>>
>> simulation.unpaired <- function(reps, project.acreage.range = c(99,
>> 110,510,5100,11000), project.mean, project.n.min, project.n.max,
>> project.sd.min, project.sd.max, verification.mean.min,
>> verification.mean.max, number.verification.plots.min,
>> number.verification.plots.max, allowed.deviation=.1, alpha=.05, length.out
>> = 5, min.plots="default") {
>>
>> number.strata.range<-c(1:3) # set by CAR
>>
>> project.n.range <- seq(project.n.min, project.n.max,
>> length.out=length.out)
>>
>> project.sd.range <- seq(project.sd.min, project.sd.max,
>> length.out=length.out) # assume verification sd is the same as the project
>> sd to simplify - seems a reasonable simpification
>>
>> number.verification.plots<- seq(number.verification.plots.min,
>> number.verification.plots.max, length.out=length.out)
>>
>> verification.range <- seq(verification.mean.min, verification.mean.max,
>> length.out=length.out)
>>
>> permut.grid<-expand.grid(number.strata.range, project.n.range,
>> project.acreage.range, project.mean, project.sd.range,
>> number.verification.plots, verification.range, allowed.deviation) # create
>> a matrix with all combinations of the supplied vectors
>>
>> #assign names to the colums of the grid of combinations
>> names.permut<-c("number.strata", "project.n.plots", "project.acreage",
>> "project.mean", "project.sd", "number.verification.plots",
>> "verification.mean", "allowed.deviation")
>>
>> names(permut.grid)<-names.permut # done
>>
>> combinations<-length(permut.grid[,1])
>>
>> size <-reps*combinations #need to know the size of the master matrix,
>> which
>> is the the number of replications * each combination of the supplied
>> factors
>>
>> # we want a df from which to read all the data into the simulation, and
>> record the results
>> permut.col<-ncol(permut.grid)
>> col.base<-ncol(permut.grid)+2
>> base <- matrix(nrow=size, ncol=col.base)
>> base <-data.frame(base)
>>
>> # supply the names
>> names.base<-c("number.strata", "project.n.plots", "project.acreage",
>> "project.mean", "project.sd", "number.verification.plots",
>> "verification.mean", "allowed.deviation","success.fail",
>> "plots.to.success")
>>
>> names(base)<-names.base
>>
>> # need to create index vectors for the base, master df
>> ends <- seq(reps+1, size+1, by=reps)
>> begins <- ends-reps
>> index <- cbind(begins, ends-1)
>> #done
>>
>> # next, need to assign the first 6 columns and number of rows = to the
>> number of reps in the simulation to be the given row in the permut.grid
>> matrix
>>
>> pb <- winProgressBar(title="Create base, big loop 1 of 2", label="0%
>> done",
>> min=0, max=100, initial=0)
>>
>> for (i in 1:combinations) {
>>
>> base[index[i,1]:index[i,2],1:permut.col] <- permut.grid[i,]
>> #progress bar
>>  info <- sprintf("%d%% done", round((i/combinations)*100))
>> setWinProgressBar(pb, (i/combinations)*100, label=info)
>> }
>>
>> close(pb)
>>
>> # now, simply feed the values replicated the number of times we want to
>> run
>> the simulation into the sequential.unpaired function, and assign the
>> values
>> to the appropriate columns
>>
>> out.index1<-ncol(permut.grid)+1
>> out.index2<-ncol(permut.grid)+2
>>
>> #progress bar
>> pb <- winProgressBar(title="fill values, big loop 2 of 2", label="0%
>> done",
>> min=0, max=100, initial=0)
>>
>> for (i in 1:size){
>>
>> scalar.base <- base[i,]
>>  verification.plots <- rnorm(scalar.base$number.verification.plots,
>> scalar.base$verification.mean, scalar.base$project.sd)
>>  result<- sequential.unpaired(scalar.base$number.strata,
>> scalar.base$project.n.plots, scalar.base$project.mean, scalar.base$
>> project.sd, verification.plots, scalar.base$allowed.deviation,
>> scalar.base$project.acreage, min.plots='default', alpha)
>>
>> base[i,out.index1] <- result[[6]][1]
>> base[i,out.index2] <- result[[7]][1]
>>  info <- sprintf("%d%% done", round((i/size)*100))
>> setWinProgressBar(pb, (i/size)*100, label=info)
>> }
>>
>> close(pb)
>> #return the results
>> return(base)
>>
>> }
>>
>> # I would like reps to = 1000, but that takes a really long time right now
>> test5 <- simulation.unpaired(reps=5, project.acreage.range = c(99,
>> 110,510,5100,11000), project.mean=100, project.n.min=10,
>> project.n.max=100,
>> project.sd.min=.01,  project.sd.max=.2, verification.mean.min=100,
>> verification.mean.max=130, number.verification.plots.min=10,
>> number.verification.plots.max=50, length.out = 5)
>>
>>         [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> Anindya Sankar Dey
>
>
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] improving/speeding up a very large, slow simulation

Reply via email to