That makes sense David; thanks again everyone for your help. *Ben Caldwell*
Graduate Fellow University of California, Berkeley 130 Mulford Hall #3114 Berkeley, CA 94720 Office 223 Mulford Hall (510)859-3358 On Wed, Feb 13, 2013 at 4:25 PM, David Winsemius <dwinsem...@comcast.net>wrote: > > On Feb 13, 2013, at 3:33 PM, Benjamin Caldwell wrote: > > > Joshua, > > > > Thanks for the example, and the explanation. Nice article that you wrote > - > > the figures alone deserve a deeper dive, I think. > > > > As side note, I found out about profiling on another post and did it - it > > appears that my ignorance about how slow dataframes can be was the main > > problem. After turning the dfs into lists and matrixes, the improvement > was > > huge! Wish I had known about this issue ahead of time - it must be > > somewhere in the R help books in big red letters, but if it isn't it > should > > be. Seems most of the discussions about speeding up code talks about > > vectorization and avoiding fragmentation. > > I don't know if that information is in the unnamed books you read, but it > is no surprise to regular readers of Rhelp that `[<-.data.frame` is a > bottleneck. It requires a lot of extra overhead to support the possibility > of having different datatypes in each column and to maintain row numbering > correctly. > > -- > David. > > > > > > The below was for five iterations of the simulation, and I imagine was > the > > reason it was hanging up and stalling out for large simulations: > > > > > > Original: > > > > $by.total > > total.time total.pct self.time self.pct > > simulation.unpaired.c 2857.54 99.97 111.90 3.91 > > [<- 2555.74 89.41 2.78 0.10 > > [<-.data.frame 2552.96 89.32 2543.84 89.00 > > sequential.unpaired.c 149.48 5.23 2.66 0.09 > > unpaired.test.c 92.30 3.23 18.80 0.66 > > data.frame 70.72 2.47 9.68 0.34 > > as.data.frame 42.36 1.48 3.74 0.132 > > > > > > only using dataframe as the output: > > > > $by.total > > total.time total.pct self.time self.pct > > simulation.unpaired.c 109.14 100.00 0.90 0.82 > > sequential.unpaired.c 92.16 84.44 3.28 3.01 > > unpaired.test.c 84.48 77.41 21.08 19.31 > > sd 43.68 40.02 5.42 4.97 > > > > > > Thanks again folks. > > > > > > On Tue, Feb 12, 2013 at 9:45 PM, Joshua Wiley <jwiley.ps...@gmail.com > >wrote: > > > >> Hi Ben, > >> > >> That is correct about vectorizing----to some speed tests to see the > >> effects. They can be quite dramatic (like I have easily seen 300fold > >> speedups from doing that). The apply functions are essentially > >> loops---they tend to be about the same speed as loops, though slightly > >> less code. > >> > >> Compiler is nice---not it will not help already vectorized code > >> (because vectorized code really just means code that is linked to C > >> code that uses a compiled loop, but C is much faster than R. > >> > >> Sure, check out the bootstrapping (midway down) section on this page I > >> wrote: http://www.ats.ucla.edu/stat/r/dae/melogit.htm > >> > >> Cheers, > >> > >> Josh > >> > >> On Tue, Feb 12, 2013 at 12:57 PM, Benjamin Caldwell > >> <btcaldw...@berkeley.edu> wrote: > >>> Joshua, > >>> > >>> So, to your first suggestion - try to figure out whether some operation > >>> performed on every element of the vector at once could do the same > thing > >> - > >>> the answer is yes! Where can I see examples of that implemented? > >>> > >>> e.g. would it just be > >>> > >>> a <- 1:10000 > >>> b <- 1:10000 > >>> c <- 1:10000 > >>> d <- data.frame(b,c) > >>> > >>> vectorize<-function(a, d) { > >>> > >>> f <- d[,1]+d[,2]+a > >>> f > >>> } > >>> > >>> out<-vectorize(a=a,d=d) > >>> out > >>> > >>> vs > >>> > >>> loop<- function(a,d){ > >>> > >>> for(i in 1:length(d[,1])){ > >>> a[i]<-d[i,1]+d[i,2]+a[i] > >>> } > >>> a > >>> } > >>> > >>> out.l<-loop(a,d) > >>> > >>> out.l > >>> > >>> If so, it's vectorization that's the big advantage, not something > >> mysterious > >>> that's happening under the hood with the apply functions? > >>> > >>> To your second : Compiler - thanks for the tip, that's great to know! > >>> > >>> To your last, could you please give an example or point me to one that > >> was > >>> useful for you? I don't understand that well enough to use it. > >>> > >>> Thanks! > >>> > >>> Ben Caldwell > >>> > >>> Graduate Fellow > >>> University of California, Berkeley > >>> 130 Mulford Hall #3114 > >>> Berkeley, CA 94720 > >>> Office 223 Mulford Hall > >>> (510)859-3358 > >>> > >>> > >>> On Mon, Feb 11, 2013 at 10:10 PM, Joshua Wiley <jwiley.ps...@gmail.com > > > >>> wrote: > >>>> > >>>> Hi Ben, > >>>> > >>>> I appreciate that the example is reproducible, and I understand why > >>>> you shared the entire project. At the same time, that is an awful lot > >>>> of code for volunteers to go through and try to optimize. > >>>> > >>>> I would try to think of the structure of your task, see what the > >>>> individual pieces are---figure out what can be reused and optimized. > >>>> Look at loops and try to think whether some operation performed on > >>>> every element of the vector at once could do the same thing. > >>>> Sometimes the next iteration of a loop depends on the previous, so it > >>>> is difficult to vectorize, but often that is not the case. > >>>> > >>>> Make use of the compiler package, especially cmpfun(). It can give > >>>> fairly substantial speedups, particularly with for loops in R. > >>>> > >>>> Finally if you want 1000 reps, do you actually need to repeat all the > >>>> steps 1000 times, or could you just simulate 1000x the data? If the > >>>> latter, do the steps once but make more data. If the former, you > >>>> ought to be able to parallelize it fairly easily, see the parallel > >>>> package. > >>>> > >>>> Good luck, > >>>> > >>>> Josh > >>>> > >>>> > >>>> On Mon, Feb 11, 2013 at 9:22 PM, Benjamin Caldwell > >>>> <btcaldw...@berkeley.edu> wrote: > >>>>> Dear R help; > >>>>> > >>>>> I'll preface this by saying that the example I've provided below is > >>>>> pretty > >>>>> long, turgid, and otherwise a deep dive into a series of functions I > >>>>> wrote > >>>>> for a simulation study. It is, however, reproducible and > >> self-contained. > >>>>> > >>>>> I'm trying to do my first simulation study that's quite big, and so > >> I'll > >>>>> say that the output of this simulation as I'd like it to be is 9.375 > >>>>> million rows (9375 combinations of the inputs * 1000). So, first > >> thing, > >>>>> maybe I should try to cut down on the number of dimensions and do it > a > >>>>> piece at a time. If that's a conclusion people come to here, I'll > look > >>>>> into > >>>>> paring down/doing it a chunk at a time. > >>>>> > >>>>> I'm hoping, though, that this is an opportunity for me to learn some > >>>>> more > >>>>> efficient, elegant ways to a. vectorize and b. use the apply > >> functions, > >>>>> which still seem to elude me when it comes to their use with more > >>>>> complex, > >>>>> custom functions that have multiple inputs. > >>>>> > >>>>> If having looked the below over you can give me suggestions to help > >> make > >>>>> this and future code I write run more quickly/efficiently, I will be > >>>>> most grateful, as as this is currently written it would take what > >> looks > >>>>> like days to run a thousand simulations of each possible combination > >> of > >>>>> variables of interest. > >>>>> > >>>>> Best > >>>>> > >>>>> Ben Caldwell > >>>>> > >>>>> ----------------------------------------------- > >>>>> #unpaired > >>>>> verification.plots<-rnorm(30, 100, 10) > >>>>> # writeClipboard(as.character(verification.plots)) > >>>>> > >>>>> unpaired.test<- function(verification.plots, project.n, project.mean, > >>>>> project.sd, allowed.deviation, project.acres, alpha=.05){ > >>>>> > >>>>> verification.plots <-as.numeric(as.character(verification.plots)) > >>>>> a <- qnorm(alpha/2) > >>>>> d <- alpha*project.mean > >>>>> > >>>>> # verifier plot number > >>>>> > >>>>> n<-length(verification.plots) > >>>>> verifier.plot.number <- c(1:n) > >>>>> > >>>>> > >>>>> #all plots (verifier and project) > >>>>> > >>>>> all.plots.n <- rep(0,n) > >>>>> for(i in 1:n){ > >>>>> all.plots.n[i] <- project.n + verifier.plot.number[i] > >>>>> } > >>>>> > >>>>> #running mean > >>>>> X_bar <- rep(1,n) > >>>>> > >>>>> for(i in 1:n){ > >>>>> X_bar[i]<- mean(verification.plots[1:i]) > >>>>> } > >>>>> #running sd > >>>>> sd2 <- NULL > >>>>> > >>>>> for(i in 2:n){ > >>>>> sd2[i]<-sd(verification.plots[1:i]) > >>>>> } > >>>>> #Tn > >>>>> Tn<-NULL > >>>>> > >>>>> for(i in 2:n){ > >>>>> Tn[i] <- project.mean-X_bar[i] > >>>>> } > >>>>> #n_Threshold > >>>>> n_thresh<-NULL > >>>>> > >>>>> for(i in 2:n){ > >>>>> n_thresh[i] <- (((a^2)/(d^2))*((project.sd^2)+(sd2[i]^2))) > >>>>> } > >>>>> #Upper > >>>>> upper <- NULL > >>>>> for (i in 2:n){ > >>>>> upper[i] <- Tn[i]-d > >>>>> } > >>>>> #Lower > >>>>> lower<- NULL > >>>>> for(i in 2:n){ > >>>>> lower[i] <- Tn[i]+d > >>>>> } > >>>>> > >>>>> #Success.fail > >>>>> success.fail <- NULL > >>>>> > >>>>> > >>>>> > >>>>> success.fail.num <- rep(0,n) > >>>>> > >>>>> > >> > if(abs(verification.plots[1]-project.mean)<(allowed.deviation*project.mean)){ > >>>>> success.fail[1] <- "Pass" > >>>>> success.fail.num[1] <- 1 > >>>>> } else { > >>>>> success.fail[1] <- "Fail" > >>>>> success.fail.num[1] <-0 > >>>>> } > >>>>> for(i in 2:n){ > >>>>> if(all.plots.n[i]>=n_thresh[i]){ > >>>>> if(upper[i] <= 0){ > >>>>> if(lower[i] >= 0){ > >>>>> success.fail[i]<-"Pass" > >>>>> success.fail.num[i]<-1 > >>>>> } else { > >>>>> success.fail[i] <- "Fail" > >>>>> success.fail.num[i] <-0} > >>>>> } else { > >>>>> success.fail[i] <- "Fail" > >>>>> success.fail.num[i] <- 0 > >>>>> } > >>>>> > >>>>> } else { > >>>>> success.fail[i] <- "Inconclusive" > >>>>> success.fail.num[i] <- 0 > >>>>> } > >>>>> } > >>>>> > >>>>> out.unpaired.test <- data.frame(success.fail, success.fail.num) > >>>>> } > >>>>> > >>>>> > >>>>> > >> out.unpaired.test<-unpaired.test(verification.plots=verification.plots, > >>>>> project.n=30, project.mean=100, project.sd=10, > allowed.deviation=.05, > >>>>> project.acres=99) > >>>>> out.unpaired.test > >>>>> # > >>>>> sequential.unpaired<- function(number.strata, project.n, > project.mean, > >>>>> project.sd, verification.plots,allowed.deviation, project.acres, > >>>>> min.plots="default", alpha=.05){ > >>>>> > >>>>> if(min.plots=="default"){ > >>>>> min.plots<-NULL # minimum passing > >>>>> if(number.strata == 1 & project.acres <= 100){ > >>>>> min.plots = 8 > >>>>> } else if(number.strata == 1 & project.acres > 100 & project.acres < > >>>>> 500){ > >>>>> min.plots = 12 > >>>>> } else if(number.strata == 1 & project.acres > 500 & project.acres < > >>>>> 5000){ > >>>>> min.plots = 16 > >>>>> } else if(number.strata == 1 & project.acres > 5000 & > >> project.acres > >>>>> < > >>>>> 10000){ > >>>>> min.plots = 20 > >>>>> } else if(number.strata == 1 & project.acres > 10000){ > >>>>> min.plots = 24 > >>>>> } else if(number.strata == 2 & project.acres < 100){ > >>>>> min.plots = 4 > >>>>> } else if(number.strata == 2 & project.acres > 100 & project.acres < > >>>>> 500){ > >>>>> min.plots = 6 > >>>>> } else if(number.strata == 2 & project.acres > 501 & project.acres < > >>>>> 5000){ > >>>>> min.plots = 8 > >>>>> } else if(number.strata == 2 & project.acres > 5001 & project.acres < > >>>>> 10000){ > >>>>> min.plots = 10 > >>>>> } else if(number.strata == 2 & project.acres > 10000){ > >>>>> min.plots = 12 > >>>>> } else if(number.strata == 3 & project.acres < 100){ > >>>>> min.plots = 2 > >>>>> } else if(number.strata == 3 & project.acres > 100 & > >> project.acres < > >>>>> 500){ > >>>>> min.plots = 3 > >>>>> } else if(number.strata == 3 & project.acres > 501 & project.acres < > >>>>> 5000){ > >>>>> min.plots = 4 > >>>>> } else if(number.strata == 3 & project.acres > 5001 & project.acres < > >>>>> 10000){ > >>>>> min.plots = 5 > >>>>> } else if(number.strata == 3 & project.acres > 10000){ > >>>>> min.plots = 6 > >>>>> } else {min.plots=NULL} > >>>>> } else (min.plots = min.plots) > >>>>> > >>>>> > >>>>> if(number.strata > 3){print("max three strata")} > >>>>> if(number.strata > 3){break} > >>>>> > >>>>> > >>>>> p <- length(verification.plots) > >>>>> mp <- min.plots > >>>>> run <- unpaired.test(verification.plots, project.n, project.mean, > >>>>> project.sd, > >>>>> allowed.deviation, project.acres, alpha=.05) > >>>>> number.passing.plots <- function(verification.plots, > >> success.fail.num){ > >>>>> n<-length(verification.plots) > >>>>> number.passing<-rep(0,n) > >>>>> number.passing[1]<-ifelse(sum(success.fail.num[1]> mp),1,0) > >>>>> for(i in 2:n){ > >>>>> if(success.fail.num[i] == 1){ > >>>>> v <- i-1 > >>>>> number.passing[i]<-number.passing[v]+1 > >>>>> } else{number.passing[i] <- 0} > >>>>> } > >>>>> number.passing > >>>>> } > >>>>> > >>>>> > >> > number.passing<-number.passing.plots(verification.plots=verification.plots, > >>>>> success.fail.num=run$success.fail.num) > >>>>> sucessful.unsucessful<-rep("blank",p) > >>>>> one.zero <- rep(0,p) > >>>>> result <- "blank" > >>>>> resultL <- 0 > >>>>> n <-length(verification.plots) > >>>>> for(i in 1:n){ > >>>>> if(number.passing[i] >= mp) { > >>>>> sucessful.unsucessful[i] <- "successful" > >>>>> one.zero[i] <- 1 > >>>>> } else { > >>>>> sucessful.unsucessful[i]<- "unsuccessful" > >>>>> one.zero[i] <- 0 > >>>>> } > >>>>> } > >>>>> > >>>>> if(sum(one.zero) > 0 ) { > >>>>> result <- "successful" > >>>>> resultL <- 1 > >>>>> } > >>>>> > >>>>> plots.success <- function(one.zero){ > >>>>> > >>>>> plots.to.success<-NULL > >>>>> > >>>>> for(i in 1:n){ > >>>>> if(sum(one.zero)==0){plots.to.success= NA > >>>>> } else if(one.zero[i]==1){plots.to.success<-append(plots.to.success, > >>>>> i)} > >>>>> } > >>>>> plots.to.success<-min(plots.to.success) > >>>>> plots.to.success > >>>>> } > >>>>> > >>>>> plots.to.success <- plots.success(one.zero=one.zero) > >>>>> > >>>>> complete <-data.frame (min.plots, number.passing, > >> sucessful.unsucessful, > >>>>> one.zero) > >>>>> collapsed <- data.frame(result, resultL, plots.to.success) > >>>>> out <- c(as.list(complete),as.list(collapsed)) > >>>>> } > >>>>> > >>>>> > >>>>> unpaired.out<-sequential.unpaired(number.strata=2, > >>>>> verification.plots=verification.plots, project.n=15, > project.mean=100, > >>>>> project.sd=10, allowed.deviation=.05, project.acres=99, > >>>>> min.plots="default") > >>>>> > >>>>> str(unpaired.out) > >>>>> #unpaired.out[[7]][1] > >>>>> > >>>>> ## > >>>>> > >>>>> simulation.unpaired <- function(reps, project.acreage.range = c(99, > >>>>> 110,510,5100,11000), project.mean, project.n.min, project.n.max, > >>>>> project.sd.min, project.sd.max, verification.mean.min, > >>>>> verification.mean.max, number.verification.plots.min, > >>>>> number.verification.plots.max, allowed.deviation=.1, alpha=.05, > >>>>> length.out > >>>>> = 5, min.plots="default") { > >>>>> > >>>>> number.strata.range<-c(1:3) # set by CAR > >>>>> > >>>>> project.n.range <- seq(project.n.min, project.n.max, > >>>>> length.out=length.out) > >>>>> > >>>>> project.sd.range <- seq(project.sd.min, project.sd.max, > >>>>> length.out=length.out) # assume verification sd is the same as the > >>>>> project > >>>>> sd to simplify - seems a reasonable simpification > >>>>> > >>>>> number.verification.plots<- seq(number.verification.plots.min, > >>>>> number.verification.plots.max, length.out=length.out) > >>>>> > >>>>> verification.range <- seq(verification.mean.min, > >> verification.mean.max, > >>>>> length.out=length.out) > >>>>> > >>>>> permut.grid<-expand.grid(number.strata.range, project.n.range, > >>>>> project.acreage.range, project.mean, project.sd.range, > >>>>> number.verification.plots, verification.range, allowed.deviation) # > >>>>> create > >>>>> a matrix with all combinations of the supplied vectors > >>>>> > >>>>> #assign names to the colums of the grid of combinations > >>>>> names.permut<-c("number.strata", "project.n.plots", > "project.acreage", > >>>>> "project.mean", "project.sd", "number.verification.plots", > >>>>> "verification.mean", "allowed.deviation") > >>>>> > >>>>> names(permut.grid)<-names.permut # done > >>>>> > >>>>> combinations<-length(permut.grid[,1]) > >>>>> > >>>>> size <-reps*combinations #need to know the size of the master matrix, > >>>>> which > >>>>> is the the number of replications * each combination of the supplied > >>>>> factors > >>>>> > >>>>> # we want a df from which to read all the data into the simulation, > >> and > >>>>> record the results > >>>>> permut.col<-ncol(permut.grid) > >>>>> col.base<-ncol(permut.grid)+2 > >>>>> base <- matrix(nrow=size, ncol=col.base) > >>>>> base <-data.frame(base) > >>>>> > >>>>> # supply the names > >>>>> names.base<-c("number.strata", "project.n.plots", "project.acreage", > >>>>> "project.mean", "project.sd", "number.verification.plots", > >>>>> "verification.mean", "allowed.deviation","success.fail", > >>>>> "plots.to.success") > >>>>> > >>>>> names(base)<-names.base > >>>>> > >>>>> # need to create index vectors for the base, master df > >>>>> ends <- seq(reps+1, size+1, by=reps) > >>>>> begins <- ends-reps > >>>>> index <- cbind(begins, ends-1) > >>>>> #done > >>>>> > >>>>> # next, need to assign the first 6 columns and number of rows = to > the > >>>>> number of reps in the simulation to be the given row in the > >> permut.grid > >>>>> matrix > >>>>> > >>>>> pb <- winProgressBar(title="Create base, big loop 1 of 2", label="0% > >>>>> done", > >>>>> min=0, max=100, initial=0) > >>>>> > >>>>> for (i in 1:combinations) { > >>>>> > >>>>> base[index[i,1]:index[i,2],1:permut.col] <- permut.grid[i,] > >>>>> #progress bar > >>>>> info <- sprintf("%d%% done", round((i/combinations)*100)) > >>>>> setWinProgressBar(pb, (i/combinations)*100, label=info) > >>>>> } > >>>>> > >>>>> close(pb) > >>>>> > >>>>> # now, simply feed the values replicated the number of times we want > >> to > >>>>> run > >>>>> the simulation into the sequential.unpaired function, and assign the > >>>>> values > >>>>> to the appropriate columns > >>>>> > >>>>> out.index1<-ncol(permut.grid)+1 > >>>>> out.index2<-ncol(permut.grid)+2 > >>>>> > >>>>> #progress bar > >>>>> pb <- winProgressBar(title="fill values, big loop 2 of 2", label="0% > >>>>> done", > >>>>> min=0, max=100, initial=0) > >>>>> > >>>>> for (i in 1:size){ > >>>>> > >>>>> scalar.base <- base[i,] > >>>>> verification.plots <- rnorm(scalar.base$number.verification.plots, > >>>>> scalar.base$verification.mean, scalar.base$project.sd) > >>>>> result<- sequential.unpaired(scalar.base$number.strata, > >>>>> scalar.base$project.n.plots, scalar.base$project.mean, scalar.base$ > >>>>> project.sd, verification.plots, scalar.base$allowed.deviation, > >>>>> scalar.base$project.acreage, min.plots='default', alpha) > >>>>> > >>>>> base[i,out.index1] <- result[[6]][1] > >>>>> base[i,out.index2] <- result[[7]][1] > >>>>> info <- sprintf("%d%% done", round((i/size)*100)) > >>>>> setWinProgressBar(pb, (i/size)*100, label=info) > >>>>> } > >>>>> > >>>>> close(pb) > >>>>> #return the results > >>>>> return(base) > >>>>> > >>>>> } > >>>>> > >>>>> # I would like reps to = 1000, but that takes a really long time > right > >>>>> now > >>>>> test5 <- simulation.unpaired(reps=5, project.acreage.range = c(99, > >>>>> 110,510,5100,11000), project.mean=100, project.n.min=10, > >>>>> project.n.max=100, > >>>>> project.sd.min=.01, project.sd.max=.2, verification.mean.min=100, > >>>>> verification.mean.max=130, number.verification.plots.min=10, > >>>>> number.verification.plots.max=50, length.out = 5) > >>>>> > >>>>> [[alternative HTML version deleted]] > >>>>> > >>>>> ______________________________________________ > >>>>> R-help@r-project.org mailing list > >>>>> https://stat.ethz.ch/mailman/listinfo/r-help > >>>>> PLEASE do read the posting guide > >>>>> http://www.R-project.org/posting-guide.html > >>>>> and provide commented, minimal, self-contained, reproducible code. > >>>> > >>>> > >>>> > >>>> -- > >>>> Joshua Wiley > >>>> Ph.D. Student, Health Psychology > >>>> Programmer Analyst II, Statistical Consulting Group > >>>> University of California, Los Angeles > >>>> https://joshuawiley.com/ > >>> > >>> > >> > >> > >> > >> -- > >> Joshua Wiley > >> Ph.D. Student, Health Psychology > >> Programmer Analyst II, Statistical Consulting Group > >> University of California, Los Angeles > >> https://joshuawiley.com/ > >> > > > > [[alternative HTML version deleted]] > > > > ______________________________________________ > > R-help@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > David Winsemius > Alameda, CA, USA > > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.