Thank Jim! This makes a huge difference. Can you explain why are data frame slower than a matrix? Any other suggestions on how to improve the code would be greatly appreciated.
Thanks again! Ignacio On Thu, Jul 16, 2015 at 1:42 PM jim holtman <jholt...@gmail.com> wrote: > Actually looking at the result, you don't need the transpose; that was an > artifact of how you were doing it before. > > xm <- do.call(rbind, str_split(string = AllpairsTmp, pattern = "-")) > # convert to dataframe and do transpose on matrix and not dataframe > separoPairs <- as.data.frame((xm), stringsAsFactors = FALSE) > > > > > Jim Holtman > Data Munger Guru > > What is the problem that you are trying to solve? > Tell me what you want to do, not how you want to do it. > > On Thu, Jul 16, 2015 at 1:37 PM, jim holtman <jholt...@gmail.com> wrote: > >> Here is one improvement. Avoid dataframes in some of these cases. This >> create a character matrix and then converts to a dataframe after doing the >> transpose of the matrix. This just takes less than 10 seconds on my system: >> >> >> > library(stringr) >> > # create character matrix; avoid dataframes in this case >> > print(proc.time()) >> user system elapsed >> 15.52 5.24 587.70 >> > xm <- do.call(rbind, str_split(string = AllpairsTmp, pattern = "-")) >> > # convert to dataframe and do transpose on matrix and not dataframe >> > separoPairs <- as.data.frame(t(xm), stringsAsFactors = FALSE) >> > print(proc.time() >> + >> + ) >> user system elapsed >> 20.90 5.36 596.57 >> > >> >> >> Jim Holtman >> Data Munger Guru >> >> What is the problem that you are trying to solve? >> Tell me what you want to do, not how you want to do it. >> >> On Thu, Jul 16, 2015 at 7:56 AM, Ignacio Martinez <ignaci...@gmail.com> >> wrote: >> >>> Hi Collin, >>> >>> The objective of the gen.names function is to generate N *unique *random >>> names, where N is a *large *number. In my computer `gen.names(n = 50000)` >>> takes under a second, so is probably not the root problem in my code. >>> That >>> said, I would love to improve it. I'm not exactly sure how you propose to >>> change it using sample. What is the object that I would be sampling? I >>> would love to run a little benchmark to compare my version with yours. >>> >>> What really takes a long time to run is: >>> >>> separoPairs <- as.data.frame(str_split(string = AllpairsTmp, pattern >>> = >>> "-")) >>> >>> So that and the chunk of code before that is probably where I would get >>> big >>> gains in speed. Sadly, I have no clue how to do it differently >>> >>> Thanks a lot for the help!! >>> >>> Ignacio >>> >>> >>> On Wed, Jul 15, 2015 at 11:34 PM Collin Lynch <cfly...@ncsu.edu> wrote: >>> >>> > Hi Ignacio, If I am reading your code correctly then the top while >>> loop is >>> > essentially seeking to select a random set of names from the original >>> set, >>> > then using unique to reduce it down, you then iterate until you have >>> built >>> > your quota. Ultimately this results in a very inefficient attempt at >>> > sampling without replacement. Why not just sample without replacement >>> > rather than loop iteratively and use unique? Or if the set of possible >>> > names are short enough why not just randomize it and then pull the >>> first n >>> > items off? >>> > >>> > Best, >>> > Collin. >>> > >>> > On Wed, Jul 15, 2015 at 11:15 PM, Ignacio Martinez < >>> ignaci...@gmail.com> >>> > wrote: >>> > >>> >> Hi R-Help! >>> >> >>> >> I'm hoping that some of you may give me some tips that could make my >>> code >>> >> >>> > more efficient. More precisely, I would like to make the answer to my >>> >> stakoverflow >>> >> < >>> >> >>> http://stackoverflow.com/questions/31137940/randomly-assign-teachers-to-classrooms-imposing-restrictions >>> >>> >> > >>> > >>> > >>> >> question more efficient. >>> >> >>> >> This is the code: >>> >> >>> >> library(dplyr) >>> >> library(randomNames) >>> >> library(geosphere) >>> >> >>> > set.seed(7142015)# Define Parameters >>> > >>> > >>> >> n.Schools <- 20 >>> >> first.grade<-3 >>> >> last.grade<-5 >>> >> n.Grades <-last.grade-first.grade+1 >>> >> n.Classrooms <- 20 # THIS IS WHAT I WANTED TO BE ABLE TO CHANGE >>> >> n.Teachers <- (n.Schools*n.Grades*n.Classrooms)/2 #Two classrooms per >>> >> teacher >>> >> # Define Random names function: >>> >> gen.names <- function(n, which.names = "both", name.order = >>> "last.first"){ >>> >> names <- unique(randomNames(n=n, which.names = which.names, >>> >> name.order = name.order)) >>> >> need <- n - length(names) >>> >> while(need>0){ >>> >> names <- unique(c(randomNames(n=need, which.names = which.names, >>> >> name.order = name.order), names)) >>> >> need <- n - length(names) >>> >> } >>> >> return(names)} >>> >> # Generate n.Schools names >>> >> gen.schools <- function(n.schools) { >>> >> School.ID <- >>> >> paste0(gen.names(n = n.schools, which.names = "last"), ' School') >>> >> School.long <- rnorm(n = n.schools, mean = 21.7672, sd = 0.025) >>> >> School.lat <- rnorm(n = n.schools, mean = 58.8471, sd = 0.025) >>> >> School.RE <- rnorm(n = n.schools, mean = 0, sd = 1) >>> >> Schools <- >>> >> data.frame(School.ID, School.lat, School.long, School.RE) %>% >>> >> mutate(School.ID = as.character(School.ID)) %>% >>> >> rowwise() %>% mutate (School.distance = distHaversine( >>> >> p1 = c(School.long, School.lat), >>> >> p2 = c(21.7672, 58.8471), r = 3961 >>> >> )) >>> >> return(Schools)} >>> >> >>> >> Schools <- gen.schools(n.schools = n.Schools) >>> >> # Generate Grades >>> >> Grades <- c(first.grade:last.grade) >>> >> # Generate n.Classrooms >>> >> >>> >> Classrooms <- LETTERS[1:n.Classrooms] >>> >> # Group schools and grades >>> >> >>> >> SchGr <- outer(paste0(Schools$School.ID, '-'), paste0(Grades, '-'), >>> >> FUN="paste")#head(SchGr) >>> >> # Group SchGr and Classrooms >>> >> >>> >> SchGrClss <- outer(SchGr, paste0(Classrooms, '-'), >>> >> FUN="paste")#head(SchGrClss) >>> >> # These are the combination of School-Grades-Classroom >>> >> SchGrClssTmp <- as.matrix(SchGrClss, ncol=1, nrow=length(SchGrClss) ) >>> >> SchGrClssEnd <- as.data.frame(SchGrClssTmp) >>> >> # Assign n.Teachers (2 classroom in a given school-grade) >>> >> Allpairs <- as.data.frame(t(combn(SchGrClssTmp, 2))) >>> >> AllpairsTmp <- paste(Allpairs$V1, Allpairs$V2, sep=" ") >>> >> >>> >> library(stringr) >>> >> separoPairs <- as.data.frame(str_split(string = AllpairsTmp, pattern = >>> >> "-")) >>> >> separoPairs <- as.data.frame(t(separoPairs)) >>> >> row.names(separoPairs) <- NULL >>> >> separoPairs <- separoPairs %>% select(-V7) %>% #Drops empty column >>> >> mutate(V1=as.character(V1), V4=as.character(V4), V2=as.numeric(V2), >>> >> V5=as.numeric(V5)) %>% mutate(V4 = trimws(V4, which = "both")) >>> >> >>> >> separoPairs[120,]$V4#Only the rows with V1=V4 and V2=V5 are valid >>> > >>> > >>> >> validPairs <- separoPairs %>% filter(V1==V4 & V2==V5) %>% select(V1, >>> V2, >>> >> V3, V6) >>> >> # Generate n.Teachers >>> >> >>> >> gen.teachers <- function(n.teachers){ >>> >> Teacher.ID <- gen.names(n = n.teachers, name.order = "last.first") >>> >> Teacher.exp <- runif(n = n.teachers, min = 1, max = 30) >>> >> Teacher.Other <- sample(c(0,1), replace = T, prob = c(0.5, 0.5), >>> >> size = n.teachers) >>> >> Teacher.RE <- rnorm(n = n.teachers, mean = 0, sd = 1) >>> >> Teachers <- data.frame(Teacher.ID, Teacher.exp, Teacher.Other, >>> >> Teacher.RE) >>> >> return(Teachers)} >>> >> Teachers <- gen.teachers(n.teachers = n.Teachers) %>% >>> >> mutate(Teacher.ID = as.character(Teacher.ID)) >>> >> # Randomly assign n.Teachers teachers to the "ValidPairs" >>> >> TmpAssignments <- validPairs[sample(1:nrow(validPairs), n.Teachers), ] >>> >> Assignments <- cbind.data.frame(Teachers$Teacher.ID, TmpAssignments) >>> >> names(Assignments) <- c("Teacher.ID", "School.ID", "Grade", "Class_1", >>> >> "Class_2") >>> >> # Tidy Data >>> >> library(tidyr) >>> >> TeacherClassroom <- Assignments %>% >>> >> gather(x, Classroom, Class_1,Class_2) %>% >>> >> select(-x) %>% >>> >> mutate(Teacher.ID = as.character(Teacher.ID)) >>> >> # Merge >>> >> DF_Classrooms <- TeacherClassroom %>% full_join(Teachers, >>> >> by="Teacher.ID") %>% full_join(Schools, by="School.ID") >>> >> rm(list=setdiff(ls(), "DF_Classrooms")) # Clean the work space! >>> >> >>> >> *I want to end up with the same* 'DF_Classrooms *data frame* but >>> getting >>> > >>> > >>> >> there in a more efficient way. In particular, when is use n.Classrooms >>> >> <-4 the >>> >> >>> > code run fast, but *if I increase it to something like 20 it is >>> painfully >>> >> slow.* >>> >> >>> >> Thanks!!! >>> >> >>> >> [[alternative HTML version deleted]] >>> >> >>> >> ______________________________________________ >>> >> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> >> https://stat.ethz.ch/mailman/listinfo/r-help >>> >> PLEASE do read the posting guide >>> >> http://www.R-project.org/posting-guide.html >>> >> and provide commented, minimal, self-contained, reproducible code. >>> >> >>> > >>> > >>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________________________ >>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide >>> http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >>> >> >> > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.