On Wed, 11 Jan 2012, Ray Brownrigg wrote:
> On Wed, 11 Jan 2012, iliketurtles wrote:
> > ##I have 2 columns of data. The first column is unique "event IDs" that
> > ##represent a phone call made to a customer.
> > ###So, if you see 3 entries together in the first column like follows:
> >
> > matrix(c("call1a","call1a","call1a") )
> >
> > ##then this means that this particular phone call (the first call that's
> > ##logged in the data set) was transferred
> > ##between 3 different "modules" before the call was terminated.
> >
> > ##The second column is a numerical description of the module the call
> > ##started with and then got transferred to prior to call termination.
> > ##Now, I'll construct a representative array of the type of data I'm
> > ##dealing with (the real data set goes on for X00,000s of rows):
> > ##(Ignore how I construct the following array, it’s completely unrelated
> > ##to how the actual data set was constructed).
> >
> >
> > a<-sapply(1:50,function(i){paste("call",i,sep="",collapse="")})
> > development.a<-seq(1,40,3)
> > development.a2<-seq(1,40,5)
> > a[development.a]<-a[development.a+1]
> > a[development.a2]<-a[development.a2+1]
> > a[1:2]<-"call2a";a[3]<-"call3a";a[4:5]<-"call5a";a[6:8]<-"call8a";a[9]<-"call9a"
> > b<-c(920010,960010,820009,920010,960500,970050,930010,920010,960500,970050,
> >      930900,870010,840010,960500,920010,970050,930010,960500,920010,970050,
> >      930010,960010,920010,940010,960010,970010,960500,920010,970050,930010,
> >      960500,920010,970050,930010,960500,920010,970050,930010,920010,960500,
> >      970050,930010,920009,960500,970050,930009,940010,960500,960500,960500)
> > data<-as.data.frame(cbind(a,b))
> > colnames(data)<-c("phone calls","modules")
> > dim(data)
> > print(data[1:10,]) #sample of 10 rows
> >
> > # Note that in the real data set, data[,2] ranges from 810,000 to 999,999.
> > I've been tasked with the following:
> > # "For each phone call that BEGINS with the module which is denoted by 81
> > # (i.e. of the form 81X,XXX), what is the expected number of modules in
> > # these calls?"
> > #Then it's the same question for each module beginning with 82, 83,
> > #84..... all the way until 99.
> > #I've created code that I think works for this, but I can't actually run
> > #it on the whole data set. I left it for 30 minutes and it only had about
> > #5% of the task completed (I clicked "STOP" then checked my output to
> > #see if I did it properly, and it seems correct).
> > #I know the apply() family specializes in vector operations, but I can't
> > #figure out how to complete the above question in any way other than
> > #loops.
> >
> > L<-data
> >
> > A<-array(0,dim=c(19,2));rownames(A)<-seq(81,99,1)
> > A<-data.frame(A)
> >
> > for(i in 1:(nrow(L)-1))
> > {
> >   if(L[(i+1),1]!=L[i,1])
> >   {
> >     #aggregate number of modules in the calls that begin with XX
> >     #(not yet averaged).
> >     A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),1]<-
> >     {
> >       A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),1]+
> >         length(grep(as.character(L[i+1,1]),L[,1],value=FALSE))
> >     }
> >
> >     A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),2]<-
> >     {
> >       A[paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse=""),2]+1
> >     }
> >   }
> > }
> >
> > #If I can get this code to be more memory efficient such that I can do it
> > #on a 400,000 row data set, I can do, for example,
> >
> > A[17,1]/A[17,2]
> >
> > #and I'll arrive at the mean number of modules per call where the call
> > #starts with a module that starts with 97.
> >
> > A[17,1]
> > #is 10, which means that, out of every single call that started with a
> > #module of 97X,XXX,
> > #they went through 10 modules in total.
> >
> > A[17,2]
> > #is 6, which means that there was 6 calls in total that began with a
> > #97X,XXX module.
> >
> > #Hence,
> >
> >
> > A[17,1]/A[17,2]
> >
> > #is the average number of modules that were executed in all the calls
> > #that began with a 97X,XXX module.
> >
> >
> > -----
> >
> > Isaac
> > Research Assistant
> > Quantitative Finance Faculty, UTS
>
> I don't see any need for you to use data frames.
>
> If you make A and data (not a good use of a variable name) just matrices,
> you get the same answers at about 10 times the speed (using your example).
>
Further, you should calculate your rowname, namely:

paste(strsplit(as.character(L[i+1,2]),"")[[1]][1:2],sep="",collapse="")

only once per loop iteration, instead of 4 times. This saves another 25-30%
of CPU time.
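
As an aside, the same two-character row name can probably be had more directly with substr(). A minimal sketch, assuming the module codes are already character strings of six digits (as they are once the data has been through cbind() with the call IDs); the helper name module_prefix is made up for illustration:

```r
# Hypothetical helper: first two characters of each module code.
# Assumes the codes are character strings such as "920010".
module_prefix <- function(code) substr(as.character(code), 1, 2)

codes <- c("920010", "970050", "820009")

# The strsplit()/paste() expression from the loop, applied to one code
long_form <- paste(strsplit(as.character(codes[2]), "")[[1]][1:2],
                   sep = "", collapse = "")

module_prefix(codes)                          # vectorised over all codes
identical(module_prefix(codes[2]), long_form)
```

Unlike the strsplit() version, substr() works on the whole column at once, so the prefixes could be computed before the loop rather than inside it.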

And you can combine the two updates into a single assignment. So using the
code:

L <- as.matrix(data)
A <- array(0, dim=c(19, 2)); rownames(A) <- seq(81, 99, 1)
# A <- data.frame(A)
for(i in 1:(nrow(L)-1)) {
  if(L[(i+1),1]!=L[i,1]) {
    myrow <- paste(strsplit(as.character(L[i+1, 2]), "")[[1]][1:2],
                   sep="", collapse="")
    A[myrow, ] <- A[myrow, ] +
      c(length(grep(as.character(L[i+1, 1]), L[, 1], value=FALSE)), 1)
  }
}

is 15 times as fast as your original code.

> Hope this helps,
> Ray Brownrigg
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html and provide commented,
> minimal, self-contained, reproducible code.
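
Going one step beyond the loop, the whole calculation can probably be done without iterating over rows at all. The following is only a sketch under stated assumptions, not what was timed above: it assumes all rows of one call are contiguous, that module codes are six-digit character strings, and the function name mean_modules_by_prefix is invented for illustration. It uses rle() to get the length of each call and tapply() to average those lengths by the two-digit prefix of the first module.

```r
# Sketch: mean number of modules per call, grouped by the first two digits
# of the module each call starts with.
# Assumes rows belonging to the same call are adjacent.
mean_modules_by_prefix <- function(calls, modules) {
  runs <- rle(as.character(calls))                       # one run per call
  first_row <- cumsum(runs$lengths) - runs$lengths + 1   # row where each call starts
  prefix <- substr(as.character(modules[first_row]), 1, 2)
  tapply(runs$lengths, prefix, mean)                     # mean call length per prefix
}

# Tiny made-up example: three calls of 3, 1 and 2 modules
calls   <- c("c1", "c1", "c1", "c2", "c3", "c3")
modules <- c("920010", "960010", "970050", "970050", "920010", "930010")
res <- mean_modules_by_prefix(calls, modules)
res
```

Note that this may not reproduce the loop's numbers exactly. The loop never counts the call that starts in row 1, and grep() treats the call ID as a pattern, so an ID such as "call2" also matches "call20"; the sketch counts every call and uses exact run lengths instead.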