I retrieve for a few hundred times a group of time series (10-15 ts with 10000 values each), on every group I do some calculation, graphs etc. I wonder if there is a faster method than what presented below to get an appropriate timeseries object.
Making a query with RODBC for every group I get a data frame like this: > X ID DATE VALUE 14 3 2000-01-01 00:00:03 0.5726334 4 1 2000-01-01 00:00:03 0.8830174 1 1 2000-01-01 00:00:00 0.2875775 15 3 2000-01-01 00:00:04 0.1029247 11 3 2000-01-01 00:00:00 0.9568333 9 2 2000-01-01 00:00:03 0.5514350 7 2 2000-01-01 00:00:01 0.5281055 6 2 2000-01-01 00:00:00 0.0455565 12 3 2000-01-01 00:00:01 0.4533342 8 2 2000-01-01 00:00:02 0.8924190 3 1 2000-01-01 00:00:02 0.4089769 13 3 2000-01-01 00:00:02 0.6775706 And I want to get a timeSeries object or xts object like this: 1 2 3 2000-01-01 00:00:00 0.2875775 0.0455565 0.9568333 2000-01-01 00:00:01 NA 0.5281055 0.4533342 2000-01-01 00:00:02 0.4089769 0.8924190 0.6775706 2000-01-01 00:00:03 0.8830174 0.5514350 0.5726334 2000-01-01 00:00:04 NA NA 0.1029247 Input data can be sorted or unsorted (the most complicated case is in the example, unsorted and missing data) in the sense that I can sort in query if I can take an advantage from this. Some considerations: - Xts is generally faster than timeSeries - both accept a matrix so if I can create a matrix like the one represented above and an array of characters representing dates faster than what possible with xts:::cbind, for examole,I will have a faster implementation (package data.table ?). - create timeseries objects in multithread and then merge (package plyr ?) - faster merge algorithms? Below some code to generate the test case above: set.seed(123) N <- 5 # number of observations K <- 3 # number of timeseries ID X <- data.frame( ID = rep(1:K, each = N), DATE = as.character(rep(as.POSIXct("2000-01-01", tz = "GMT")+ 0:(N-1), K)), VALUE = runif(N*K), stringsAsFactors = FALSE) X <- X[sample(1:(N*K), N*K),] # sample observations to get random order (optional) X <- X[-(sample(1:nrow(X), floor(nrow(X)*0.2))),] # 20% missing head(X, 15) # use explicitly environments to avoid '<<-' buildTimeSeriesFromDataFrame <- function(x, env) { { if(exists("xx", envir = env)) # if exist variable xx in env cbind assign("xx", cbind(get("xx", env), timeSeries(x$VALUE, x$DATE, format = '%Y-%m-%d %H:%M:%S', zone = 'GMT', units = as.character(x$ID[1]))), envir = env) else # create xx in env assign("xx", timeSeries(x$VALUE, x$DATE, format = '%Y-%m-%d %H:%M:%S', zone = 'GMT', units = as.character(x$ID[1])), envir = env) return(TRUE) } } # use package plyr, faster than 'by' function tsDaply <- function(...) { library(plyr) e1 <- new.env(parent = baseenv()) #create a new env res <- daply(X, "ID", buildTimeSeriesFromDataFrame, env = e1) return(get("xx", e1)) # return xx from env } ##replicate 100 times #Time03 <- replicate(100, # system.time(tsDaply(X, X$ID))[[1]]) #median(Time03) # result tsDaply(X, X$ID) Thanks in advance for any input, best regards, Den ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.