How large are the individual time series? -s
On 07.04.2015 12:42, Kostas Tzoumas wrote:
Hi everyone, I'm forwarding a private conversation to the list with Mats' approval. The problem is how to compute correlation between time series in Flink. We have two time series, U and V, and need to compute 1000 correlation measures between the series, each measure shifts one series by one more item: corr(U[0:N], V[n:N+n]) for n=0 to n=1000. Any ideas on how one can do that without a Cartesian product? Best, Kostas ---------- Forwarded message ---------- From: *Mats Zachrison* <mats.zachri...@ericsson.com <mailto:mats.zachri...@ericsson.com>> Date: Tue, Mar 31, 2015 at 9:21 AM Subject: To: Kostas Tzoumas <kos...@data-artisans.com <mailto:kos...@data-artisans.com>>, Stefan Avesand <stefan.aves...@ericsson.com <mailto:stefan.aves...@ericsson.com>> Cc: "step...@data-artisans.com <mailto:step...@data-artisans.com>" <step...@data-artisans.com <mailto:step...@data-artisans.com>> As Stefan said, what I’m trying to achieve is basically a nice way to do a correlation between two large time series. Since I’m looking for an optimal delay between the two series, I’d like to delay one of the series x observations when doing the correlation, and step x from 1 to 1000.____ __ __ Some pseudo code:____ __ __ For (x = 1 to 1000)____ Shift Series A ‘x-1’ steps____ Correlation[x] = Correlate(Series A and Series B)____ End For____ __ __ In R, using cor() and apply(), this could look like:____ __ __ shift <- as.array(c(1:1000))____ corrAB <- apply(shift, 1, function(x) cor(data[x:nrow(data), ]$ColumnA, data[1:(nrow(data) - (x - 1)), ]$ColumnB))____ __ __ __ __ Since this basically is 1000 independent correlation calculations, it is fairly easy to parallelize. Here is an R example using foreach() and package doParallel:____ __ __ cl <- makeCluster(3)____ registerDoParallel(cl)____ corrAB <- foreach(step = c(1:1000)) %dopar% {____ corrAB <- cor(data[step:nrow(data), ]$ColumnA, data[1:(nrow(data) - (step - 1)), ]$ColumnB)____ }____ stopCluster(cl)____ __ __ So I guess the question is – how to do this in a Flink environment? Do we have to define how to parallelize the algorithm, or can the cluster take care of that for us?____ __ __ And of course this is most interesting on a generic level – given the environment of a multi-core or –processor setup running Flink, how hard is it to take advantage of all the clock cycles? Do we have to split the algorithm, and data, and distribute the processing, or can the system do much of that for us?____ __ __ __ __