Re: Fwd: External Talk: Apache Flink - Speakers: Kostas Tzoumas (CEO dataArtisans), Stephan Ewen (CTO dataArtisans)

Sebastian Tue, 07 Apr 2015 04:30:28 -0700

How large are the individual time series?

-s


On 07.04.2015 12:42, Kostas Tzoumas wrote:

Hi everyone,

I'm forwarding a private conversation to the list with Mats' approval.

The problem is how to compute correlation between time series in Flink.
We have two time series, U and V, and need to compute 1000 correlation
measures between the series, each measure shifts one series by one more
item: corr(U[0:N], V[n:N+n]) for n=0 to n=1000.

Any ideas on how one can do that without a Cartesian product?

Best,
Kostas

---------- Forwarded message ----------
From: *Mats Zachrison* <mats.zachri...@ericsson.com
<mailto:mats.zachri...@ericsson.com>>
Date: Tue, Mar 31, 2015 at 9:21 AM
Subject:
To: Kostas Tzoumas <kos...@data-artisans.com
<mailto:kos...@data-artisans.com>>, Stefan Avesand
<stefan.aves...@ericsson.com <mailto:stefan.aves...@ericsson.com>>
Cc: "step...@data-artisans.com <mailto:step...@data-artisans.com>"
<step...@data-artisans.com <mailto:step...@data-artisans.com>>

As Stefan said, what I’m trying to achieve is basically a nice way to do
a correlation between two large time series. Since I’m looking for an
optimal delay between the two series, I’d like to delay one of the
series x observations when doing the correlation, and step x from 1 to
1000.____

__ __

Some pseudo code:____

__ __

   For (x = 1 to 1000)____

       Shift Series A ‘x-1’ steps____

       Correlation[x] = Correlate(Series A and Series B)____

   End For____

__ __

In R, using cor() and apply(), this could look like:____

__ __

   shift <- as.array(c(1:1000))____

   corrAB <- apply(shift, 1, function(x) cor(data[x:nrow(data),
]$ColumnA, data[1:(nrow(data) - (x - 1)), ]$ColumnB))____

__ __

__ __

Since this basically is 1000 independent correlation calculations, it is
fairly easy to parallelize. Here is an R example using foreach() and
package doParallel:____

__ __

   cl <- makeCluster(3)____

   registerDoParallel(cl)____

   corrAB <- foreach(step = c(1:1000)) %dopar% {____

         corrAB <- cor(data[step:nrow(data), ]$ColumnA,
data[1:(nrow(data) - (step - 1)), ]$ColumnB)____

   }____

   stopCluster(cl)____

__ __

So I guess the question is – how to do this in a Flink environment? Do
we have to define how to parallelize the algorithm, or can the cluster
take care of that for us?____

__ __

And of course this is most interesting on a generic level – given the
environment of a multi-core or –processor setup running Flink, how hard
is it to take advantage of all the clock cycles? Do we have to split the
algorithm, and data, and distribute the processing, or can the system do
much of that for us?____

__


__ __

__

Re: Fwd: External Talk: Apache Flink - Speakers: Kostas Tzoumas (CEO dataArtisans), Stephan Ewen (CTO dataArtisans)

Reply via email to