Hi all,

*We have an RDD (main) of sorted time-series data. We want to split it into
different RDDs according to a window size and then perform some aggregation
operation, like max, min, etc., over each RDD in parallel.*

If the window size is w, then the ith RDD has data from (startTime + (i-1)*w)
to (startTime + i*w), where startTime is the time-stamp of the 1st entry in
the main RDD; windows are generated until (startTime + (i-1)*w) is greater
than the time-stamp of the last entry of the main RDD.
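
For example (illustrative numbers only), the window index of an entry can be
computed directly from its time-stamp:

// Sketch with made-up values: startTime = 1000, w = 60 (seconds)
long startTime = 1000L;   // time-stamp of 1st entry in main RDD
long w = 60L;             // window size
long ts = 1130L;          // time-stamp of some entry

long i = (ts - startTime) / w + 1;  // i = 3, i.e. window [1120, 1180)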

For now, I am using the DataFrame API with Spark version 1.5.2. The following
code runs sequentially over the data, so execution time is high and resource
utilization is low:

/*
 * aggregator is max
 * df - DataFrame with sorted time-series data
 * start - first entry of DataFrame
 * end - last entry of DataFrame df
 * bucketLengthSec - window size
 * stepResults - holds a particular block/window output (JSON)
 * appendResults - holds output up to this block/window (JSON)
 */
while (start <= end) {
    row = df.filter(df.col("timeStamp").between(start, nextStart))
            .agg(max(df.col("timeStamp")), max(df.col("value")))
            .first();
    if (row.get(0) != null) {
        stepResults = new JSONObject();
        stepResults.put("x", Long.parseLong(row.get(0).toString()));
        stepResults.put("y", row.get(1));
        appendResults.add(stepResults);
    }
    start = nextStart;
    nextStart = start + bucketLengthSec;
}
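
To illustrate what we are hoping Spark can parallelize, here is a rough
sketch of the same windows expressed as one grouped aggregation instead of
the loop (assuming timeStamp is a numeric epoch-seconds column; the "bucket"
column name is just for illustration):

import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.*;

// Assign each row to its window: bucket i covers
// [startTime + i*bucketLengthSec, startTime + (i+1)*bucketLengthSec)
DataFrame withBucket = df.withColumn("bucket",
        floor(df.col("timeStamp").minus(lit(start))
                .divide(lit(bucketLengthSec))));

// One aggregation over all windows at once, instead of one job per window
DataFrame perWindow = withBucket.groupBy("bucket")
        .agg(max("timeStamp").as("x"), max("value").as("y"));

Would an approach like this distribute the work, or is there a better way?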


-- 
Thanks and Regards,
Arun Verma
