Hi all,

We have an RDD (main) of sorted time-series data. We want to split it into different RDDs according to a window size and then perform aggregation operations such as max, min, etc. over each RDD in parallel. If the window size is w, the ith RDD holds the data from (startTime + (i-1)*w) to (startTime + i*w), where startTime is the time-stamp of the first entry in the main RDD; splitting continues until (startTime + (i-1)*w) is greater than the time-stamp of the last entry of the main RDD.
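To pin down the window arithmetic, here is a minimal sketch of the index each record falls into, treating each window as half-open; the class and method names are just for illustration, and all values are assumed to be epoch seconds held in longs:

// Illustrative helper (not from my code): maps a time-stamp to its
// 1-based window index, with window i covering
// [startTime + (i-1)*w, startTime + i*w).
public final class WindowIndex {
    static long of(long timeStamp, long startTime, long w) {
        return ((timeStamp - startTime) / w) + 1;
    }
}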
For now, I am using DataFrames and Spark version 1.5.2. The code below runs sequentially over the data, so execution time is high and resource utilization is low. Code snippet:

/*
 * aggregator is max
 * df - DataFrame holding the sorted time-series data
 * start - time-stamp of the first entry of df
 * end - time-stamp of the last entry of df
 * bucketLengthSec - window size
 * stepResults - output (JSON) of the current block/window
 * appendResults - output (JSON) accumulated up to the current block/window
 */
nextStart = start + bucketLengthSec;
while (start <= end) {
    Row row = df.filter(df.col("timeStamp").between(start, nextStart))
                .agg(max(df.col("timeStamp")), max(df.col("value")))
                .first();
    if (row.get(0) != null) {
        stepResults = new JSONObject();
        stepResults.put("x", Long.parseLong(row.get(0).toString()));
        stepResults.put("y", row.get(1));
        appendResults.add(stepResults);
    }
    start = nextStart;
    nextStart = start + bucketLengthSec;
}
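Rather than filtering the DataFrame once per window, I am wondering whether deriving the window index as a column and doing a single groupBy would let Spark aggregate all the windows in one parallel pass. A rough sketch of what I mean (assuming timeStamp holds epoch seconds as a long; the column name "bucket" and the variable perWindow are just for illustration):

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.DataFrame;

// Sketch of a one-pass alternative: tag each row with its window index,
// then aggregate every window in a single parallel groupBy.
// start and bucketLengthSec are the same long values as in the loop above.
DataFrame perWindow = df
    .withColumn("bucket",
        floor(df.col("timeStamp").minus(lit(start))
                .divide(lit(bucketLengthSec))))
    .groupBy("bucket")
    .agg(max("timeStamp"), max("value"));

Each row of perWindow would then correspond to one stepResults entry. Note that floor() makes the windows half-open, while between() is inclusive at both ends, so boundary rows would land in exactly one window here. Does this look like the right direction on 1.5.2, or is there a better way?

--
Thanks and Regards,
Arun Verma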