Hi,

I have a daily batch job that computes a daily aggregate of several counters, represented by some object. After the daily aggregation is done, I want to compute multi-day block aggregations (3, 7, 30 days, etc.). To do so I need to add the new daily aggregation to the current block and then subtract from the current block the daily aggregation of the oldest day within that block (a sliding window). I've implemented it with something like:

baseBlockRdd.leftOuterJoin(lastDayRdd).map(subtraction).fullOuterJoin(newDayRdd).map(addition)

All RDDs are keyed by a unique id (Long). Each RDD is saved to Avro files after the job finishes and loaded when the job starts (on the next day). baseBlockRdd is much larger than the lastDay and newDay RDDs (how much larger depends on the size of the block).
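To make that concrete, here is a simplified sketch of what the job does. The Counters case class and the slide method are just placeholders for my real counter object and job code:

import org.apache.spark.rdd.RDD

// Placeholder counter object -- the real class has more counters than this.
case class Counters(hits: Long, errors: Long) {
  def +(other: Counters): Counters = Counters(hits + other.hits, errors + other.errors)
  def -(other: Counters): Counters = Counters(hits - other.hits, errors - other.errors)
}

// Slide the block forward by one day: drop the oldest day, add the new one.
// All three RDDs are keyed by the unique id.
def slide(baseBlock: RDD[(Long, Counters)],
          lastDay: RDD[(Long, Counters)],
          newDay: RDD[(Long, Counters)]): RDD[(Long, Counters)] = {
  val zero = Counters(0L, 0L)
  baseBlock
    .leftOuterJoin(lastDay)            // subtract the day that falls out of the window
    .mapValues { case (block, oldDay) => block - oldDay.getOrElse(zero) }
    .fullOuterJoin(newDay)             // add the new day (it may contain new ids)
    .mapValues { case (block, day) => block.getOrElse(zero) + day.getOrElse(zero) }
}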
Unfortunately the performance is not satisfactory due to the many shuffles (I have tuned parallelism etc.). I'm looking for a way to improve this, so that one task "joins" the same local keys without reshuffling baseBlockRdd (which is big) every time the job starts (see https://spark-project.atlassian.net/browse/SPARK-1061 as a related issue). Bottom line: how can I join a big RDD with a smaller RDD without reshuffling the big RDD over and over again?

Once I've saved this big RDD and reloaded it from disk, I'd like every other RDD to be partitioned and co-located by the same partitioner (which is absent for a HadoopRDD), so that only the small RDDs are sent over the network. A rough sketch of what I have in mind is at the end of this message.

Another idea I had: split baseBlock into two parts by filtering on the keys of the small RDDs and then join, but I'm not sure it's possible to implement that filter without a join.

Any ideas would be appreciated. Thanks in advance,
Igor
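This is roughly the kind of thing I mean, reusing the Counters placeholder from the sketch above (slideWithPartitioner, the partition count, and the storage level are only illustrative choices, not something I've benchmarked):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Re-establish an explicit partitioner on the big block after reloading it from
// Avro. The reload still costs one shuffle per run, because a HadoopRDD carries
// no partitioner info (SPARK-1061), but once the block is partitioned and
// persisted, joins that use the same partitioner keep the big side local and
// only shuffle the small day RDDs.
def slideWithPartitioner(loadedBaseBlock: RDD[(Long, Counters)],
                         lastDay: RDD[(Long, Counters)],
                         newDay: RDD[(Long, Counters)],
                         numPartitions: Int): RDD[(Long, Counters)] = {
  val partitioner = new HashPartitioner(numPartitions)
  val zero = Counters(0L, 0L)

  val base = loadedBaseBlock
    .partitionBy(partitioner)                  // one shuffle of the big RDD per run
    .persist(StorageLevel.MEMORY_AND_DISK)

  base
    .leftOuterJoin(lastDay, partitioner)       // only lastDay is shuffled here
    .mapValues { case (block, oldDay) => block - oldDay.getOrElse(zero) }
    .fullOuterJoin(newDay, partitioner)        // only newDay is shuffled here
    .mapValues { case (block, day) => block.getOrElse(zero) + day.getOrElse(zero) }
}

The filter idea would presumably look like collecting the small RDDs' keys into a broadcast set and filtering baseBlock before the join, assuming those key sets are small enough to fit on the driver.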