Hi All, I appreciate the help :)
Here is sample code where I am trying to keep the data of both the previous RDD and the current RDD inside a foreachRDD in Spark Streaming. My first attempt would not compile because ABC was an inner class referenced from static main(); the version below makes ABC a static nested class and keeps prevlist and currentlist as static fields of Type4ViolationChecker, so the previous batch's data survives across calls. This is the furthest I got. You can imagine another scenario where I keep a historical list so that, if I see a certain "order" of events, I store them.

    import java.util.Date;
    import java.util.List;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function;
    import scala.Tuple2;

    // inside static main():
    sortedtsStream.foreachRDD(new ABC());

    // Static fields of Type4ViolationChecker, so state is kept between invocations of call().
    static List<Tuple2<Tuple2<Long, Integer>, Integer>> prevlist = null;
    static List<Tuple2<Tuple2<Long, Integer>, Integer>> currentlist = null;

    // Static nested class so it can be instantiated from static main().
    static class ABC implements Function<JavaPairRDD<Tuple2<Long, Integer>, Integer>, Void> {
        @Override
        public Void call(JavaPairRDD<Tuple2<Long, Integer>, Integer> tuple2IntegerJavaPairRDD) throws Exception {
            // Bring the batch back to the driver so it can be compared with the previous one.
            List<Tuple2<Tuple2<Long, Integer>, Integer>> list = tuple2IntegerJavaPairRDD.collect();
            if (prevlist != null && currentlist != null) {
                prevlist = currentlist;   // last batch becomes "previous"
                currentlist = list;
            } else {
                // first batch: previous and current start out the same
                currentlist = list;
                prevlist = list;
            }
            System.out.println("Printing previous");
            for (Tuple2<Tuple2<Long, Integer>, Integer> tuple : prevlist) {
                Date date = new Date(tuple._1._1);
                int pattern = tuple._1._2;
                int count = tuple._2;
                System.out.println("TimeSlot: " + date.toString() + " Pattern: " + pattern + " Count: " + count);
            }
            System.out.println("Printing current");
            for (Tuple2<Tuple2<Long, Integer>, Integer> tuple : currentlist) {
                Date date = new Date(tuple._1._1);
                int pattern = tuple._1._2;
                int count = tuple._2;
                System.out.println("TimeSlot: " + date.toString() + " Pattern: " + pattern + " Count: " + count);
            }
            return null;
        }
    }

I have also appended two small sketches at the very bottom of this mail, below the quoted thread: one for the updateStateByKey suggestion and one for the remember/checkpoint settings.

Thanks
Nipun

On Thu, Jun 18, 2015 at 11:26 AM, twinkle sachdeva <twinkle.sachd...@gmail.com> wrote:

> Hi,
>
> UpdateStateByKey: if you can briefly describe the issue you are facing with it, that would be great.
>
> Regarding not keeping the whole dataset in memory, you can tweak the remember parameter so that it checkpoints at the appropriate time.
>
> Thanks
> Twinkle
>
> On Thursday, June 18, 2015, Nipun Arora <nipunarora2...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am updating my question to give more detail. I have also created a Stack Overflow question:
>> http://stackoverflow.com/questions/30904244/iterative-programming-on-an-ordered-spark-stream-using-java-in-spark-streaming
>>
>> Is there any way in Spark Streaming to keep data across multiple micro-batches of a sorted DStream, where the stream is sorted using timestamps? (Assume monotonically arriving data.) Can anyone make suggestions on how to keep data across iterations, where each iteration is an RDD being processed in a JavaDStream?
>>
>> *What does iteration mean?*
>>
>> I first sort the DStream using timestamps, assuming that data arrives with monotonically increasing timestamps (no out-of-order data).
>>
>> I need a global HashMap X, which I would like to update using values with timestamp "t1", and then subsequently "t1+1". Since the state of X itself impacts the calculations, it needs to be a linear (sequential) operation. Hence the operation at "t1+1" depends on HashMap X, which in turn depends on data at and before "t1".
>>
>> *Application*
>>
>> This comes up especially when one is trying to update a model, compare two sets of RDDs, or keep a global history of certain events, etc., which will impact operations in future iterations.
>>
>> I would like to keep some accumulated history to make calculations: not the entire dataset, but to persist certain events which can be used in future DStream RDDs.
>>
>> Thanks
>> Nipun
>>
>> On Wed, Jun 17, 2015 at 11:15 PM, Nipun Arora <nipunarora2...@gmail.com> wrote:
>>
>>> Hi Silvio,
>>>
>>> Thanks for your response.
>>> I should clarify: I would like to do updates on a structure iteratively, and I am not sure if updateStateByKey meets my criteria.
>>>
>>> In the current situation, I can run some map-reduce tasks and generate a JavaPairDStream<Key, Value>. After this, my algorithm is necessarily sequential, i.e. I have sorted the data using the timestamp (within the messages), and I would like to iterate over it and maintain a state in which I can update a model.
>>>
>>> I tried using foreach/foreachRDD and collect to do this, but I can't seem to propagate values across micro-batches/RDDs.
>>>
>>> Any suggestions?
>>>
>>> Thanks
>>> Nipun
>>>
>>> On Wed, Jun 17, 2015 at 10:52 PM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:
>>>
>>>> Hi, just answered in your other thread as well...
>>>>
>>>> Depending on your requirements, you can look at the updateStateByKey API.
>>>>
>>>> From: Nipun Arora
>>>> Date: Wednesday, June 17, 2015 at 10:51 PM
>>>> To: "user@spark.apache.org"
>>>> Subject: Iterative Programming by keeping data across micro-batches in spark-streaming?
>>>>
>>>> Hi,
>>>>
>>>> Is there any way in Spark Streaming to keep data across multiple micro-batches, like in a HashMap or something?
>>>> Can anyone make suggestions on how to keep data across iterations, where each iteration is an RDD being processed in a JavaDStream?
>>>>
>>>> This is especially the case when I am trying to update a model or compare two sets of RDDs, or keep a global history of certain events, etc., which will impact operations in future iterations.
>>>> I would like to keep some accumulated history to make calculations: not the entire dataset, but to persist certain events which can be used in future JavaDStream RDDs.
>>>>
>>>> Thanks
>>>> Nipun
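
P.S. To make the updateStateByKey suggestion concrete, here is a minimal sketch of how I understand it would be used to keep a running per-pattern count across micro-batches. This is not the code from above: the socket source, the "pattern,count" line format, and all class and variable names here are purely illustrative assumptions, and updateStateByKey also requires a checkpoint directory to be set.

    import java.util.List;
    import com.google.common.base.Optional;   // the Spark 1.x Java API uses Guava's Optional here
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class StateSketch {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("StateSketch").setMaster("local[2]");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
            jssc.checkpoint("/tmp/spark-checkpoint");   // required by updateStateByKey; path is illustrative

            // Illustrative source: lines of the form "pattern,count" arriving on a socket.
            JavaPairDStream<Integer, Integer> patternCounts = jssc
                .socketTextStream("localhost", 9999)
                .mapToPair(new PairFunction<String, Integer, Integer>() {
                    @Override
                    public Tuple2<Integer, Integer> call(String line) {
                        String[] parts = line.split(",");
                        return new Tuple2<Integer, Integer>(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
                    }
                });

            // State = running total per pattern; Spark carries it from one micro-batch to the next.
            JavaPairDStream<Integer, Long> runningCounts = patternCounts.updateStateByKey(
                new Function2<List<Integer>, Optional<Long>, Optional<Long>>() {
                    @Override
                    public Optional<Long> call(List<Integer> newCounts, Optional<Long> state) {
                        long total = state.or(0L);    // previous state, or 0 on the first batch
                        for (Integer c : newCounts) {
                            total += c;
                        }
                        return Optional.of(total);    // becomes the state seen in the next batch
                    }
                });

            runningCounts.print();
            jssc.start();
            jssc.awaitTermination();
        }
    }

My remaining doubt, as described above, is that the update function only sees the new values for one key plus that key's previous state, so a strictly ordered, cross-key model update may still need something more.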
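
Twinkle, regarding the remember parameter: is the snippet below roughly what you meant? The duration and the checkpoint path are only placeholders I picked for illustration.

    // In the driver, when setting up the streaming context (values are illustrative):
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
    jssc.checkpoint("hdfs:///tmp/spark-checkpoint");  // directory where streaming state and metadata are checkpointed
    jssc.remember(Durations.minutes(5));              // each DStream keeps the RDDs it generated in the last 5 minutes

If I understand correctly, this way the whole history never has to stay in memory: only the last few minutes of generated RDDs, plus whatever state updateStateByKey maintains.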