Hi All, I appreciate the help :)
Here is sample code where I am trying to keep the data of both the previous RDD and the current RDD inside a foreachRDD in Spark Streaming. My first attempt would not compile because ABC was an inner class referenced from static main(); the version below makes ABC a static nested class and keeps prevlist and currentlist as static fields of Type4ViolationChecker, so the previous batch's data survives across calls. This is the furthest I got. You can imagine another scenario where I keep a historical list so that, if I see a certain "order" of events, I store them.

    import java.util.Date;
    import java.util.List;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function;
    import scala.Tuple2;

    // inside static main():
    sortedtsStream.foreachRDD(new ABC());

    // Static fields of Type4ViolationChecker, so state is kept between invocations of call().
    static List<Tuple2<Tuple2<Long, Integer>, Integer>> prevlist = null;
    static List<Tuple2<Tuple2<Long, Integer>, Integer>> currentlist = null;

    // Static nested class so it can be instantiated from static main().
    static class ABC implements Function<JavaPairRDD<Tuple2<Long, Integer>, Integer>, Void> {
        @Override
        public Void call(JavaPairRDD<Tuple2<Long, Integer>, Integer> tuple2IntegerJavaPairRDD) throws Exception {
            // Bring the batch back to the driver so it can be compared with the previous one.
            List<Tuple2<Tuple2<Long, Integer>, Integer>> list = tuple2IntegerJavaPairRDD.collect();
            if (prevlist != null && currentlist != null) {
                prevlist = currentlist;   // last batch becomes "previous"
                currentlist = list;
            } else {
                // first batch: previous and current start out the same
                currentlist = list;
                prevlist = list;
            }
            System.out.println("Printing previous");
            for (Tuple2<Tuple2<Long, Integer>, Integer> tuple : prevlist) {
                Date date = new Date(tuple._1._1);
                int pattern = tuple._1._2;
                int count = tuple._2;
                System.out.println("TimeSlot: " + date.toString() + " Pattern: " + pattern + " Count: " + count);
            }
            System.out.println("Printing current");
            for (Tuple2<Tuple2<Long, Integer>, Integer> tuple : currentlist) {
                Date date = new Date(tuple._1._1);
                int pattern = tuple._1._2;
                int count = tuple._2;
                System.out.println("TimeSlot: " + date.toString() + " Pattern: " + pattern + " Count: " + count);
            }
            return null;
        }
    }

I have also appended two small sketches at the very bottom of this mail, below the quoted thread: one for the updateStateByKey suggestion and one for the remember/checkpoint settings.

Thanks
Nipun

On Thu, Jun 18, 2015 at 11:26 AM, twinkle sachdeva <twinkle.sachd...@gmail.com> wrote:

> Hi,
>
> UpdateStateByKey: if you can briefly describe the issue you are facing with it, that would be great.
>
> Regarding not keeping the whole dataset in memory, you can tweak the remember parameter so that it checkpoints at the appropriate time.
>
> Thanks
> Twinkle
>
> On Thursday, June 18, 2015, Nipun Arora <nipunarora2...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am updating my question to give more detail. I have also created a Stack Overflow question:
>> http://stackoverflow.com/questions/30904244/iterative-programming-on-an-ordered-spark-stream-using-java-in-spark-streaming
>>
>> Is there any way in Spark Streaming to keep data across multiple micro-batches of a sorted DStream, where the stream is sorted using timestamps? (Assume monotonically arriving data.) Can anyone make suggestions on how to keep data across iterations, where each iteration is an RDD being processed in a JavaDStream?
>>
>> *What does iteration mean?*
>>
>> I first sort the DStream using timestamps, assuming that data arrives with monotonically increasing timestamps (no out-of-order data).
>>
>> I need a global HashMap X, which I would like to update using values with timestamp "t1", and then subsequently "t1+1". Since the state of X itself impacts the calculations, it needs to be a linear (sequential) operation. Hence the operation at "t1+1" depends on HashMap X, which in turn depends on data at and before "t1".
>>
>> *Application*
>>
>> This comes up especially when one is trying to update a model, compare two sets of RDDs, or keep a global history of certain events, etc., which will impact operations in future iterations.
>>
>> I would like to keep some accumulated history to make calculations: not the entire dataset, but to persist certain events which can be used in future DStream RDDs.
>>
>> Thanks
>> Nipun
>>
>> On Wed, Jun 17, 2015 at 11:15 PM, Nipun Arora <nipunarora2...@gmail.com> wrote:
>>
>>> Hi Silvio,
>>>
>>> Thanks for your response.
>>> I should clarify: I would like to do updates on a structure iteratively, and I am not sure if updateStateByKey meets my criteria.
>>>
>>> In the current situation, I can run some map-reduce tasks and generate a JavaPairDStream<Key, Value>. After this, my algorithm is necessarily sequential, i.e. I have sorted the data using the timestamp (within the messages), and I would like to iterate over it and maintain a state in which I can update a model.
>>>
>>> I tried using foreach/foreachRDD and collect to do this, but I can't seem to propagate values across micro-batches/RDDs.
>>>
>>> Any suggestions?
>>>
>>> Thanks
>>> Nipun
>>>
>>> On Wed, Jun 17, 2015 at 10:52 PM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:
>>>
>>>> Hi, just answered in your other thread as well...
>>>>
>>>> Depending on your requirements, you can look at the updateStateByKey API.
>>>>
>>>> From: Nipun Arora
>>>> Date: Wednesday, June 17, 2015 at 10:51 PM
>>>> To: "user@spark.apache.org"
>>>> Subject: Iterative Programming by keeping data across micro-batches in spark-streaming?
>>>>
>>>> Hi,
>>>>
>>>> Is there any way in Spark Streaming to keep data across multiple micro-batches, like in a HashMap or something?
>>>> Can anyone make suggestions on how to keep data across iterations, where each iteration is an RDD being processed in a JavaDStream?
>>>>
>>>> This is especially the case when I am trying to update a model or compare two sets of RDDs, or keep a global history of certain events, etc., which will impact operations in future iterations.
>>>> I would like to keep some accumulated history to make calculations: not the entire dataset, but to persist certain events which can be used in future JavaDStream RDDs.
>>>>
>>>> Thanks
>>>> Nipun
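
P.S. To make the updateStateByKey suggestion concrete, here is a minimal sketch of how I understand it would be used to keep a running per-pattern count across micro-batches. This is not the code from above: the socket source, the "pattern,count" line format, and all class and variable names here are purely illustrative assumptions, and updateStateByKey also requires a checkpoint directory to be set.

    import java.util.List;
    import com.google.common.base.Optional;   // the Spark 1.x Java API uses Guava's Optional here
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class StateSketch {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("StateSketch").setMaster("local[2]");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
            jssc.checkpoint("/tmp/spark-checkpoint");   // required by updateStateByKey; path is illustrative

            // Illustrative source: lines of the form "pattern,count" arriving on a socket.
            JavaPairDStream<Integer, Integer> patternCounts = jssc
                .socketTextStream("localhost", 9999)
                .mapToPair(new PairFunction<String, Integer, Integer>() {
                    @Override
                    public Tuple2<Integer, Integer> call(String line) {
                        String[] parts = line.split(",");
                        return new Tuple2<Integer, Integer>(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
                    }
                });

            // State = running total per pattern; Spark carries it from one micro-batch to the next.
            JavaPairDStream<Integer, Long> runningCounts = patternCounts.updateStateByKey(
                new Function2<List<Integer>, Optional<Long>, Optional<Long>>() {
                    @Override
                    public Optional<Long> call(List<Integer> newCounts, Optional<Long> state) {
                        long total = state.or(0L);    // previous state, or 0 on the first batch
                        for (Integer c : newCounts) {
                            total += c;
                        }
                        return Optional.of(total);    // becomes the state seen in the next batch
                    }
                });

            runningCounts.print();
            jssc.start();
            jssc.awaitTermination();
        }
    }

My remaining doubt, as described above, is that the update function only sees the new values for one key plus that key's previous state, so a strictly ordered, cross-key model update may still need something more.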
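
Twinkle, regarding the remember parameter: is the snippet below roughly what you meant? The duration and the checkpoint path are only placeholders I picked for illustration.

    // In the driver, when setting up the streaming context (values are illustrative):
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
    jssc.checkpoint("hdfs:///tmp/spark-checkpoint");  // directory where streaming state and metadata are checkpointed
    jssc.remember(Durations.minutes(5));              // each DStream keeps the RDDs it generated in the last 5 minutes

If I understand correctly, this way the whole history never has to stay in memory: only the last few minutes of generated RDDs, plus whatever state updateStateByKey maintains.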