It's for one day of events, in the range of 1 billion, and processing happens in a streaming application with a ~10-15 sec interval, so the lookup should be fast. The RDD needs to be updated with new events, and events older than 24 hours before the current time should be removed at each processing run.
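
To make the requirement concrete, here is a minimal sketch of the behaviour I am after, written as a plain in-memory Java structure rather than with Spark. The class and method names (SeenEvents, markIfNew, evictOlderThan) are hypothetical, and it assumes timestamps are passed in non-decreasing order (for example, the batch processing time), so the oldest entries always sit at the head of the map. It is only an illustration of the semantics, not something that would hold a billion ids in one JVM.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: remembers event ids for a fixed time window.
// Assumes timestamps are supplied in non-decreasing order.
final class SeenEvents {
    private final long ttlMillis;
    // Insertion-ordered, so the oldest recorded event is always first.
    private final LinkedHashMap<String, Long> seen = new LinkedHashMap<>();

    SeenEvents(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Returns true and records the event if it was not seen within the
    // window; returns false for a duplicate.
    boolean markIfNew(String eventId, long timestampMillis) {
        Long previous = seen.get(eventId);
        if (previous != null && timestampMillis - previous < ttlMillis) {
            return false;
        }
        // Remove first so a re-inserted id moves to the tail of the map.
        seen.remove(eventId);
        seen.put(eventId, timestampMillis);
        return true;
    }

    // Drops entries that are at least ttlMillis old; returns how many
    // were removed. Stops at the first entry that is still fresh.
    int evictOlderThan(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = seen.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() < ttlMillis) {
                break;
            }
            it.remove();
            removed++;
        }
        return removed;
    }

    int size() {
        return seen.size();
    }
}
```

Per batch, the flow would be: call evictOlderThan(now) once, then call markIfNew(eventid, now) for each incoming event and process only those that return true.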
So is a Spark RDD not a fit for this requirement?

On Mon, Jul 27, 2015 at 1:08 PM, Romi Kuntsman <[email protected]> wrote:

> What is the throughput of processing, and for how long do you need to
> remember duplicates?
>
> You can take all the events, put them in an RDD, group by the key, and
> then process each key only once.
> But if you have a long-running application where you want to check that
> you didn't see the same value before, and check that for every value, you
> probably need a key-value store, not an RDD.
>
> On Sun, Jul 26, 2015 at 7:38 PM Shushant Arora <[email protected]>
> wrote:
>
>> Hi
>>
>> I have a requirement for processing a large number of events while
>> ignoring duplicates at the same time.
>>
>> Events are consumed from Kafka and each event has an eventid. It may
>> happen that an event has already been processed and arrives again at
>> some other offset.
>>
>> 1. Can I use a Spark RDD to persist processed events and then look up
>> in this RDD while processing new events (how do I do a lookup inside an
>> RDD? I have a JavaPairRDD<eventid,timestamp>), and if the event is
>> present in the persisted RDD ignore it, else process the event? Will
>> rdd.lookup(key) on billions of events be efficient?
>>
>> 2. How do I update the RDD, since an RDD is immutable?
>>
>> Thanks
