Re: spark as a lookup engine for dedup

Shushant Arora Mon, 27 Jul 2015 01:23:25 -0700

It's for one day of events, on the order of 1 billion, and processing happens
in a streaming application with a ~10-15 sec interval, so the lookup should be
fast. The RDD needs to be updated with new events, and events older than 24
hours (before current time minus 24 hours) should be removed at each
processing interval.


So is a Spark RDD not a fit for this requirement?
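
To make the requirement concrete, here is a minimal plain-Java sketch of the bookkeeping described above: check whether an eventid was seen in the last 24 hours, record it if not, and evict anything older than the window. It is not Spark code. The class and method names (DedupWindow, markIfNew) are hypothetical, and it assumes timestamps arrive in non-decreasing order. A single in-memory map would not realistically hold a billion entries, so this only illustrates the semantics that a key-value store would have to provide.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sliding-window dedup store; illustrative only, not a Spark API. */
final class DedupWindow {
    /** One remembered event: its id and the time it was first seen. */
    private static final class Entry {
        final String eventId;
        final long timestampMs;

        Entry(String eventId, long timestampMs) {
            this.eventId = eventId;
            this.timestampMs = timestampMs;
        }
    }

    private final long windowMs;
    private final Map<String, Long> seen = new HashMap<>();
    // Entries in arrival order, so the oldest is always at the head.
    private final ArrayDeque<Entry> arrivalOrder = new ArrayDeque<>();

    DedupWindow(long windowMs) {
        this.windowMs = windowMs;
    }

    /**
     * Returns true if eventId has not been seen within the window, in which
     * case it is recorded and the caller should process the event.
     * Assumes nowMs never decreases between calls.
     */
    boolean markIfNew(String eventId, long nowMs) {
        evictOlderThan(nowMs - windowMs);
        if (seen.containsKey(eventId)) {
            return false; // duplicate inside the window: ignore
        }
        seen.put(eventId, nowMs);
        arrivalOrder.addLast(new Entry(eventId, nowMs));
        return true;
    }

    /** Drops every entry first seen strictly before cutoffMs. */
    private void evictOlderThan(long cutoffMs) {
        while (!arrivalOrder.isEmpty()
                && arrivalOrder.peekFirst().timestampMs < cutoffMs) {
            Entry oldest = arrivalOrder.removeFirst();
            seen.remove(oldest.eventId);
        }
    }

    int size() {
        return seen.size();
    }
}
```

With a window of 24 * 60 * 60 * 1000 ms, markIfNew(eventid, now) would be called once per incoming event, and the event processed only when it returns true.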

On Mon, Jul 27, 2015 at 1:08 PM, Romi Kuntsman <[email protected]> wrote:

> What is the throughput of processing, and for how long do you need to
> remember duplicates?
>
> You can take all the events, put them in an RDD, group by the key, and
> then process each key only once.
> But if you have a long-running application where you want to check that
> you didn't see the same value before, and check that for every value, you
> probably need a key-value store, not an RDD.
>
> On Sun, Jul 26, 2015 at 7:38 PM Shushant Arora <[email protected]>
> wrote:
>
>> Hi
>>
>> I have a requirement to process a large volume of events while ignoring
>> duplicates at the same time.
>>
>> Events are consumed from Kafka and each event has an eventid. It may
>> happen that an event has already been processed and arrives again at some
>> other offset.
>>
>> 1. Can I use a Spark RDD to persist processed events and then look up
>> against this RDD while processing new events? (How do I do a lookup inside
>> an RDD? I have a JavaPairRDD<eventid,timestamp>.) If an event is present in
>> the persisted RDD, ignore it; otherwise process the event. Will
>> rdd.lookup(key) on a billion events be efficient?
>>
>> 2. How do I update the RDD, given that an RDD is immutable?
>>
>> Thanks
>>
>>
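
On question 2, the usual way to "update" an immutable collection is to derive a new snapshot from the old one at each batch instead of modifying it in place. The plain-Java sketch below models that idea together with the "process each key only once" suggestion. It is not Spark code, and the names (SnapshotDedup, step, Result) are hypothetical. It assumes every event in a batch is stamped with the batch time. In Spark the rough analogue would be building a new pair RDD from the previous one each interval (for example with filter and union), which is an assumption about one possible approach, not something confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of per-batch dedup over immutable snapshots. */
final class SnapshotDedup {
    /** Outcome of one batch: ids to process and the next snapshot. */
    static final class Result {
        final List<String> toProcess;
        final Map<String, Long> nextSeen;

        Result(List<String> toProcess, Map<String, Long> nextSeen) {
            this.toProcess = toProcess;
            this.nextSeen = nextSeen;
        }
    }

    /**
     * Builds the next snapshot without mutating the previous one:
     * 1. keep only entries first seen at or after (nowMs - windowMs),
     * 2. add each batch id not already present, stamped with nowMs.
     * Ids that were added are returned as the events to process.
     */
    static Result step(Map<String, Long> seen, List<String> batchIds,
                       long nowMs, long windowMs) {
        long cutoffMs = nowMs - windowMs;
        Map<String, Long> next = new HashMap<>();
        for (Map.Entry<String, Long> e : seen.entrySet()) {
            if (e.getValue() >= cutoffMs) {
                next.put(e.getKey(), e.getValue());
            }
        }
        List<String> toProcess = new ArrayList<>();
        for (String id : batchIds) {
            // putIfAbsent returns null only when the id was not yet present,
            // which also filters duplicates inside the same batch.
            if (next.putIfAbsent(id, nowMs) == null) {
                toProcess.add(id);
            }
        }
        return new Result(Collections.unmodifiableList(toProcess),
                          Collections.unmodifiableMap(next));
    }
}
```

Each call returns a fresh snapshot, so the previous one stays valid and unchanged. The cost is that every batch touches the whole retained state, which is why a key-value store with direct lookups may suit a 24-hour window of this size better.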
