Re: Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
Thanks again, Cody, for your answer. The idea here is to process all events, but only launch the big job (the one that takes longer than the batch interval) if they are the last events for an id, given the current state of the data. Knowing whether they are the last is in fact my issue. So I think I need two jobs: one to track the latest event per id, and one to run the heavy processing.
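One way to implement the tracking half inside Spark Streaming itself would be updateStateByKey; a minimal sketch, assuming events arrive as (id, timestamp) pairs (the names are assumptions, not from the thread):

    import org.apache.spark.streaming.dstream.DStream

    // Keeps, per id, the newest timestamp seen so far across batches.
    // Note: updateStateByKey requires ssc.checkpoint(dir) to be set.
    def trackLatest(events: DStream[(String, Long)]): DStream[(String, Long)] =
      events.updateStateByKey[Long] { (newTs: Seq[Long], state: Option[Long]) =>
        Some((state.toSeq ++ newTs).max) // never empty: state or new data exists
      }

An event would then count as "the last" if its timestamp matches the tracked state for its id.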

Re: Spark Streaming: process only last events

2016-01-06 Thread Cody Koeninger
If your job consistently takes longer than the batch time to process, you will keep lagging further and further behind. That's not sustainable; you need to increase the batch size or decrease processing time. In your case, probably increase the batch size, since you're pre-filtering it down to only one event per key anyway.
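For reference, the batch interval is fixed when the StreamingContext is created; a sketch with an assumed 30-second interval:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("last-events") // app name is illustrative
    // 30s is an assumed value: large enough that one batch finishes
    // processing before the next batch is due.
    val ssc = new StreamingContext(conf, Seconds(30))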

Re: Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
The following is my understanding of Spark Streaming; I could be wrong: Spark Streaming processes data from a stream in micro-batches, one at a time. When processing takes too long, the DStream's RDDs accumulate. So in my case (my processing takes time), the DStream's RDDs accumulate. What I want is to skip those accumulated intermediate events and process only the last one per key.

Re: Spark Streaming: process only last events

2016-01-06 Thread Cody Koeninger
If you don't have hot users, you can use the user id as the hash key when publishing into Kafka. That will put all events for a given user in the same partition per batch. Then you can do foreachPartition with a local map that stores just a single event per user, e.g. foreachPartition { p => val m = … }
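Fleshing out that snippet into a runnable shape (the Event type and runBigJob are stand-ins, not from the thread):

    import scala.collection.mutable

    // rdd: RDD[(String, Event)] with the user id as the Kafka message key,
    // so all events for a user sit in the same partition, in offset order.
    rdd.foreachPartition { p =>
      val m = mutable.HashMap.empty[String, Event]
      p.foreach { case (userId, event) => m(userId) = event } // later events overwrite
      // run the expensive job once per user, on its last event only
      m.foreach { case (userId, lastEvent) => runBigJob(userId, lastEvent) }
    }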

Re: Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
Thanks for your answer. As I understand it, a consumer that stays caught up will read every message even with compaction. So for a pure Kafka + Spark Streaming setup it will not be a solution. Perhaps I could reconnect to the Kafka topic after each process to get the last state of the events and then compare it with the incoming ones.
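If re-reading the compacted topic between runs is acceptable, the direct API's batch reader could do it; a sketch with placeholder topic and offsets, assuming an existing SparkContext sc (real offsets would be fetched from the brokers per partition):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val untilOffset = 1000L // placeholder: query the latest offset per partition
    val ranges = Array(OffsetRange("state-topic", 0, 0L, untilOffset))

    val stateRdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, ranges)

    // With keyed messages each id lives in one Kafka partition, and the RDD
    // iterates a partition in offset order, so the last value seen per key wins.
    val latestByKey = stateRdd.mapPartitions { iter =>
      val m = scala.collection.mutable.HashMap.empty[String, String]
      iter.foreach { case (k, v) => m(k) = v }
      m.iterator
    }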

Re: Spark Streaming: process only last events

2016-01-06 Thread Cody Koeninger
Have you read http://kafka.apache.org/documentation.html#compaction ?
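Compaction only keeps the newest record per key, so it requires keyed messages; a producer sketch with illustrative broker, topic, and values (the topic itself would need cleanup.policy=compact set when created):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Keyed by user id, so compaction converges to the latest event per user.
    producer.send(new ProducerRecord("events", "user-42", "latest-state"))
    producer.close()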

Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
Context: Process data coming from Kafka and send results back to Kafka. Issue: Each event can take several seconds to process (work in progress to improve that). During that time, events (and RDDs) accumulate. Intermediate events (by key) do not have to be processed, only the last ones. So what would be a good way to process only the last event per key?
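A simple per-batch variant of "process only the last events" (the Event fields and process call are assumed names) is to collapse each micro-batch to one event per id before the expensive step:

    // events: DStream[Event], assuming each event carries its own timestamp
    case class Event(id: String, ts: Long, payload: String)

    val lastPerBatch = events
      .map(e => (e.id, e))
      .reduceByKey((a, b) => if (a.ts >= b.ts) a else b) // keep the newest per id

    lastPerBatch.foreachRDD { rdd =>
      rdd.foreach { case (id, event) => process(event) } // the expensive step
    }

This only deduplicates within a batch; events already queued in earlier, lagging batches still get processed, which is the part the rest of the thread is about.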