Thanks again Cody for your answer.
The idea here is to process all events, but only to launch the big job (the
one that takes longer than the batch interval) for events that are the last
ones for a given id, given the current state of the data. Knowing whether
they are the last is in fact my issue.
So I think I need two jobs.
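For illustration, a rough sketch of what I have in mind for the state-tracking part, using updateStateByKey (the Event case class, its eventTime field and the keyed DStream are just assumptions, not my real types):

import org.apache.spark.streaming.dstream.DStream

// Hypothetical event type; the real schema is not shown here.
case class Event(id: String, eventTime: Long, payload: String)

// Keep, for each id, the most recent event seen so far.
// updateStateByKey is stateful, so ssc.checkpoint(...) must be set.
def trackLatest(events: DStream[(String, Event)]): DStream[(String, Event)] =
  events.updateStateByKey[Event] { (newEvents: Seq[Event], state: Option[Event]) =>
    (newEvents ++ state).reduceOption((a, b) => if (a.eventTime >= b.eventTime) a else b)
  }

Even so, this only gives me the latest event seen so far per id; it still does not tell me whether a newer event will arrive in a later batch, which is exactly my problem.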
If your job consistently takes longer than the batch time to process, you
will keep lagging further and further behind. That's not sustainable; you
need to increase the batch size or decrease the processing time. In your
case, probably increase the batch size, since you're pre-filtering it down
to only 1 event per key.
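For reference, the batch interval is fixed when you create the StreamingContext, so increasing it looks roughly like this (60 seconds is just an example value, not a recommendation):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("kafka-events") // placeholder app name
// A larger batch interval means each micro-batch covers more events,
// so filtering down to one event per key removes proportionally more work.
val ssc = new StreamingContext(conf, Seconds(60))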
The following is my understanding of Spark Streaming; I could be wrong:
Spark Streaming processes data from a stream in micro-batches, one at a time.
When processing takes too long, the DStream's RDDs accumulate.
So in my case (my processing takes time) the DStream's RDDs accumulate. What
I want is to process only the last event per key, not every accumulated
intermediate one.
If you don't have hot users, you can use the user id as the hash key when
publishing into Kafka.
That will put all events for a given user in the same partition per batch.
Then you can do foreachPartition with a local map to store just a single
event per user, e.g.
foreachPartition { p =>
  val m = new java.util.HashMap[String, Event]() // one entry per user; types are placeholders
  ...
}
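Filling that fragment in, a rough sketch of the whole thing (Event, userId, eventTime and process are placeholder names, not your actual types):

import scala.collection.JavaConverters._
import org.apache.spark.streaming.dstream.DStream

// Placeholder event type and processing function.
case class Event(userId: String, eventTime: Long, payload: String)

def processLatestPerUser(stream: DStream[Event], process: Event => Unit): Unit =
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { p =>
      // Publishing keyed by user id means all of a user's events in this batch
      // are in the same partition, so a partition-local map is enough.
      val m = new java.util.HashMap[String, Event]()
      p.foreach { e =>
        val prev = m.get(e.userId)
        if (prev == null || e.eventTime >= prev.eventTime) m.put(e.userId, e)
      }
      // Only the last event per user gets the expensive processing.
      m.values.asScala.foreach(process)
    }
  }

That way the expensive job runs at most once per user per batch, no matter how many events accumulated.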
Thanks for your answer.
As I understand it, a consumer that stays caught up will read every message
even with compaction, so for a pure Kafka + Spark Streaming pipeline it will
not be a solution on its own.
Perhaps I could reconnect to the Kafka topic after each processing run to get
the last state of the events and then compare.
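To make that idea concrete, a rough sketch with the new 0.9 consumer (topic name, group id and plain String keys/values are placeholders):

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

// Rebuild a "latest value per key" view by re-reading the (compacted) topic from the beginning.
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "state-rebuild")            // placeholder group
props.put("auto.offset.reset", "earliest")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Arrays.asList("events")) // placeholder topic
val latest = scala.collection.mutable.Map[String, String]()
// A single poll will not usually reach the end of the topic;
// this would need to loop until the consumer is caught up.
consumer.poll(1000).asScala.foreach(r => latest(r.key) = r.value)
consumer.close()

Then I could compare what I am about to process against that map to decide whether it is still the most recent event for its key.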
Have you read
http://kafka.apache.org/documentation.html#compaction
On Wed, Jan 6, 2016 at 8:52 AM, Julien Naour wrote:
> Context: Process data coming from Kafka and send back results to Kafka.
>
> Issue: Each event could take several seconds to process (work in progress
> to improve that).
Context: Process data coming from Kafka and send back results to Kafka.
Issue: Each event can take several seconds to process (work in progress to
improve that). During that time, events (and RDDs) accumulate.
Intermediate events (by key) do not have to be processed, only the last
ones.
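To make "only the last ones" concrete, a minimal per-batch sketch (Event and eventTime are placeholder names) would be:

import org.apache.spark.streaming.dstream.DStream

case class Event(id: String, eventTime: Long, payload: String) // placeholder schema

// Within each micro-batch, keep only the newest event per key and drop the rest.
def latestPerBatch(keyed: DStream[(String, Event)]): DStream[(String, Event)] =
  keyed.reduceByKey((a, b) => if (a.eventTime >= b.eventTime) a else b)

That handles a single batch; the open question is how to deal with the events that keep arriving in later batches.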