Have you read the Kafka documentation on log compaction?

http://kafka.apache.org/documentation.html#compaction
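
If log compaction fits your case, the "keep only the latest value per key"
deduplication can happen in Kafka itself. A minimal sketch of enabling it on a
topic (topic name, partition count and ZooKeeper address are placeholders):

    bin/kafka-topics.sh --zookeeper localhost:2181 --create \
      --topic events --partitions 1 --replication-factor 1 \
      --config cleanup.policy=compact

One caveat, as far as I know: compaction only cleans older log segments, so a
consumer reading near the head of the log can still see intermediate records.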



On Wed, Jan 6, 2016 at 8:52 AM, Julien Naour <julna...@gmail.com> wrote:

> Context: Process data coming from Kafka and send back results to Kafka.
>
> Issue: Each event can take several seconds to process (work is in progress
> to improve that). During that time, events (and RDDs) accumulate.
> Intermediate events (by key) do not need to be processed, only the latest
> ones. So when one event finishes processing, it would be ideal if Spark
> Streaming skipped all events that are no longer the latest ones (by key).
>
> I'm not sure this can be done with the Spark Streaming API alone. As I
> understand Spark Streaming, DStream RDDs accumulate and are processed one by
> one, without taking into account whether others arrive afterwards.
>
> Possible solutions:
>
> Use only the Spark Streaming API, but I'm not sure how. updateStateByKey
> seems like a candidate, but I'm not sure it will work properly when DStream
> RDDs accumulate and only the latest events per key should be processed.
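
On updateStateByKey: a minimal sketch of keeping only the newest event per key
across batches, so older events are never handed to the expensive processing.
The event shape (key, timestamp, payload), the socket source standing in for
Kafka, and the checkpoint path are all assumptions:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object LatestByKeySketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("latest-by-key").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint("/tmp/latest-by-key-checkpoint") // required by updateStateByKey

        // Stand-in source; in the real job this would be the Kafka DStream.
        // Each line is assumed to be "key,timestamp,payload".
        val events = ssc.socketTextStream("localhost", 9999).map { line =>
          val Array(k, ts, payload) = line.split(",", 3)
          (k, (ts.toLong, payload))
        }

        // Keep only the newest event per key: older events arriving in the same
        // batch, or already held in state, are dropped and never processed.
        val latest = events.updateStateByKey[(Long, String)] {
          (newEvents: Seq[(Long, String)], state: Option[(Long, String)]) =>
            val candidates = state.toSeq ++ newEvents
            if (candidates.isEmpty) None else Some(candidates.maxBy(_._1))
        }

        latest.print() // replace with the real per-key processing

        ssc.start()
        ssc.awaitTermination()
      }
    }

One caveat: updateStateByKey re-emits the state for every key on every batch,
and the state above never expires, so the downstream step would still need to
restrict itself to keys that actually changed (or look at mapWithState in 1.6).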
>
> Have two Spark Streaming pipelines. One gets the last updated event per key
> and stores it in a map or a database. The second pipeline processes an event
> only if it is still the latest one, as indicated by the first pipeline.
>
>
> Sub-questions for the second solution:
>
> Could two pipelines share the same StreamingContext and process the same
> DStream at different speeds (light processing vs. heavy)?
>
> Is it easy to share values (a map, for example) between pipelines without
> using an external database? I think an accumulator or a broadcast variable
> could work, but I'm not sure about sharing them between two pipelines.
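
On both sub-questions: within one StreamingContext you can register several
output operations on the same DStream, which is one way to get the "two
pipelines". A minimal sketch, reusing the events DStream from the sketch above
(the TrieMap and the processing body are hypothetical): a cheap foreachRDD
records the newest timestamp per key in a driver-side map, and a heavier one
consults an immutable snapshot of that map:

    import scala.collection.concurrent.TrieMap

    // Driver-side view of the newest timestamp seen per key. This works only
    // because both foreachRDD bodies below run on the driver; the heavy
    // pipeline ships an immutable snapshot of it to the executors.
    val newestTs = TrieMap.empty[String, Long]

    // "Fast" pipeline: record the newest timestamp per key in each batch.
    events.foreachRDD { rdd =>
      rdd.mapValues { case (ts, _) => ts }
         .reduceByKey((a, b) => math.max(a, b))
         .collect()
         .foreach { case (k, ts) => if (newestTs.get(k).forall(_ < ts)) newestTs.put(k, ts) }
    }

    // "Heavy" pipeline: process an event only if it is still the newest known
    // for its key when the batch is handled.
    events.foreachRDD { rdd =>
      val snapshot = newestTs.toMap // immutable, serializable snapshot
      rdd.filter { case (k, (ts, _)) => snapshot.get(k).forall(_ <= ts) }
         .foreach { case (k, (ts, payload)) =>
           // hypothetical expensive per-event processing goes here
           println(s"processing latest event for $k at $ts: $payload")
         }
    }

Two caveats, as far as I know: output operations in a single StreamingContext
run sequentially for each batch unless spark.streaming.concurrentJobs is
raised, so this does not really give two speeds; and a broadcast variable would
not help for the shared map, since it is read-only once created.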
>
> Regards,
>
> Julien Naour
>
