Have you read http://kafka.apache.org/documentation.html#compaction ?
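For context, log compaction makes Kafka itself retain only the most recent record per key in a topic, which is close to the "only the last event per key matters" requirement below. As a minimal sketch of turning it on for an existing topic, assuming a modern Kafka AdminClient and a hypothetical topic name "results" (with the clients current at the time of this thread, the same cleanup.policy=compact topic setting would be applied via the command-line tools instead):

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, AlterConfigOp, ConfigEntry}
    import org.apache.kafka.common.config.ConfigResource

    object EnableLogCompaction {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "localhost:9092") // assumed broker address

        val admin = AdminClient.create(props)
        try {
          // Point at the (hypothetical) results topic and switch its cleanup
          // policy to compaction, so Kafka keeps only the latest record per key.
          val topic = new ConfigResource(ConfigResource.Type.TOPIC, "results")
          val setCompact = new AlterConfigOp(
            new ConfigEntry("cleanup.policy", "compact"), AlterConfigOp.OpType.SET)
          val ops: java.util.Collection[AlterConfigOp] = Collections.singletonList(setCompact)

          admin.incrementalAlterConfigs(Collections.singletonMap(topic, ops)).all().get()
        } finally {
          admin.close()
        }
      }
    }

Compaction only deduplicates what is stored in the topic over time; it does not by itself stop the streaming job from seeing every intermediate event in a batch, so it complements rather than replaces the per-key deduplication discussed below.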
On Wed, Jan 6, 2016 at 8:52 AM, Julien Naour <julna...@gmail.com> wrote:
> Context: Process data coming from Kafka and send the results back to Kafka.
>
> Issue: Each event can take several seconds to process (work is in progress
> to improve that). During that time, events (and RDDs) accumulate.
> Intermediate events (by key) do not have to be processed, only the last
> ones. So when one round of processing finishes, it would be ideal for Spark
> Streaming to skip all events that are not the current last ones (by key).
>
> I'm not sure this can be done using only the Spark Streaming API. As I
> understand Spark Streaming, DStream RDDs accumulate and are processed one
> by one, without considering whether others come afterwards.
>
> Possible solutions:
>
> Use only the Spark Streaming API, but I'm not sure how. updateStateByKey
> seems to be a solution, but I'm not sure it will work properly when
> DStream RDDs accumulate and only the last events by key should be processed.
>
> Have two Spark Streaming pipelines. One gets the last updated event by key
> and stores it in a map or a database. The second pipeline processes events
> only if they are the last ones, as indicated by the first pipeline.
>
> Sub-questions for the second solution:
>
> Could two pipelines share the same StreamingContext and process the same
> DStream at different speeds (slow processing vs. fast)?
>
> Is it easily possible to share values (a map, for example) between
> pipelines without using an external database? I think an
> accumulator/broadcast could work, but between two pipelines I'm not sure.
>
> Regards,
>
> Julien Naour
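On the updateStateByKey idea raised in the quoted message, here is a minimal sketch of "keep only the newest event per key", assuming events carry a timestamp and arrive as (key, event) pairs. The Event case class, the queueStream stand-in for the Kafka source, the batch interval, and the checkpoint path are all placeholders, not anything from the original job:

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical event type: a key plus a timestamp and payload.
    case class Event(key: String, timestamp: Long, payload: String)

    object KeepLatestByKey {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("keep-latest-by-key").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10)) // assumed batch interval
        ssc.checkpoint("/tmp/keep-latest-checkpoint")     // required by updateStateByKey

        // In the real job this would be the Kafka direct stream; the queue
        // stream here is just a stand-in for any DStream[(String, Event)].
        val events: DStream[(String, Event)] =
          ssc.queueStream(new mutable.Queue[RDD[(String, Event)]]())

        // 1) Within each batch, collapse to the newest event per key, so the
        //    expensive processing runs at most once per key per batch.
        val latestPerBatch = events.reduceByKey((a, b) => if (a.timestamp >= b.timestamp) a else b)

        // 2) Keep the newest event ever seen per key as streaming state, so an
        //    older or duplicate event can never overwrite a newer one.
        val latestOverall = latestPerBatch.updateStateByKey[Event] {
          (newEvents: Seq[Event], state: Option[Event]) =>
            (newEvents ++ state).reduceOption((a, b) => if (a.timestamp >= b.timestamp) a else b)
        }

        latestOverall.print() // replace with the expensive processing + write back to Kafka

        ssc.start()
        ssc.awaitTermination()
      }
    }

One caveat with this sketch: updateStateByKey re-emits the full state (every key) on every batch, so in practice you would also need to track whether a key's latest event actually changed before re-running the expensive processing. mapWithState, new in Spark 1.6, only emits the keys touched in the current batch and may be a better fit for that reason.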