We're a group of experienced backend developers who are fairly new to Spark
Streaming (and Scala) and very interested in using the new (in 1.3)
DirectKafkaInputDStream impl as part of the metrics reporting service we're
building.
Our flow involves reading in metric events, lightly modifying some o
Cody Koeninger-2 wrote
> What's your schema for the offset table, and what's the definition of
> writeOffset ?
The schema is the same as the one in your post: topic | partition | offset
The writeOffset method is nearly identical:
def writeOffset(osr: OffsetRange)(implicit session: DBSession): Unit = {
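
For context, a minimal sketch of how such a method could look, assuming ScalikeJDBC and a hypothetical kafka_offsets table with the topic | partition | offset columns above (not the actual code from this thread):

import org.apache.spark.streaming.kafka.OffsetRange
import scalikejdbc._

def writeOffset(osr: OffsetRange)(implicit session: DBSession): Unit = {
  // Store the end of the processed range so the next run can resume from it.
  // Assumes one row per (topic, partition) already exists in kafka_offsets;
  // table and column names here are illustrative only.
  sql"""
    update kafka_offsets
    set offset = ${osr.untilOffset}
    where topic = ${osr.topic} and partition = ${osr.partition}
  """.update.apply()
}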
Cody Koeninger-2 wrote
> In fact, you're using the 2 arg form of reduce by key to shrink it down to
> 1 partition
>
> reduceByKey(sumFunc, 1)
>
> But you started with 4 kafka partitions? So they're definitely no longer
> 1:1
True. I added the second arg because we were seeing multiple threads
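
For reference, a rough sketch of why the shuffle matters here, assuming stream is the DStream returned by KafkaUtils.createDirectStream and sumFunc is the reduce function from the job; the HasOffsetRanges cast is only valid on the RDD the stream hands you, before any shuffle:

import org.apache.spark.streaming.kafka.HasOffsetRanges

stream.foreachRDD { rdd =>
  // Partitions of this RDD map 1:1 to Kafka partitions, so the offset
  // ranges can be captured at this point.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // reduceByKey(sumFunc, 1) shuffles everything into a single partition,
  // so from here on partition indexes no longer correspond to Kafka partitions.
  val summed = rdd.reduceByKey(sumFunc, 1)
  // ... process summed, then persist results together with offsetRanges ...
}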
We've been using the new DirectKafkaInputDStream to implement an exactly-once
processing solution that tracks the provided offset ranges within the same
transaction that persists our data results. When an exception is thrown
within the processing loop and the configured number of retries is
exhausted
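
For context, the general shape of that pattern, sketched with ScalikeJDBC's DB.localTx and hypothetical transform / writeResult / writeOffset helpers; this is an illustration of the approach, not the actual code:

import org.apache.spark.streaming.kafka.HasOffsetRanges
import scalikejdbc._

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val results = rdd.map(transform).collect()  // transform is application-specific

  DB.localTx { implicit session =>
    // Results and offsets commit atomically: if either write throws, the
    // transaction rolls back and the batch is reprocessed on restart.
    results.foreach(writeResult)
    offsetRanges.foreach(writeOffset)
  }
}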