practical usage of the new "exactly-once" supporting DirectKafkaInputDStream

2015-04-30 Thread badgerpants
We're a group of experienced backend developers who are fairly new to Spark Streaming (and Scala) and very interested in using the new (in 1.3) DirectKafkaInputDStream impl as part of the metrics reporting service we're building. Our flow involves reading in metric events, lightly modifying some o…
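For anyone following along, a minimal sketch of how such a flow might be wired up against the Spark 1.3 direct Kafka API. The broker list, topic name, batch interval, and the trivial transform are illustrative assumptions, not the poster's actual values:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object MetricsStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("metrics-reporting")
        val ssc = new StreamingContext(conf, Seconds(5))

        // Placeholder brokers and topic for the metric-event feed
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
        val topics = Set("metric-events")

        // New in 1.3: no receiver; each RDD partition maps 1:1 to a Kafka
        // topic/partition, with the exact offset range attached to the RDD
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topics)

        // "Lightly modifying" stands in for whatever enrichment the service does
        stream.map { case (_, event) => event.trim }.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }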

Re: practical usage of the new "exactly-once" supporting DirectKafkaInputDStream

2015-04-30 Thread badgerpants
Cody Koeninger-2 wrote
> What's your schema for the offset table, and what's the definition of
> writeOffset ?

The schema is the same as the one in your post: topic | partition | offset

The writeOffset is nearly identical:

def writeOffset(osr: OffsetRange)(implicit session: DBSession): Unit = { …
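A sketch of what a ScalikeJDBC-based writeOffset of this shape could look like. The txn_offsets table and its abbreviated column names (part/off, dodging SQL reserved words) are assumptions standing in for the topic | partition | offset schema given above:

    import org.apache.spark.streaming.kafka.OffsetRange
    import scalikejdbc._

    def writeOffset(osr: OffsetRange)(implicit session: DBSession): Unit = {
      // Conditional update: succeeds only if the stored offset still equals
      // the range's fromOffset, i.e. nothing else has moved it in the meantime
      val updated = sql"""
        update txn_offsets set off = ${osr.untilOffset}
        where topic = ${osr.topic} and part = ${osr.partition}
          and off = ${osr.fromOffset}
      """.update.apply()
      if (updated != 1) {
        throw new IllegalStateException(
          s"unexpected offset state for ${osr.topic}/${osr.partition}")
      }
    }

The where-clause doubles as a safety check: a mismatched stored offset aborts the enclosing transaction rather than silently double-counting a batch.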

Re: practical usage of the new "exactly-once" supporting DirectKafkaInputDStream

2015-04-30 Thread badgerpants
Cody Koeninger-2 wrote
> In fact, you're using the 2 arg form of reduce by key to shrink it down to
> 1 partition
>
> reduceByKey(sumFunc, 1)
>
> But you started with 4 kafka partitions? So they're definitely no longer
> 1:1

True. I added the second arg because we were seeing multiple threads…
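The underlying point is that offset ranges must be captured on the RDD that comes straight from Kafka, before any shuffle (reduceByKey included, with or without the second arg) destroys the 1:1 RDD-to-Kafka partition mapping. A sketch of that pattern; sumFunc and the key extraction are placeholders:

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    var offsetRanges = Array.empty[OffsetRange]
    val sumFunc: (Long, Long) => Long = _ + _

    stream.transform { rdd =>
      // Grab offsets here, while partitions still map 1:1 to Kafka partitions
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.map { case (_, event) =>
      (event, 1L)              // placeholder key extraction
    }.reduceByKey(sumFunc)     // the shuffle: the 1:1 mapping ends here
      .foreachRDD { rdd =>
        // transform and foreachRDD bodies run on the driver once per batch,
        // in order, so offsetRanges holds this batch's ranges at this point
        rdd.collect().foreach(println)
      }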

Error recovery strategies using the DirectKafkaInputDStream

2015-05-14 Thread badgerpants
We've been using the new DirectKafkaInputDStream to implement an exactly-once processing solution that tracks the provided offset ranges within the same transaction that persists our data results. When an exception is thrown within the processing loop and the configured number of retries is exhaus…
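A sketch of that overall shape, assuming the work runs per partition directly on the direct stream (no prior shuffle), so each task can pair its data with its OffsetRange. Connection-pool setup is omitted, and the metrics table and insert SQL are placeholders:

    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}
    import scalikejdbc._

    stream.foreachRDD { rdd =>
      // Only valid on the RDD created directly from Kafka, where partitions
      // are still 1:1 with Kafka partitions and partitionId indexes the ranges
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { iter =>
        val osr = offsetRanges(TaskContext.get.partitionId)
        // One local transaction per partition: the results and the offset
        // update commit or roll back together. If processing throws and the
        // task retries (spark.task.maxFailures) are exhausted, nothing is
        // committed, so the batch is reprocessed from the old offset on
        // restart -- exactly-once from the database's point of view.
        DB.localTx { implicit session =>
          iter.foreach { case (_, event) =>
            // placeholder persistence of one metric event
            sql"insert into metrics (payload) values ($event)".update.apply()
          }
          writeOffset(osr) // same shape as the writeOffset sketched above
        }
      }
    }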