Hey Cody (et al.),

A few more questions related to this. It looks like our missing-data issues
are fixed with this approach. Could you shed some light on a few questions
that came up?

---------------------

Processing our data inside a single foreachPartition function appears to be
very different from the pattern seen in the programming guide. Does this
become problematic with additional, interleaved reduce/filter/map steps?

```
// a typical pattern from the programming guide?
stream
  .map { ... }
  .reduce { ... }
  .filter { ... }
  .reduce { ... }
  .foreachRDD { writeToDb }

// with foreachPartition
rdd.foreachPartition { iter =>
  iter
    .map { ... }
    .reduce { ... }
    .filter { ... }
    .reduce { ... }
}
```
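For reference, the rough shape I have in mind is below. `stream` is the
direct Kafka input stream, and the transaction that writes the results
together with the offsets is elided into a comment; this is just a sketch of
what I mean by "processing inside a single foreachPartition function."

```
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Grab the offset ranges on the driver, before doing any work on the partitions.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { iter =>
    val range: OffsetRange = offsetRanges(TaskContext.get.partitionId)
    // all of the map/reduce/filter work happens on `iter` here, and the
    // results get committed in the same DB transaction as range.untilOffset
  }
}
```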
---------------------------------

Could the above be simplified by having

one Kafka partition per DStream, rather than
one Kafka partition per RDD partition?

That way, we wouldn't need to do our processing inside each partition, as
there would be only one set of Kafka metadata to commit.

Presumably, one could `join` DStreams when topic-level aggregates were
needed.
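Concretely, something like the sketch below is what I'm imagining, using the
createDirectStream overload that takes explicit fromOffsets. The partition
count, starting offsets, `ssc`, and `kafkaParams` are all placeholders; in
practice the offsets would come from our own offset store.

```
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Placeholder values; real offsets would be loaded from our offset store.
val numPartitions = 3
val startingOffsets: Map[Int, Long] = Map(0 -> 0L, 1 -> 0L, 2 -> 0L)

// One direct stream per Kafka partition, so each stream carries exactly
// one partition's worth of offset metadata.
val perPartitionStreams = (0 until numPartitions).map { p =>
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc,
    kafkaParams,
    Map(TopicAndPartition("mytopic", p) -> startingOffsets(p)),
    (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
  )
}

// Topic-level aggregates could then come from unioning (or joining) the streams.
val wholeTopic = ssc.union(perPartitionStreams)
```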

It seems this was the approach Michael Noll took in his blog post
(http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/),
although his primary motivation appears to be maintaining high throughput /
parallelism rather than Kafka metadata.

---------------------------------

From the blog post:

"... there is no long-running receiver task that occupies a core per stream
regardless of what the message volume is."

Is this because data is retrieved by polling rather than by maintaining a
socket? Is it still the case that there is only one receiver process per
DStream? If so, maybe it is wise to keep DStreams and Kafka partitions 1:1,
or else to discover the machine's NIC limit?

Can you think of a reason not to do this? Cluster utilization, or the like,
perhaps?
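If I have the two APIs straight, the contrast I'm picturing is roughly the
following; `ssc`, the ZooKeeper/broker strings, and the topic name are
placeholders, and the comments are just my current understanding.

```
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based: a long-running receiver task holds the Kafka connection
// (and, as I understand it, a core) for the lifetime of the stream.
val receiverStream = KafkaUtils.createStream(
  ssc, "zkhost:2181", "my-consumer-group", Map("mytopic" -> 1))

// Direct: no receiver task; on each batch interval the driver decides the
// offset ranges and the executors fetch just those ranges.
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker:9092"), Set("mytopic"))
```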

--------------------------------

And this may seem a silly question, but does `foreachPartition` guarantee that
a single worker will process the passed function? Or might two workers split
the work?

E.g., foreachPartition(f)

Worker 1:     f( Iterator[partition 1 records 1 - 50] )
Worker 2:     f( Iterator[partition 1 records 51 - 100] )

It is unclear from the scaladocs
(https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD).
But you can imagine that, if it is critical that this data be committed in a
single transaction, two workers would have issues.
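
For what it's worth, the sanity check I had in mind is just logging what each
invocation of the function sees, assuming TaskContext.get is available on the
executors:

```
import org.apache.spark.TaskContext

// Diagnostic only: record the partition id of the running task and how many
// records its iterator yields. iter.size consumes the iterator, so this
// wouldn't be combined with the real processing.
rdd.foreachPartition { iter =>
  val pid = TaskContext.get.partitionId
  println(s"partition $pid: ${iter.size} records seen by this call")
}
```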



-- Will O



