Fwd: About Spark Streaming overview in Samza docs

Jay Kreps Tue, 02 Jun 2015 09:49:48 -0700

Hey guys,

Here were some critiques of our system comparison page from Tathagata at
Databricks.

-Jay

---------- Forwarded message ----------
From: Tathagata Das <t...@databricks.com>
Date: Thu, May 14, 2015 at 1:15 PM
Subject: About Spark Streaming overview in Samza docs
To: Jay Kreps <j...@confluent.io>

Hello Jay,

I am not sure if you remember me from our earlier (a year or so) phone
conversation along with Patrick Wendell, so let me introduce myself. I am
Tathagata Das (aka TD), and I am the technical lead behind Spark Streaming.
We had chatted earlier about various topics related to Kafka and I hope we
can chat more about it some time soon.

However, in this mail, I wanted to talk a bit about Samza's description of
Spark Streaming
<http://samza.apache.org/learn/documentation/0.9/comparisons/spark-streaming.html>.
Though I sort of assumed that you are the right person to talk. But that
isnt the case, feel free to redirect me to whoever you think is the best
person for this.

The overview of Spark Streaming is pretty good! I myself would not have
been able to put the high-level architecture of Spark Streaming so
succinctly. That said, there are a few pieces of information that are
outdated and it will be good to update the page to avoid confusion. Here
are some of them.

1.* " Spark Streaming does not gurantee at-least-once or at-most-once
messaging semantics"* - This is outdated information. In Spark 1.2, we
introduced write ahead logs
<https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html>
that can guarantee at least once processing for any reliable source,
despite driver and worker failures. In addition, in Spark 1.3 we introduced a
new way
<https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html>
to process data from Kafka, such that it achieves end-to-end exactly-once
processing if data store updates are idempotent or transactional (BTW, did
I say Kafka is *amazing* which allowed us to do this crazy new approach?).

2. *"Spark Streaming may lose data if the failure happens when the data is
received but not yet replicated to other nodes (also see SPARK-1647)"* -
Again, this has changed in between Spark 1.1 - 1.3. For Flume, we added
Flume polling stream that uses Flume transactions to guarantee that data is
properly replicated or retransmitted on receiver failure. Driver failures
handled by write ahead logs. For Kafka, the new approach does not even need
replication as it treats Kafka like a file system, reading segments of log
as needed.

3. *"it is unsuitable for nondeterministic processing, e.g. a randomized
machine learning algorithm"* - It is incorrect to say that Spark Streaming
is unsuitable. We suggest using deterministic operations only to ensure
that the developers always get the expected results even if there are
failures. Just like MapReduce, there is nothing stopping any user from
implementing a non-determinstic algorithm on Spark Streaming, as long as
the user is aware of its consequence of fault-tolerance guarantees (results
may change due to failures). Furthermore, randomized streaming machine
learning algorithms can still be implemented using deterministic
transformations (using pseudo random numbers, etc.). There are quite a few
random sampling (e.g. RDD.sample()
<https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD>)
and randomized algorithms
<https://databricks.com/blog/2015/01/21/random-forests-and-boosting-in-mllib.html>
in core Spark and MLlib (Spark's machine learning library), and the same
techniques can be used to implement "deterministic" randomized machine
learning algorithms on Spark Streaming.

4. *"When a driver node fails in Spark Streaming, Spark’s standalone
cluster mode will restart the driver node automatically. But it is
currently not supported in YARN and Mesos."* - YARN supports automatically
restarting the AM, global default being at most 1 restart (
yarn.resourcemanager.am.max-attempts
<http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml>).
On Mesos, applications are often launched using Marathon
<https://mesosphere.github.io/marathon/docs/>, which also supports
restarting.

5. *"Samza is still young, but has just released version 0.7.0."* -
Incorrect ;)

Sorry for this long post. I am happy to get on phone/hangout/skype with you
if more clarifications are needed. And independent of all this, feel free
to email me about anything anytime.

Thanks!

TD

Fwd: About Spark Streaming overview in Samza docs

Reply via email to