'this 2-node replication is mainly for failover in case the receiver dies
while data is in flight.  there's still a chance of data loss as there's no
write-ahead log on the hot path, but this is being addressed.'

Can you comment a little on how this will be addressed?  Will there be a
durable WAL?  Is there a JIRA for tracking this effort?

I am curious whether, without a WAL, you can avoid this data loss through
explicit management of Kafka offsets, e.g. don't commit an offset until the
data has been replicated to multiple nodes, or maybe not until it has been
processed.  The incoming data is always durably stored on disk in Kafka, so
it can be replayed in failure scenarios to avoid data loss if the offsets
are managed properly.
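
To make the idea concrete, here is a rough (non-Spark) sketch of the kind of
offset management I have in mind, written against the kafka-clients
KafkaConsumer API with auto-commit disabled; the topic, group id, and the
replicateOrProcess hook are made up for illustration:

import java.util.{Arrays, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}

object ManualOffsetCommitSketch {

  // placeholder: report true once the batch has been replicated to multiple
  // nodes (or fully processed), whichever policy is chosen
  def replicateOrProcess(batch: Seq[ConsumerRecord[String, String]]): Boolean = true

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "replicated-ingest")   // made-up group id
    props.put("enable.auto.commit", "false")     // we decide when offsets advance
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Arrays.asList("events"))  // made-up topic

    while (true) {
      val batch = consumer.poll(500L).asScala.toSeq
      if (batch.nonEmpty && replicateOrProcess(batch)) {
        // only now advance the committed offset; if we crash before this point,
        // a restart simply re-reads from the last committed offset
        consumer.commitSync()
      }
    }
  }
}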




On Thu, Aug 28, 2014 at 12:02 PM, Chris Fregly <ch...@fregly.com> wrote:

> @bharat-
>
> overall, i've noticed a lot of confusion about how Spark Streaming scales
> - as well as how it handles failover and checkpointing, but we can discuss
> that separately.
>
> there are actually 2 dimensions to scaling here:  receiving and processing.
>
> *Receiving*
> receiving can be scaled out by submitting additional DStreams/Receivers to
> the cluster, as i've done in the Kinesis example.  in fact, i purposely
> chose to submit multiple receivers there because i feel it should be the
> norm and not the exception - particularly for partitioned and
> checkpoint-capable streaming systems like Kafka and Kinesis.  it's the
> only way to scale.
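>
> here's a rough sketch of that pattern using the Kafka receiver from the
> spark-streaming-kafka module (the topic, zk quorum, and receiver count are
> made up for illustration):
>
>   import org.apache.spark.SparkConf
>   import org.apache.spark.streaming.{Seconds, StreamingContext}
>   import org.apache.spark.streaming.kafka.KafkaUtils
>
>   val sparkConf = new SparkConf().setAppName("multi-receiver-sketch")
>   val ssc = new StreamingContext(sparkConf, Seconds(2))
>
>   // one receiver per input DStream: 4 DStreams = 4 receivers pulling from
>   // Kafka in parallel across the cluster
>   val kafkaStreams = (1 to 4).map { _ =>
>     KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("events" -> 1))
>   }
>
>   // union them back into a single DStream for downstream processing
>   val unified = ssc.union(kafkaStreams)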
>
> a side note here is that each receiver running in the cluster immediately
> replicates its received data to 1 other node for fault-tolerance of that
> specific receiver.  this is where the confusion lies.  this 2-node
> replication is mainly for failover in case the receiver dies while data is
> in flight.  there's still a chance of data loss as there's no write-ahead
> log on the hot path, but this is being addressed.
>
> this is mentioned in the docs here:
> https://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving
>
> *Processing*
> once data is received, tasks are scheduled across the Spark cluster just
> like any other non-streaming job, and you can specify the number of
> partitions for reduces, etc.  this is the part of scaling that is sometimes
> overlooked - probably because it "works just like regular Spark", but it is
> worth highlighting.
>
> Here's a blurb in the docs:
> https://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-processing
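>
> as a tiny illustration, continuing the sketch above (the word-count logic
> and the partition counts of 32 and 16 are made up):
>
>   import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions
>
>   // count words in each batch, spreading the shuffle for the reduce
>   // across 32 tasks
>   val counts = unified.flatMap { case (_, msg) => msg.split(" ") }
>                       .map(word => (word, 1))
>                       .reduceByKey(_ + _, 32)
>
>   // or simply repartition the received data before heavier processing
>   val spread = unified.repartition(16)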
>
> the other thing that's confusing with Spark Streaming is that in Scala,
> you need to explicitly
>
> import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions
>
> in order to pick up the implicits that allow DStream.reduceByKey and such
> (versus DStream.transform(rddBatch => rddBatch.reduceByKey(_ + _)))
>
> in other words, DStreams appear to be relatively featureless until you
> discover this implicit.  otherwise, you need to operate on the underlying
> RDDs explicitly, which is not ideal.
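>
> a quick before/after to make that concrete (assume words is a made-up
> DStream[String] of tokens):
>
>   // with the implicit in scope, pair DStreams get reduceByKey directly
>   import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions
>   val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
>
>   // without it, you drop to the RDDs of each batch yourself
>   // (older releases may also need import org.apache.spark.SparkContext._
>   // for the RDD-level pair functions)
>   val counts2 = words.map(w => (w, 1)).transform(rdd => rdd.reduceByKey(_ + _))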
>
> the Kinesis example referenced earlier in the thread uses the DStream
> implicits.
>
>
> side note to all of this - i've recently convinced the publisher of my
> upcoming book, Spark In Action, to let me jump ahead and write the Spark
> Streaming chapter before other, more well-understood libraries.  early
> release is in a month or so.  sign up @ http://sparkinaction.com if you
> wanna get notified.
>
> shameless plug that i wouldn't otherwise do, but i really think it will
> help clear a lot of confusion in this area as i hear these questions asked
> a lot in my talks and such.  and i think a clear, crisp story on scaling
> and fault-tolerance will help Spark Streaming's adoption.
>
> hope that helps!
>
> -chris
>
>
>
>
> On Wed, Aug 27, 2014 at 6:32 PM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>> I agree. This issue should be fixed in Spark rather than relying on the
>> replay of Kafka messages.
>>
>> Dib
>> On Aug 28, 2014 6:45 AM, "RodrigoB" <rodrigo.boav...@aspect.com> wrote:
>>
>>> Dibyendu,
>>>
>>> Tnks for getting back.
>>>
>>> I believe you are absolutely right. We were under the assumption that the
>>> raw data was being recomputed, but after further tests that's not the case.
>>> This applies to Kafka as well.
>>>
>>> The issue is of major priority, fortunately.
>>>
>>> Regarding your suggestion, I would prefer to have the problem resolved
>>> within Spark's internals, since once the data is replicated we should be
>>> able to access it once more and not have to pull it back again from Kafka
>>> or any other stream affected by this issue. If, for example, there is a
>>> large number of batches to be recomputed, I would rather have them
>>> recomputed in a distributed fashion than overload the batch interval with
>>> a huge number of Kafka messages.
>>>
>>> I do not yet have enough know-how about where the issue lies, or about the
>>> internal Spark code, so I can't really tell how difficult the
>>> implementation will be.
>>>
>>> tnks,
>>> Rod
>>>
>>>
>>>
>
