This has been discussed in a number of threads in this mailing list. Here is a summary.
1. Processing of batch T+1 always starts after all the processing of batch T has completed. But here a "batch" is defined by the data received by all the receivers running in the system within the batch interval. Since all the data is divided internally into blocks and partitions, there is no clear mapping between the original order in the sources and the ordering in the RDDs generated by the batches.

2. However, in the specific case of the Direct Kafka stream, since there is a one-to-one mapping between Kafka partitions and RDD partitions (of the RDDs generated by the direct Kafka stream), there is a per-partition ordering guarantee. For example, partition 2 of every direct Kafka RDD maps to partition 2 of a Kafka topic, so the data in partition 2 of consecutive RDDs will be in the same order as it was in Kafka. This is a special case.

Here is another relevant thread:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3ccao05p7de8dpxs5dyfvrni_yzv22s5z26b9jvyayj-r+pwy5...@mail.gmail.com%3E

On Sun, Jul 12, 2015 at 8:36 PM, anshu shukla <anshushuk...@gmail.com> wrote:

> Anyone who can give some highlight over HOW SPARK DOES *ORDERING OF
> BATCHES*.
>
> On Sat, Jul 11, 2015 at 9:19 AM, anshu shukla <anshushuk...@gmail.com>
> wrote:
>
>> Thanks Ayan,
>>
>> I was curious to know *how Spark does it*. Is there any *documentation*
>> where I can get the details about that? Will you please point me to some
>> detailed link etc.
>>
>> Maybe it does something like *transactional topologies in Storm* (
>> https://storm.apache.org/documentation/Transactional-topologies.html)
>>
>> On Sat, Jul 11, 2015 at 9:13 AM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> AFAIK, it is guaranteed that batch t+1 will not start processing until
>>> batch t is done.
>>>
>>> Ordering within batch - what do you mean by that?
>>> In essence, the (mini) batch will get distributed in partitions like a
>>> normal RDD, so rdd.zipWithIndex should give a way to order them by the
>>> time they are received.
>>>
>>> On Sat, Jul 11, 2015 at 12:50 PM, anshu shukla <anshushuk...@gmail.com>
>>> wrote:
>>>
>>>> Hey,
>>>>
>>>> Is there any *guarantee of fixed ordering among the batches/RDDs*?
>>>>
>>>> After searching a lot I found there is no ordering by default (from
>>>> the framework itself), not only *batch wise* but *also within
>>>> batches*. But I doubt whether there is any change from old Spark
>>>> versions to Spark 1.4 in this context.
>>>>
>>>> Any comments please!!
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Anshu Shukla
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>
>> --
>> Thanks & Regards,
>> Anshu Shukla
>
> --
> Thanks & Regards,
> Anshu Shukla
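The per-partition guarantee described for the direct Kafka stream can be pictured without Spark at all. Below is a minimal sketch in plain Python (the partition contents and batch size are made up for illustration): each Kafka topic partition maps one-to-one to the same-numbered RDD partition in every batch, so concatenating partition i across consecutive batch RDDs reproduces Kafka's per-partition order, even though there is no global order across partitions.

```python
# Hypothetical data: each Kafka partition holds records in offset order.
kafka_partitions = {
    0: ["a0", "a1", "a2", "a3"],
    1: ["b0", "b1", "b2", "b3"],
}

def batches(partitions, batch_size):
    """Split each Kafka partition into per-batch slices.

    Models the direct Kafka stream: partition i of every batch RDD
    maps to Kafka partition i."""
    n = max(len(recs) for recs in partitions.values())
    for start in range(0, n, batch_size):
        yield {i: recs[start:start + batch_size]
               for i, recs in partitions.items()}

# Concatenating partition i across consecutive batch RDDs
# reproduces Kafka partition i's original offset order.
seen = {i: [] for i in kafka_partitions}
for batch in batches(kafka_partitions, 2):
    for i, recs in batch.items():
        seen[i].extend(recs)

assert seen == kafka_partitions  # per-partition order preserved
```

Note what this does not give you: there is no ordering between partition 0 and partition 1, only within each partition across batches.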
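On the rdd.zipWithIndex suggestion from the thread: zipWithIndex numbers elements first by partition index and then by position within each partition, so the resulting index reflects the RDD's partition layout rather than wall-clock arrival time. A rough emulation in plain Python (the two-partition mini-batch is made-up data):

```python
def zip_with_index(partitions):
    """Emulate RDD.zipWithIndex: elements are numbered by partition
    order first, then by position within each partition."""
    out, idx = [], 0
    for part in partitions:
        for elem in part:
            out.append((elem, idx))
            idx += 1
    return out

rdd = [["x", "y"], ["z"]]  # a 2-partition mini-batch (illustrative data)
print(zip_with_index(rdd))  # → [('x', 0), ('y', 1), ('z', 2)]
```

So for a receiver-based stream the index gives a stable, repeatable order over the batch, but it only matches receive order to the extent that the block/partition layout does.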