This has been discussed in a number of threads in this mailing list. Here is a summary.
1. Processing of batch T+1 always starts after all the processing of batch T has completed. But here a "batch" is defined by the data received by all the receivers running in the system within the batch interval. Since all the data is divided internally into blocks and partitions, there is no clear mapping between the original order in the sources and the ordering in the RDDs generated by the batches.

2. However, in the specific case of the Direct Kafka stream, since there is a one-to-one mapping between Kafka partitions and RDD partitions (of the RDDs generated by the direct Kafka stream), there is a per-partition ordering guarantee. For example, partition 2 of every direct Kafka RDD maps to partition 2 of a Kafka topic, so the data in partition 2 of consecutive RDDs will be in the same order as it was in Kafka. This is a special case.

Here is another relevant thread:
http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3ccao05p7de8dpxs5dyfvrni_yzv22s5z26b9jvyayj-r+pwy5...@mail.gmail.com%3E

On Sun, Jul 12, 2015 at 8:36 PM, anshu shukla <anshushuk...@gmail.com> wrote:

> Anyone who can give some highlight over HOW SPARK DOES *ORDERING OF
> BATCHES*.
>
> On Sat, Jul 11, 2015 at 9:19 AM, anshu shukla <anshushuk...@gmail.com>
> wrote:
>
>> Thanks Ayan,
>>
>> I was curious to know *how Spark does it*. Is there any *documentation*
>> where I can get the details about that? Will you please point me to some
>> detailed link etc.
>>
>> Maybe it does something like *transactional topologies in Storm* (
>> https://storm.apache.org/documentation/Transactional-topologies.html)
>>
>> On Sat, Jul 11, 2015 at 9:13 AM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> AFAIK, it is guaranteed that batch t+1 will not start processing until
>>> batch t is done.
>>>
>>> Ordering within batch - what do you mean by that?
>>> In essence, the (mini) batch will get distributed in partitions like a
>>> normal RDD, so rdd.zipWithIndex should give a way to order them by the
>>> time they are received.
>>>
>>> On Sat, Jul 11, 2015 at 12:50 PM, anshu shukla <anshushuk...@gmail.com>
>>> wrote:
>>>
>>>> Hey,
>>>>
>>>> Is there any *guarantee of fixed ordering among the batches/RDDs*?
>>>>
>>>> After searching a lot I found there is no ordering by default (from
>>>> the framework itself), not only *batch wise* but *also within
>>>> batches*. But I doubt whether there is any change from old Spark
>>>> versions to Spark 1.4 in this context.
>>>>
>>>> Any comments please!!
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Anshu Shukla
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>
>> --
>> Thanks & Regards,
>> Anshu Shukla
>
> --
> Thanks & Regards,
> Anshu Shukla
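The per-partition guarantee described for the direct Kafka stream can be pictured without Spark at all. Below is a minimal sketch in plain Python (the partition contents and batch size are made up for illustration): each Kafka topic partition maps one-to-one to the same-numbered RDD partition in every batch, so concatenating partition i across consecutive batch RDDs reproduces Kafka's per-partition order, even though there is no global order across partitions.

```python
# Hypothetical data: each Kafka partition holds records in offset order.
kafka_partitions = {
    0: ["a0", "a1", "a2", "a3"],
    1: ["b0", "b1", "b2", "b3"],
}

def batches(partitions, batch_size):
    """Split each Kafka partition into per-batch slices.

    Models the direct Kafka stream: partition i of every batch RDD
    maps to Kafka partition i."""
    n = max(len(recs) for recs in partitions.values())
    for start in range(0, n, batch_size):
        yield {i: recs[start:start + batch_size]
               for i, recs in partitions.items()}

# Concatenating partition i across consecutive batch RDDs
# reproduces Kafka partition i's original offset order.
seen = {i: [] for i in kafka_partitions}
for batch in batches(kafka_partitions, 2):
    for i, recs in batch.items():
        seen[i].extend(recs)

assert seen == kafka_partitions  # per-partition order preserved
```

Note what this does not give you: there is no ordering between partition 0 and partition 1, only within each partition across batches.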
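On the rdd.zipWithIndex suggestion from the thread: zipWithIndex numbers elements first by partition index and then by position within each partition, so the resulting index reflects the RDD's partition layout rather than wall-clock arrival time. A rough emulation in plain Python (the two-partition mini-batch is made-up data):

```python
def zip_with_index(partitions):
    """Emulate RDD.zipWithIndex: elements are numbered by partition
    order first, then by position within each partition."""
    out, idx = [], 0
    for part in partitions:
        for elem in part:
            out.append((elem, idx))
            idx += 1
    return out

rdd = [["x", "y"], ["z"]]  # a 2-partition mini-batch (illustrative data)
print(zip_with_index(rdd))  # → [('x', 0), ('y', 1), ('z', 2)]
```

So for a receiver-based stream the index gives a stable, repeatable order over the batch, but it only matches receive order to the extent that the block/partition layout does.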