I had a chance to talk to TD today at the Strata+Hadoop Conf in San Jose. We talked a bit about this after his presentation - the short answer is that Spark Streaming does not guarantee any sort of ordering, either within a batch or across batches. One would have to use updateStateByKey to collect the events and sort them based on some attribute of the event. But TD said message ordering has been a frequently requested feature recently and is getting on his radar.
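To make the updateStateByKey idea concrete, here is a minimal sketch in Python of what such a state function could look like. It follows the shape PySpark's updateStateByKey expects (new values for a key plus the previous state), but the event layout and the "seq" field are assumptions of mine for illustration, not something TD described.

```python
from operator import itemgetter


def collect_and_sort(new_events, buffered):
    """State update function: (new values for a key, previous state) -> new state.

    new_events: list of events that arrived for this key in the current batch.
    buffered:   the sorted list kept from earlier batches, or None the first time.
    Each event is assumed (hypothetically) to be a dict with a "seq" field that
    the producer assigned, e.g. a Kafka offset or a timestamp.
    """
    merged = (buffered or []) + list(new_events)
    # sorted() is stable, so events with equal "seq" keep their arrival order.
    return sorted(merged, key=itemgetter("seq"))


# Hypothetical wiring, assuming `pairs` is a DStream of (key, event) tuples
# and checkpointing is enabled on the StreamingContext:
#   ordered = pairs.updateStateByKey(collect_and_sort)
```

As written, the buffer grows without bound, so a real job would also need to emit and drop events once it decides nothing older can still arrive. That cutoff policy is the hard part and is not shown here.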
I went through the source code and there does not seem to be any architectural/design limitation that prevents supporting this. (JobScheduler and JobGenerator are good starting points to see how things work under the hood.) Overriding DStream#compute and using a StreamingListener looks like a simple way of ensuring ordered execution of batches within a stream. But this would be a partial solution, since ordering within a batch needs some more work that I don't fully understand yet.

Side note: my custom receiver polls the MetricsServlet once in a while to decide whether jobs are completing fast enough, and throttles or relaxes pushing data into the receivers based on the numbers the MetricsServlet provides. I had to do this because the out-of-the-box rate limiting is currently static and cannot adapt to the state of the cluster.

thnx
-neelesh

On Wed, Feb 18, 2015 at 4:13 PM, jay vyas <jayunit100.apa...@gmail.com> wrote:

> This is a *fantastic* question. The idea of how we identify individual
> things in multiple DStreams is worth looking at.
>
> The reason being that you can then fine-tune your streaming job based on
> the RDD identifiers (i.e. are the timestamps from the producer correlating
> closely to the order in which RDD elements are being produced?). If *NO*
> then you need to (1) dial up throughput on producer sources or else (2)
> increase cluster size so that Spark is capable of evenly handling load.
>
> You can't decide to do (1) or (2) unless you can track when the streaming
> elements are being converted to RDDs by Spark itself.
>
> On Wed, Feb 18, 2015 at 6:54 PM, Neelesh <neele...@gmail.com> wrote:
>
>> There does not seem to be a definitive answer on this. Every time I
>> google for message ordering, the only relevant thing that comes up is this -
>> http://samza.apache.org/learn/documentation/0.8/comparisons/spark-streaming.html
>>
>> With a Kafka receiver that pulls data from a single Kafka partition of a
>> Kafka topic, are individual messages in the microbatch in the same order as
>> in the Kafka partition? Are successive microbatches originating from a Kafka
>> partition executed in order?
>>
>> Thanks!
>
> --
> jay vyas