I had a chance to talk to TD today at the Strata+Hadoop Conf in San Jose.
We talked a bit about this after his presentation - the short answer is
that Spark Streaming does not guarantee any sort of ordering (within
batches or across batches).  One would have to use updateStateByKey to
collect the events and sort them based on some attribute of the event.
But TD said message ordering has been a frequently requested feature
lately and is getting on his radar.
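To sketch the updateStateByKey idea above: the update function below
accumulates events per key and keeps them sorted by an event-supplied
timestamp. This is only an illustration - the function itself is plain
Python in PySpark style, and the (timestamp, payload) event layout is an
assumption, not anything from the actual API.

```python
# Sketch of an update function for DStream.updateStateByKey: accumulate
# events per key and keep them ordered by the event's own timestamp.
# The (timestamp, payload) tuple layout is an assumption for illustration.

def update_sorted(new_events, state):
    # new_events: events for this key that arrived in the current batch
    # state: previously accumulated, sorted list (None on the first batch)
    buffered = (state or []) + list(new_events)
    # Sort by the event's own timestamp, not by arrival order
    buffered.sort(key=lambda ev: ev[0])
    return buffered
```

One would then wire it up with something like
stream.updateStateByKey(update_sorted), at the usual cost of keeping all
buffered events in state until they are drained.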

I went through the source code and there does not seem to be any
architectural/design limitation preventing support for this.
(JobScheduler and JobGenerator are a good starting point to see how
things work under the hood.)  Overriding DStream#compute and using a
StreamingListener looks like a simple way of ensuring ordered execution
of batches within a stream. But this would be a partial solution, since
ordering within a batch needs some more work that I don't understand
fully yet.

Side note: my custom receiver polls the MetricsServlet once in a while to
decide whether jobs are getting done fast enough, and throttles/relaxes
pushing data into the receivers based on the numbers it provides. I had to
do this because out-of-the-box rate limiting is currently static and
cannot adapt to the state of the cluster.
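A minimal sketch of that feedback loop, assuming the MetricsServlet is
exposed as JSON on the driver - the URL, and the simple back-off policy
below, are assumptions for illustration, not the actual implementation:

```python
import json
import urllib.request

# Assumed location of the MetricsServlet JSON endpoint on the driver
METRICS_URL = "http://driver-host:4040/metrics/json"

def fetch_metrics(url=METRICS_URL):
    """Poll the MetricsServlet and return its parsed JSON payload."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def next_rate(current_rate, processing_time_ms, batch_interval_ms,
              step=0.2, floor=100):
    """Pick the next ingestion rate (events/sec) for the receiver.

    Throttle when batches take longer to process than the batch interval
    (the cluster is falling behind); relax when there is headroom.
    """
    if processing_time_ms > batch_interval_ms:
        return max(floor, int(current_rate * (1 - step)))
    return int(current_rate * (1 + step))
```

The receiver would periodically call fetch_metrics, pull the batch
processing time out of the payload, and feed it to next_rate to decide
how fast to push data in.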

Thanks,
-neelesh

On Wed, Feb 18, 2015 at 4:13 PM, jay vyas <jayunit100.apa...@gmail.com>
wrote:

> This is a *fantastic* question.  The idea of how we identify individual
> things in multiple DStreams is worth looking at.
>
> The reason being that you can then fine-tune your streaming job based on
> the RDD identifiers (i.e., are the timestamps from the producer
> correlating closely with the order in which RDD elements are being
> produced?).  If *NO*, then you need to either (1) dial up throughput on
> the producer sources or (2) increase the cluster size so that Spark is
> capable of evenly handling the load.
>
> You can't decide to do (1) or (2) unless you can track when the
> streaming elements are being converted to RDDs by Spark itself.
>
>
>
> On Wed, Feb 18, 2015 at 6:54 PM, Neelesh <neele...@gmail.com> wrote:
>
>> There does not seem to be a definitive answer on this. Every time I
>> google for message ordering, the only relevant thing that comes up is
>> this:
>> http://samza.apache.org/learn/documentation/0.8/comparisons/spark-streaming.html
>>
>> With a Kafka receiver that pulls data from a single partition of a
>> Kafka topic, are individual messages in the microbatch in the same
>> order as in the Kafka partition? Are successive microbatches
>> originating from a Kafka partition executed in order?
>>
>>
>> Thanks!
>>
>>
>
>
>
> --
> jay vyas
>
