Kafka ordering is guaranteed on a per-partition basis.

The high-level consumer API, as used by the receiver-based Spark Kafka
streams prior to 1.3, will consume from multiple Kafka partitions, thus not
giving any ordering guarantees.

The experimental direct stream in 1.3 uses the "simple" consumer API, and
there is a 1:1 correspondence between Spark partitions and Kafka
partitions.  So you will get deterministic ordering, but only on a
per-partition basis.
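For reference, here's a minimal sketch of setting up the direct stream in
Scala (the broker list and topic name are placeholders for whatever your
cluster uses):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("direct-stream-ordering")
val ssc = new StreamingContext(conf, Seconds(5))

// placeholder broker list and topic
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("events")

// each Spark partition maps 1:1 to a Kafka partition, so within a
// partition you iterate in Kafka offset order
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    // deterministic per-partition order; no ordering across partitions
    iter.foreach { case (_, value) => println(value) }
  }
}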

On Thu, Feb 19, 2015 at 11:31 PM, Neelesh <neele...@gmail.com> wrote:

> I had a chance to talk to TD today at the Strata+Hadoop Conf in San Jose.
> We talked a bit about this after his presentation - the short answer is
> that Spark Streaming does not guarantee any sort of ordering (within
> batches or across batches).  One would have to use updateStateByKey to
> collect the events and sort them based on some attribute of the event
> (sketched below).  But TD said message ordering is a frequently requested
> feature lately and is getting on his radar.
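> A minimal sketch of that approach (the Event type and its ts field are
> made up for illustration; note the per-key state grows without bound
> unless you prune it, and updateStateByKey requires checkpointing):
>
> case class Event(key: String, ts: Long, payload: String)
>
> // events: DStream[Event]; accumulate per key, sorted by event timestamp
> val ordered = events.map(e => (e.key, e)).updateStateByKey[Seq[Event]] {
>   (newEvents: Seq[Event], state: Option[Seq[Event]]) =>
>     Some((state.getOrElse(Seq.empty) ++ newEvents).sortBy(_.ts))
> }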
>
> I went through the source code and there does not seem to be any
> architectural/design limitation to supporting this.  (JobScheduler and
> JobGenerator are a good starting point to see how stuff works under the
> hood.)  Overriding DStream#compute and using a StreamingListener looks
> like a simple way of ensuring ordered execution of batches within a
> stream.  But this would be a partial solution, since ordering within a
> batch needs some more work that I don't understand fully yet.
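> As a starting point, a StreamingListener will at least tell you when each
> batch finishes (a bare-bones sketch; gating the next batch on this is the
> part that needs the DStream#compute override):
>
> import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}
>
> class BatchOrderListener extends StreamingListener {
>   // invoked after all jobs of a batch complete
>   override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
>     println(s"batch ${batch.batchInfo.batchTime} completed")
>   }
> }
>
> ssc.addStreamingListener(new BatchOrderListener)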
>
> Side note: my custom receiver polls the MetricsServlet once in a while
> to decide whether jobs are getting done fast enough, and throttles or
> relaxes pushing data into receivers based on the numbers it provides. I
> had to do this because out-of-the-box rate limiting right now is static
> and cannot adapt to the state of the cluster.
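> Roughly like this (the servlet URL and the metric name below are specific
> to my setup, and the JSON handling is simplified to a regex; treat both
> as assumptions):
>
> import scala.io.Source
>
> // assumed driver UI endpoint where MetricsServlet serves JSON
> val metricsUrl = "http://driver-host:4040/metrics/json"
>
> def processingDelayMs(): Option[Long] = {
>   val json = Source.fromURL(metricsUrl).mkString
>   // crude extraction of a streaming gauge instead of a real JSON parser
>   """lastCompletedBatch_processingDelay":\s*\{\s*"value":\s*(\d+)""".r
>     .findFirstMatchIn(json).map(_.group(1).toLong)
> }
>
> // back off the receiver when batches take too long to process
> val shouldThrottle = processingDelayMs().exists(_ > 5000L)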
>
> thnx
> -neelesh
>
> On Wed, Feb 18, 2015 at 4:13 PM, jay vyas <jayunit100.apa...@gmail.com>
> wrote:
>
>> This is a *fantastic* question.  The idea of how we identify individual
>> things in multiple DStreams is worth looking at.
>>
>> The reason is that you can then fine-tune your streaming job based on
>> the RDD identifiers (i.e., do the timestamps from the producer correlate
>> closely with the order in which RDD elements are being produced?).  If
>> *NO*, then you need to (1) dial up throughput on producer sources, or
>> else (2) increase cluster size so that Spark is capable of evenly
>> handling the load.
>>
>> You can't decide to do (1) or (2) unless you can track when the
>> streaming elements are being converted to RDDs by Spark itself.
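>> One cheap way to get at that, as a sketch: tag each element with the
>> batch time via DStream#transform and compare it to the producer
>> timestamp (the Event type with a producerTs field is assumed):
>>
>> // stream: DStream[Event], where producerTs is stamped at the source
>> val lagMs = stream.transform { (rdd, batchTime) =>
>>   // batchTime is when Spark turned these elements into an RDD
>>   rdd.map(e => batchTime.milliseconds - e.producerTs)
>> }
>>
>> // consistently large or growing lags point at (1) or (2) above
>> lagMs.print()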
>>
>>
>>
>> On Wed, Feb 18, 2015 at 6:54 PM, Neelesh <neele...@gmail.com> wrote:
>>
>>> There does not seem to be a definitive answer on this. Every time I
>>> google for message ordering, the only relevant thing that comes up is
>>> this:
>>> http://samza.apache.org/learn/documentation/0.8/comparisons/spark-streaming.html
>>>
>>> With a Kafka receiver that pulls data from a single partition of a
>>> Kafka topic, are individual messages in the microbatch in the same
>>> order as in the Kafka partition? Are successive microbatches
>>> originating from a Kafka partition executed in order?
>>>
>>>
>>> Thanks!
>>>
>>>
>>
>>
>>
>> --
>> jay vyas
>>
>
>
